Gerrit failover process
Status: Open, In Progress; Priority: High; Visibility: Public

Description

A part of the effort to standardize failover procedures for Collab services.

Details

Related Changes in Gerrit:
Repo                        Branch                  Lines +/-
operations/puppet           production              +5 -2
operations/cookbooks        master                  +0 -8
operations/cookbooks        master                  +6 -4
operations/cookbooks        master                  +12 -7
operations/puppet           production              +2 -2
operations/cookbooks        master                  +5 -3
operations/cookbooks        master                  +1 -1
operations/puppet           production              +0 -2
operations/dns              master                  +8 -8
operations/puppet           production              +70 -33
operations/cookbooks        master                  +12 -6
operations/cookbooks        master                  +4 -4
operations/cookbooks        master                  +7 -43
operations/cookbooks        master                  +144 -0
operations/dns              master                  +8 -8
operations/puppet           production              +70 -33
operations/puppet           production              +70 -33
operations/software/gerrit  deploy/wmf/stable-3.10  +114 -0
operations/software/gerrit  deploy/wmf/stable-3.10  +4 -4
operations/software/gerrit  deploy/wmf/stable-3.10  +5 -3
operations/cookbooks        master                  +9 -10
operations/cookbooks        master                  +98 -97
operations/dns              master                  +8 -8
operations/puppet           production              +5 -3
operations/cookbooks        master                  +5 -3
operations/cookbooks        master                  +24 -9
operations/puppet           production              +2 -2
operations/puppet           production              +46 -30
operations/puppet           production              +19 -0
operations/puppet           production              +1 -0
operations/puppet           production              +1 -6
operations/puppet           production              +19 -0
operations/puppet           production              +5 -3
operations/dns              master                  +8 -3
operations/cookbooks        master                  +64 -218
operations/puppet           production              +18 -21
operations/puppet           production              +25 -0
operations/puppet           production              +1 -1
operations/puppet           production              +11 -2
operations/software/gerrit  deploy/wmf/stable-3.10  +3 -0
operations/cookbooks        master                  +1 -1
operations/cookbooks        master                  +52 -150
operations/cookbooks        master                  +304 -0
operations/software/gerrit  wmf/stable-3.10         +4 -0
operations/software/gerrit  wmf/stable-3.10         +4 -0
operations/software/gerrit  deploy/wmf/stable-3.10  +0 -0
operations/puppet           production              +3 -2
operations/puppet           production              +1 -1
operations/puppet           production              +16 -8
operations/puppet           production              +1 -0
operations/puppet           production              +5 -3
operations/dns              master                  +8 -8
operations/puppet           production              +3 -3
operations/dns              master                  +8 -8

Related Objects

Status       Assigned
Open         None
Open         None
Open         None
Open         None
Open         None
Open         None
Open         None
Open         None
Resolved     None
In Progress  ABran-WMF
Open         None
Resolved     ABran-WMF
Resolved     ABran-WMF
In Progress  ABran-WMF
Resolved     Dzahn
Resolved     MatthewVernon
Resolved     LSobanski
Resolved     ABran-WMF
Open         ABran-WMF
Resolved     LSobanski
Resolved     ABran-WMF
Open         None

Event Timeline

In T400971 it was noticed that nftables connection tracking/metering might behave differently when switching from bullseye to bookworm, which ships a newer nftables version. So before switching the hosts we should make sure reasonable thresholds are configured. Alternatively, we can closely monitor the denylist after the switchover.

Change #1178172 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: add spare fqdn to apache vhost

https://gerrit.wikimedia.org/r/1178172

Change #1178172 merged by Arnaudb:

[operations/puppet@production] gerrit: add spare fqdn to apache vhost

https://gerrit.wikimedia.org/r/1178172

> In T400971 it was noticed that nftables connection tracking/metering might behave differently when switching from bullseye to bookworm, which ships a newer nftables version. So before switching the hosts we should make sure reasonable thresholds are configured. Alternatively, we can closely monitor the denylist after the switchover.

Thanks for highlighting this blocking issue, @Jelto. With T400971 (Troubleshoot GitLab nftables throttling after switchover) we managed to convert the throttling logic into something that both nftables versions interpret correctly.

Change #1170433 merged by Dzahn:

[operations/puppet@production] gerrit: replace host names in replica config with variables

https://gerrit.wikimedia.org/r/1170433

Change #1188351 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: Switchover gerrit1003 → gerrit2003

https://gerrit.wikimedia.org/r/1188351

Change #1172625 abandoned by Arnaudb:

[operations/puppet@production] gerrit: Switchover gerrit1003 → gerrit2003

Reason:

replaced by 1188351

https://gerrit.wikimedia.org/r/1172625

Change #1191431 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/cookbooks@master] gerrit: bugfixes on failover

https://gerrit.wikimedia.org/r/1191431

Change #1191431 merged by jenkins-bot:

[operations/cookbooks@master] gerrit: bugfixes on failover

https://gerrit.wikimedia.org/r/1191431

The dry-run switchover output looks alright: {P83480}

I've added the entry to the deployment calendar: the next switchover is scheduled for next Monday (Oct. 6).

Change #1193017 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/software/gerrit@deploy/wmf/stable-3.10] Add a banner for a Gerrit switch over maintenance

https://gerrit.wikimedia.org/r/1193017

On top of your wikitech-l announcement, I have proposed a change to add a banner at the top of the Gerrit web UI that will look like this:

[Image: gerrit_switch_over_20251006.png (76×952 px, 19 KB)]

Follows up on https://gerrit.wikimedia.org/r/c/operations/software/gerrit/+/1193017

Change #1193017 merged by jenkins-bot:

[operations/software/gerrit@deploy/wmf/stable-3.10] Add a banner for a Gerrit switch over maintenance

https://gerrit.wikimedia.org/r/1193017

Mentioned in SAL (#wikimedia-operations) [2025-10-02T08:35:34Z] <hashar@deploy2002> Started deploy [gerrit/gerrit@3ef5714]: Add a banner for a Gerrit switch over maintenance - T387833

Mentioned in SAL (#wikimedia-operations) [2025-10-02T08:35:40Z] <hashar@deploy2002> deploy aborted: Add a banner for a Gerrit switch over maintenance - T387833 (duration: 00m 00s)

Mentioned in SAL (#wikimedia-operations) [2025-10-02T08:35:45Z] <hashar@deploy2002> Started deploy [gerrit/gerrit@3ef5714]: Add a banner for a Gerrit switch over maintenance - T387833

Mentioned in SAL (#wikimedia-operations) [2025-10-02T08:35:52Z] <hashar@deploy2002> Finished deploy [gerrit/gerrit@3ef5714]: Add a banner for a Gerrit switch over maintenance - T387833 (duration: 00m 12s)

Change #1193051 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/cookbooks@master] gerrit: typo on --systemd arg

https://gerrit.wikimedia.org/r/1193051

Change #1193051 merged by jenkins-bot:

[operations/cookbooks@master] gerrit: typo on --systemd arg

https://gerrit.wikimedia.org/r/1193051

Change #1193082 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/dns@master] gerrit: switchover from gerrit1003 to gerrit2003

https://gerrit.wikimedia.org/r/1193082

Change #1193590 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/cookbooks@master] gerrit: add a local backup cookbook

https://gerrit.wikimedia.org/r/1193590

Change #1193599 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/cookbooks@master] gerrit: remove localbackup logic from failover

https://gerrit.wikimedia.org/r/1193599

Change #1193082 merged by Arnaudb:

[operations/dns@master] gerrit: switchover from gerrit1003 to gerrit2003

https://gerrit.wikimedia.org/r/1193082

Change #1188351 merged by Arnaudb:

[operations/puppet@production] gerrit: Switchover gerrit1003 → gerrit2003

https://gerrit.wikimedia.org/r/1188351

I tried to extract the error of the failed cookbook execution from the cumin log:

2025-10-06 12:29:12,789 arnaudb 1516308 [WARNING cookbooks.sre.gerrit.failover:337 in sync_files] There will be a sync retry, expect no more than 10 of these.
2025-10-06 12:29:12,790 arnaudb 1516308 [DEBUG spicerack.remote:750 in _execute] Executing commands ['/usr/bin/rsync -avpPz --stats --delete /var/lib/gerrit/review_site  rsync://gerrit2003.wikimedia.org/gerrit-var-lib/'] on 1 hosts: gerrit1003.wikimedia.org
2025-10-06 12:29:12,791 arnaudb 1516308 [INFO cumin.transports.clustershell.ClusterShellWorker:78 in execute] Executing commands [cumin.transports.Command('/usr/bin/rsync -avpPz --stats --delete /var/lib/gerrit/review_site  rsync://gerrit2003.wikimedia.org/gerrit-var-lib/')] on '1' hosts: gerrit1003.wikimedia.org
2025-10-06 12:29:12,798 arnaudb 1516308 [DEBUG cumin.transports.clustershell.SyncEventHandler:590 in ev_pickup] node=gerrit1003.wikimedia.org, command='/usr/bin/rsync -avpPz --stats --delete /var/lib/gerrit/review_site  rsync://gerrit2003.wikimedia.org/gerrit-var-lib/'
2025-10-06 12:29:14,019 arnaudb 1516308 [DEBUG cumin.transports.clustershell.SyncEventHandler:783 in ev_hup] node=gerrit1003.wikimedia.org, rc=23, command='/usr/bin/rsync -avpPz --stats --delete /var/lib/gerrit/review_site  rsync://gerrit2003.wikimedia.org/gerrit-var-lib/'
2025-10-06 12:29:14,019 arnaudb 1516308 [INFO cumin.transports.clustershell.SyncEventHandler:853 in ev_timer] Completed command '/usr/bin/rsync -avpPz --stats --delete /var/lib/gerrit/review_site  rsync://gerrit2003.wikimedia.org/gerrit-var-lib/'
2025-10-06 12:29:14,020 arnaudb 1516308 [ERROR spicerack._menu:292 in _run] Exception raised while executing cookbook sre.gerrit.failover:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 265, in _run
    raw_ret = runner.run()
              ^^^^^^^^^^^^
  File "/srv/deployment/spicerack/cookbooks/sre/gerrit/failover.py", line 250, in run
    self.sync_files(idempotent=True, all_dirs=True)
  File "/usr/lib/python3/dist-packages/wmflib/decorators.py", line 231, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/srv/deployment/spicerack/cookbooks/sre/gerrit/failover.py", line 363, in sync_files
    self.switch_from_host.run_sync(
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 556, in run_sync
    return self._execute(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 763, in _execute
    raise RemoteExecutionError(ret, "Cumin execution failed", worker.get_results())
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)
2025-10-06 12:29:14,027 arnaudb 1516308 [DEBUG spicerack.locking:234 in release] Releasing lock for key sre.gerrit.failover with ID ddcb30ae-82c5-462c-a63b-14f50d593d0f
2025-10-06 12:29:14,027 arnaudb 1516308 [DEBUG etcd.lock:23 in __init__] Initiating lock for /spicerack/locks/etcd with uuid 42039cefcf5449208340942e1a7afa7e
2025-10-06 12:29:14,027 arnaudb 1516308 [DEBUG etcd.client:582 in read] Issuing read for key /spicerack/locks/etcd with args {'recursive': True}
2025-10-06 12:29:14,032 arnaudb 1516308 [DEBUG etcd.lock:67 in acquire] Lock not found, writing it to /spicerack/locks/etcd
2025-10-06 12:29:14,032 arnaudb 1516308 [DEBUG etcd.client:471 in write] Writing 42039cefcf5449208340942e1a7afa7e to key /spicerack/locks/etcd ttl=15 dir=False append=True
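
The `rc=23` in the `ev_hup` line of the log above is rsync's own exit status, while `exit_code=2` is what the Cumin execution reports. As a reading aid, here is a minimal sketch that pulls the return code out of such a log line and maps it to the meaning given in the rsync man page. The names `RSYNC_EXIT_CODES` and `explain_rc` are hypothetical, and only a subset of the documented codes is listed.

```python
import re

# Subset of the exit values documented in the rsync(1) man page.
RSYNC_EXIT_CODES = {
    0: "Success",
    1: "Syntax or usage error",
    12: "Error in rsync protocol data stream",
    23: "Partial transfer due to error",
    24: "Partial transfer due to vanished source files",
    30: "Timeout in data send/receive",
}

# Matches the "rc=<n>" field that appears in the ev_hup debug lines.
RC_PATTERN = re.compile(r"\brc=(\d+)\b")


def explain_rc(log_line: str) -> str:
    """Return a human-readable explanation of the rc= field in a log line."""
    match = RC_PATTERN.search(log_line)
    if match is None:
        return "no return code found"
    code = int(match.group(1))
    meaning = RSYNC_EXIT_CODES.get(code, "not in this table, see rsync(1)")
    return f"rsync exited with {code}: {meaning}"
```

Applied to the `ev_hup` line above, this returns "rsync exited with 23: Partial transfer due to error", which fits a transfer where some of the source could not be read.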

Thanks for that, @Jelto. I'll try to reproduce the error in a controlled environment.

Change #1193860 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/cookbooks@master] gerrit: fix ownership at rsync time

https://gerrit.wikimedia.org/r/1193860

Change #1193846 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/dns@master] Revert^2 "gerrit: switchover from gerrit1003 to gerrit2003"

https://gerrit.wikimedia.org/r/1193846

Change #1193845 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] Revert^2 "gerrit: Switchover gerrit1003 → gerrit2003"

https://gerrit.wikimedia.org/r/1193845

Change #1193865 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/cookbooks@master] gerrit: test 1193860

https://gerrit.wikimedia.org/r/1193865

Change #1193865 abandoned by Arnaudb:

[operations/cookbooks@master] gerrit: test 1193860

Reason:

test OK

https://gerrit.wikimedia.org/r/1193865

DRY-RUN: Executing commands ['/usr/bin/rsync -avpPz --stats --delete /var/lib/gerrit2/review_site  rsync://gerrit2003.wikimedia.org/gerrit-var-lib/ --no-o --no-g --chown=gerrit:gerrit '] on 1 hosts: gerrit1003.wikimedia.org
DRY-RUN: Releasing lock for key sre.gerrit.failover with ID abe63737-d8ed-498c-af9d-f71d5fe4d64c

I initially thought the issue came from the ownership of the target directory, but it was a typo in the source path used for rsync (/var/lib/gerrit/review_site instead of /var/lib/gerrit2/review_site). The fixed rsync has been tested in operations/cookbooks/+/1193865 and the fix has been backported to operations/cookbooks/+/1193860.
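
A minimal sketch of how the command from the dry-run output could be assembled together with a guard that fails early when the source directory is missing. This is not the cookbook's actual code: `build_sync_commands` and its parameters are hypothetical names, and the assumption is that both returned commands would be run, in order, on the source host. The rsync flags are copied from the dry-run line.

```python
import shlex


def build_sync_commands(source_dir: str, target_host: str, module: str) -> list[str]:
    """Return a guard command and the rsync command, as shell strings.

    Hypothetical helper: the guard makes a wrong source path fail with a
    clear "test -d" error before rsync is started.
    """
    guard = f"test -d {shlex.quote(source_dir)}"
    rsync_args = [
        "/usr/bin/rsync", "-avpPz", "--stats", "--delete",
        source_dir,
        f"rsync://{target_host}/{module}/",
        # Do not carry owner/group over from the source; force gerrit:gerrit.
        "--no-o", "--no-g", "--chown=gerrit:gerrit",
    ]
    rsync = " ".join(shlex.quote(arg) for arg in rsync_args)
    return [guard, rsync]
```

Quoting every argument with `shlex.quote` keeps the command safe if a path ever contains spaces, and leaves plain paths such as the ones in this task unchanged.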

Mentioned in SAL (#wikimedia-operations) [2025-10-07T08:37:55Z] <hashar> Stopped Gerrit on gerrit2003, deleted /srv/gerrit/git/* and restarted a full replication due to bad files ownership # T387833
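
The "bad files ownership" mentioned in the SAL entry above is something that can be checked for directly. The following is only a sketch, under the assumption that every entry below the repository tree should belong to one single user; `find_wrong_owner` is a hypothetical helper and not part of any existing tooling.

```python
import os
from typing import Iterator


def find_wrong_owner(root: str, expected_uid: int) -> Iterator[str]:
    """Yield paths below root whose owner differs from expected_uid."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat checks a symlink itself rather than the file it points to.
            if os.lstat(path).st_uid != expected_uid:
                yield path
```

On the host it could be called as `find_wrong_owner("/srv/gerrit/git", pwd.getpwnam("gerrit").pw_uid)`, assuming the service account is named `gerrit` as the `--chown=gerrit:gerrit` flag suggests.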

Change #1193860 merged by jenkins-bot:

[operations/cookbooks@master] gerrit: fix typo in source path

https://gerrit.wikimedia.org/r/1193860

Change #1194220 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/software/gerrit@deploy/wmf/stable-3.10] Disable motd banner: maintenance window has closed

https://gerrit.wikimedia.org/r/1194220

Change #1194220 merged by jenkins-bot:

[operations/software/gerrit@deploy/wmf/stable-3.10] Disable motd banner: maintenance window has closed

https://gerrit.wikimedia.org/r/1194220

Mentioned in SAL (#wikimedia-operations) [2025-10-07T15:03:16Z] <hashar@deploy2002> Started deploy [gerrit/gerrit@21d2848]: Disable motd banner: maintenance window has closed - T387833

Mentioned in SAL (#wikimedia-operations) [2025-10-07T15:03:37Z] <hashar@deploy2002> Finished deploy [gerrit/gerrit@21d2848]: Disable motd banner: maintenance window has closed - T387833 (duration: 00m 30s)

Change #1194225 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/software/gerrit@deploy/wmf/stable-3.10] Disable component rather than motd plugin

https://gerrit.wikimedia.org/r/1194225

Change #1194225 merged by jenkins-bot:

[operations/software/gerrit@deploy/wmf/stable-3.10] Disable component rather than motd plugin

https://gerrit.wikimedia.org/r/1194225

Change #1193845 merged by Arnaudb:

[operations/puppet@production] Revert^2 "gerrit: Switchover gerrit1003 → gerrit2003"

https://gerrit.wikimedia.org/r/1193845

Change #1193846 merged by Arnaudb:

[operations/dns@master] Revert^2 "gerrit: switchover from gerrit1003 to gerrit2003"

https://gerrit.wikimedia.org/r/1193846

Change #1194932 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/dns@master] Revert^4 "gerrit: switchover from gerrit1003 to gerrit2003"

https://gerrit.wikimedia.org/r/1194932

Change #1194931 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] Revert^4 "gerrit: Switchover gerrit1003 → gerrit2003"

https://gerrit.wikimedia.org/r/1194931

Change #1194949 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/cookbooks@master] gerrit: local backup on source server only

https://gerrit.wikimedia.org/r/1194949

Things are looking better now:

arnaudb@gerrit2003:git $ fd | wc -l
236218
arnaudb@gerrit2003:git $ pwd
/srv/gerrit/git
arnaudb@gerrit2003:git $ ls -l /srv/backup/
total 0

Compared with the previous situation:

gerrit 2003$ find /srv/gerrit/git |wc -l
5037253

That is 5 million files.
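
The two counts above come from different tools: `fd` skips hidden and ignored entries by default, whereas `find` lists everything, so the numbers are not strictly like-for-like. A minimal sketch for counting entries the same way on any host follows; `count_entries` is a hypothetical helper name.

```python
import os


def count_entries(root: str) -> int:
    """Count files and directories below root, similar to `find root | wc -l`.

    The root itself is counted too, since find prints it as its first line.
    """
    total = 1  # the root directory itself
    for _dirpath, dirnames, filenames in os.walk(root):
        total += len(dirnames) + len(filenames)
    return total
```

Running the same function against `/srv/gerrit/git` before and after a cleanup would give directly comparable figures.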

[...]

 JobId  Level      Files    Bytes   Status   Finished        Name 
====================================================================
656136  Incr    3,745,079    54.98 G  OK       08-Oct-25 13:09 gerrit2003.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data

Change #1195432 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: re-enable backups on gerrit2003

https://gerrit.wikimedia.org/r/1195432

Change #1195437 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/cookbooks@master] gerrit: add dry run rsync

https://gerrit.wikimedia.org/r/1195437

Change #1194949 merged by jenkins-bot:

[operations/cookbooks@master] gerrit: local backup on source server only

https://gerrit.wikimedia.org/r/1194949

Change #1194932 merged by Arnaudb:

[operations/dns@master] Revert^4 "gerrit: switchover from gerrit1003 to gerrit2003"

https://gerrit.wikimedia.org/r/1194932

Change #1194931 merged by Arnaudb:

[operations/puppet@production] Revert^4 "gerrit: Switchover gerrit1003 → gerrit2003"

https://gerrit.wikimedia.org/r/1194931

Change #1196051 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/cookbooks@master] gerrit: typo fix in post_sync_validation

https://gerrit.wikimedia.org/r/1196051

ABran-WMF closed subtask Restricted Task as Resolved. (Tue, Oct 14, 12:43 PM)

Change #1196227 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/cookbooks@master] gerrit: ask the operator to merge puppet earlier

https://gerrit.wikimedia.org/r/1196227

Change #1195432 merged by Dzahn:

[operations/puppet@production] gerrit: re-enable backups on gerrit2003

https://gerrit.wikimedia.org/r/1195432

Change #1196051 merged by jenkins-bot:

[operations/cookbooks@master] gerrit: typo fix in post_sync_validation

https://gerrit.wikimedia.org/r/1196051

Change #1196629 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: disable gerrit service to enable backups

https://gerrit.wikimedia.org/r/1196629

Change #1196227 merged by jenkins-bot:

[operations/cookbooks@master] gerrit: ask the operator to merge puppet earlier

https://gerrit.wikimedia.org/r/1196227

Change #1196684 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/cookbooks@master] gerrit: rsync and chown fixes

https://gerrit.wikimedia.org/r/1196684

Change #1196694 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/cookbooks@master] gerrit: stop puppet across all instances

https://gerrit.wikimedia.org/r/1196694

Change #1196695 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/cookbooks@master] gerrit: stop stopping gerrit.service

https://gerrit.wikimedia.org/r/1196695

Change #1196629 merged by Dzahn:

[operations/puppet@production] gerrit: disable gerrit service to enable backups

https://gerrit.wikimedia.org/r/1196629

Here are the notes and commands from a past Gerrit failover:

https://phabricator.wikimedia.org/P47782

Here is how we did the DNS change without having to merge while Gerrit is down:

  1. Merge the DNS change that removes gerrit-new and switches the IP of gerrit.wikimedia.org, in the web UI of gerrit(-old).
  2. Run authdns-update on ns0.wikimedia.org and review the diff, but do NOT commit yet.
  3. Disable puppet, stop gerrit, do the rsync, run chmod -R ...
  4. Say "yes" to authdns-update and actually merge the DNS change that removes gerrit-new and switches the IP of gerrit.wikimedia.org.
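
The ordering constraint behind the steps above is that anything needing the Gerrit web UI (merging the DNS change) has to happen before Gerrit is stopped. A small sketch of that constraint as a checkable runbook is shown below. The `Step` type, its `needs_gerrit` and `stops_gerrit` flags, and `validate_order` are illustrative assumptions and not part of the cookbook.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    description: str
    needs_gerrit: bool = False  # step requires the Gerrit web UI to be up
    stops_gerrit: bool = False  # step takes Gerrit down


# The steps from the notes above, split where one line does several things.
RUNBOOK = [
    Step("merge the DNS change in the Gerrit web UI", needs_gerrit=True),
    Step("run authdns-update, review the diff, do not confirm yet"),
    Step("disable puppet"),
    Step("stop gerrit", stops_gerrit=True),
    Step("rsync data to the new host and fix permissions"),
    Step("answer yes to the pending authdns-update prompt"),
]


def validate_order(steps: List[Step]) -> None:
    """Raise ValueError if a step needing Gerrit comes after Gerrit is stopped."""
    gerrit_up = True
    for step in steps:
        if step.needs_gerrit and not gerrit_up:
            raise ValueError(f"Gerrit is already stopped: {step.description!r}")
        if step.stops_gerrit:
            gerrit_up = False
```

`validate_order(RUNBOOK)` passes for the order given in the notes; moving the merge step after "stop gerrit" makes it raise, which is the situation the notes were written to avoid.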

Change #1196792 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: unmask service & disable backup temporarily

https://gerrit.wikimedia.org/r/1196792

> Here is how we did the DNS change without having to merge while Gerrit is down:

Thanks for the dig! I tweaked the process and the cookbook to be closer to this; the puppet merge timing was inconsistent with that approach.

Change #1196684 merged by jenkins-bot:

[operations/cookbooks@master] gerrit: rsync and chown fixes

https://gerrit.wikimedia.org/r/1196684

Change #1196694 merged by jenkins-bot:

[operations/cookbooks@master] gerrit: stop puppet across all instances

https://gerrit.wikimedia.org/r/1196694

Change #1196695 merged by jenkins-bot:

[operations/cookbooks@master] gerrit: stop stopping gerrit.service

https://gerrit.wikimedia.org/r/1196695