Part of the effort to standardize failover procedures for Collab services.
- Figure out how many hosts we want running at a given time (2 or 3)
- Schedule a meeting with Antoine and Tyler / check for dependencies on the RelEng side
- Is the replica upgraded to the latest version?
- Test how long it’ll take to set up a host from scratch
- Test a failover
- Automate failover: T260666: Create a cookbook to automate gerrit's switchover
- Test the automation
- Ensure Gerrit read-only plugin is available on all instances
- Simplify hiera lookup: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1129920
- Dry run tests
- Prepare next switchover patches
- Prepare nftables throttling (T387833#11057848)
- Dry run switchover
- Schedule
- Run
- Re-enable backups (T406762: gerrit2003 is trying to backup incrementally 3.5 million files every hour, clogging backups and filling in available disk space)
- Ensure user migration is done (T338470: Rename gerrit2 unix user to gerrit and assign a fixed uid)
- Update the Wikitech process with the new steps and nuances
- Reimage gerrit1003.wikimedia.org + re-enable replication
- Reimage gerrit2002.wikimedia.org + re-enable replication
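
For the "Is the replica upgraded to the latest version?" item, a small helper along these lines might be useful. It is only a sketch, not part of the switchover cookbook: it relies on Gerrit's REST endpoint `GET /config/server/version` and on the `)]}'` prefix Gerrit puts in front of JSON responses. The function names, the base URLs you would pass in, and whether each replica is reachable over HTTP at all are assumptions.

```python
import json
import urllib.request

# Gerrit prepends this prefix to JSON responses as XSSI protection.
XSSI_PREFIX = ")]}'"


def parse_gerrit_json(body: str):
    """Strip Gerrit's XSSI prefix (if present) and decode the JSON payload."""
    if body.startswith(XSSI_PREFIX):
        body = body[len(XSSI_PREFIX):]
    return json.loads(body)


def fetch_version(base_url: str, timeout: float = 10.0) -> str:
    """Return the version string reported by GET <base_url>/config/server/version.

    base_url is an assumption supplied by the caller; it must include any
    path prefix the instance is served under.
    """
    url = f"{base_url}/config/server/version"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_gerrit_json(resp.read().decode("utf-8"))


def outdated_hosts(versions: dict[str, str], primary: str) -> list[str]:
    """List hosts whose reported version differs from the primary's."""
    expected = versions[primary]
    return sorted(host for host, ver in versions.items() if ver != expected)
```

Usage would be to call `fetch_version` once per host, collect the results in a dict keyed by hostname, and pass that dict to `outdated_hosts` with the current primary; an empty list means every host reports the same version as the primary.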