Docs: https://wikitech.wikimedia.org/wiki/GitLab/Failover
Checklist:
Preparations before downtime:
- Prepare the required Puppet change
- Prepare the required DNS change
- Apply gitlab-settings to gitlab1004 and gitlab2002
- Announce the downtime a few days ahead on the ops/releng lists and via a broadcast message
- Make sure the daily backup and restore finished successfully on gitlab2002 and gitlab1004:
  - systemctl status full-backup.service
  - systemctl status rsync-data-backup-gitlab1003.wikimedia.org.service
  - systemctl status rsync-data-backup-gitlab1004.wikimedia.org.service
  - systemctl status backup-restore.service
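The four status checks above can be condensed into one pass. The following is only a sketch: the helper name `failed_units` and the `BACKUP_UNITS` variable are made up for illustration, and it assumes that the systemd `Result` property of each service reflects the outcome of its most recent run. A unit that has never run may also report `success`, so a manual look at `systemctl status` is still worthwhile. Not every unit necessarily exists on both hosts.

```shell
#!/bin/sh
# Units named in the checklist (list copied from the task, nothing added).
BACKUP_UNITS="full-backup.service
rsync-data-backup-gitlab1003.wikimedia.org.service
rsync-data-backup-gitlab1004.wikimedia.org.service
backup-restore.service"

# failed_units (hypothetical helper): print every unit whose last run
# did not end with Result=success; return non-zero if any was printed.
failed_units() {
    rc=0
    for unit in "$@"; do
        # "Result" is the systemd service property holding the last run's outcome.
        result="$(systemctl show --property=Result --value "$unit")"
        if [ "$result" != "success" ]; then
            echo "$unit"
            rc=1
        fi
    done
    return "$rc"
}
```

On each host, `failed_units $BACKUP_UNITS` (unquoted on purpose, so the list is split into separate arguments) prints nothing when all listed units report success.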
Scheduled downtime:
- Announce downtime in #wikimedia-gitlab
- Start the GitLab failover cookbook on the cumin host: cookbook sre.gitlab.failover --switch-from gitlab2002 --switch-to gitlab1004 -t T400252
- When prompted, merge the Puppet change prepared above
- When prompted, merge the DNS change prepared above
- run authdns-update on the DNS master, following the DNS update instructions
- Update https://wikitech.wikimedia.org/wiki/GitLab to reflect the new reality
- Announce end of downtime
- copy missing packages
- Disable the restore on the old host (gitlab2002) so its data is kept in case anything is missing
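Because the cookbook invocation in the list above takes a source host, a target host and a task ID, a small wrapper can catch an obvious mix-up before anything runs. This is a sketch only: `failover_cmd` is a hypothetical helper that is not part of the cookbook, and it only prints the command line using the three options that appear in this checklist.

```shell
#!/bin/sh
# failover_cmd FROM TO TASK (hypothetical helper): validate the arguments
# and print the cookbook command line without executing it.
failover_cmd() {
    if [ "$#" -ne 3 ]; then
        echo "usage: failover_cmd FROM TO TASK" >&2
        return 2
    fi
    if [ "$1" = "$2" ]; then
        echo "source and target host must differ" >&2
        return 1
    fi
    case "$3" in
        T[0-9]*) ;;  # looks like a Phabricator task ID, e.g. T400252
        *) echo "task ID should look like T123456" >&2; return 1 ;;
    esac
    printf 'cookbook sre.gitlab.failover --switch-from %s --switch-to %s -t %s\n' "$1" "$2" "$3"
}
```

For this failover, `failover_cmd gitlab2002 gitlab1004 T400252` prints the same command as in the checklist, which can then be reviewed and run on the cumin host.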
A fallback checklist for the manual steps is available in T358567 or at https://wikitech.wikimedia.org/wiki/GitLab/Failover#During_failover_(manual_steps).