The GitLab replica will be switched in preparation for the March 2024 Datacenter Switchover. This task tracks the failover of the GitLab replica from gitlab1004 to gitlab1003.
Docs: https://wikitech.wikimedia.org/wiki/GitLab/Failover
Time: 2024-03-11
Checklist:
Preparations before downtime:
- prepare change to set profile::gitlab::service_name: 'gitlab.wikimedia.org' on gitlab1003, set gitlab1003 as profile::gitlab::active_host, and set profile::gitlab::service_name: 'gitlab-replica-old.wikimedia.org' on gitlab1004 (<patch link>)
- Prepare change to point DNS entry for gitlab.wikimedia.org to gitlab1003, and gitlab-replica-old.wikimedia.org to gitlab1004 (<patch link>)
- apply gitlab-settings to gitlab1003 and gitlab1004 (<patch link>)
- announce downtime some days ahead on ops/releng list/broadcast message
- run a failover backup on the source host one day in advance with sudo /srv/gitlab-backup/gitlab-backup.sh failover
Scheduled downtime:
- Announce downtime in #wikimedia-gitlab
- Start gitlab failover cookbook on the cumin host with cookbook sre.gitlab.failover --switch-from gitlab1004 --switch-to gitlab1003 -t <ticket id>
- When prompted, merge the puppet change prepared above
- When prompted, merge the DNS change prepared above and run `authdns-update` on the DNS master, following the DNS update instructions
- ~Update https://wikitech.wikimedia.org/wiki/GitLab to reflect the new reality~
- ~Announce end of downtime~
Falling back to manual steps:
If the cookbook cannot be used for some reason, follow these manual failover steps instead:
- Announce downtime in #wikimedia-gitlab
- pause all GitLab Runners (gitlab-settings ./runners active | tee active.txt && ./runners pause < active.txt)
- downtime gitlab1004 with sudo cookbook sre.hosts.downtime -r "Running failover to gitlab1003 - <ticket id>" -M 120 'gitlab1004.wikimedia.org'
- stop puppet on gitlab1004 with sudo disable-puppet "Running failover to gitlab1003 - <ticket id>"
- stop GitLab on gitlab1004 with gitlab-ctl stop nginx
- stop ssh-gitlab daemon on gitlab1004 with systemctl stop ssh-gitlab
- create full backup on gitlab1004 with /usr/bin/gitlab-backup create CRON=1 GZIP_RSYNCABLE="true" GITLAB_BACKUP_MAX_CONCURRENCY="4" GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY="1"
- sync backup: on gitlab1004, run /usr/bin/rsync -avp /srv/gitlab-backup/ rsync://gitlab1003.wikimedia.org/data-backup
- merge change to set profile::gitlab::service_name: 'gitlab.wikimedia.org' on gitlab1003 and run puppet
- trigger restore on gitlab1003 with sudo systemctl start backup-restore.service (for logs, run journalctl -f -u backup-restore.service)
- merge change to point DNS entry for gitlab.wikimedia.org to gitlab1003 and gitlab-replica-old.wikimedia.org to gitlab1004
- verify installation
- enable puppet on gitlab1004 with sudo run-puppet-agent -e "Running failover to gitlab1003 - <ticket id>"
- start ssh-gitlab daemon on gitlab1004 with systemctl start ssh-gitlab
- unpause all GitLab Runners (gitlab-settings ./runners unpause < active.txt)
- announce end of downtime
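
The host-level commands from the manual steps above can be sketched as a dry-run helper that prints them in order with the hosts and ticket id filled in. This is only an illustration: print_failover_steps is a hypothetical function, not part of the cookbook or any existing tooling, and it executes nothing. The runner pause/unpause steps, the puppet and DNS merges, and the verification step are deliberately left out, since they need manual review.

```shell
#!/bin/sh
# Dry-run sketch (hypothetical helper): prints the manual failover commands
# from the checklist in order. Nothing is executed.
print_failover_steps() {
    from="$1"    # host being switched away from, e.g. gitlab1004
    to="$2"      # host being switched to, e.g. gitlab1003
    ticket="$3"  # ticket id used in the downtime and puppet messages
    reason="Running failover to ${to} - ${ticket}"

    # Unquoted heredoc: ${from}, ${to} and ${reason} are expanded,
    # while the literal quotes are printed as-is.
    cat <<EOF
sudo cookbook sre.hosts.downtime -r "${reason}" -M 120 '${from}.wikimedia.org'
sudo disable-puppet "${reason}"
gitlab-ctl stop nginx
systemctl stop ssh-gitlab
/usr/bin/gitlab-backup create CRON=1 GZIP_RSYNCABLE="true" GITLAB_BACKUP_MAX_CONCURRENCY="4" GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY="1"
/usr/bin/rsync -avp /srv/gitlab-backup/ rsync://${to}.wikimedia.org/data-backup
sudo systemctl start backup-restore.service
sudo run-puppet-agent -e "${reason}"
systemctl start ssh-gitlab
EOF
}

# Example: print_failover_steps gitlab1004 gitlab1003 '<ticket id>'
```

Printing the commands rather than running them keeps the sketch safe to paste anywhere. Each line still has to be run on the host named in the corresponding checklist item.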