GitLab will be switched back from codfw to eqiad as part of the April/May 2023 Datacenter Switchback (one week after the actual switchover, so that it does not block dependencies). This task tracks the failover of the GitLab production instance from codfw (gitlab2002) to eqiad (gitlab1004).
Docs: https://wikitech.wikimedia.org/wiki/GitLab/Failover
Time: 09:00 UTC, May 2nd 2023
Checklist:
Preparations before downtime:
- prepare change to set profile::gitlab::service_name: 'gitlab.wikimedia.org' on gitlab1004, set gitlab1004 as profile::gitlab::active_host:, and set profile::gitlab::service_name: 'gitlab-replica-old.wikimedia.org' on gitlab2002 operations/puppet/+/912881
- Prepare change to point DNS entry for gitlab.wikimedia.org to gitlab1004, and gitlab-replica-old.wikimedia.org to gitlab2002 operations/dns/+/912972
- apply gitlab-settings to gitlab1004 and gitlab2002
- announce downtime some days ahead on ops/releng list/broadcast message
Scheduled downtime:
- Announce downtime in #wikimedia-gitlab
- Start gitlab failover cookbook on the cumin host with cookbook sre.gitlab.failover --switch-from gitlab2002 --switch-to gitlab1004 -t T335504
- When prompted, merge the puppet change prepared above operations/puppet/+/912881
- When prompted, merge the DNS change prepared above and run authdns-update on the DNS master, following the DNS update instructions -- operations/dns/+/912972
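After authdns-update has run, it can be useful to confirm that the service name resolves as intended. The following is a minimal shell sketch, assuming dig is installed. The helper names resolve and resolves_to are hypothetical (not part of any existing tooling), and 192.0.2.10 is a documentation placeholder rather than the real service address; substitute the address set in operations/dns/+/912972.

```shell
# Hypothetical helper: print all A records for a name, one per line.
resolve() {
    dig +short A "$1"
}

# Hypothetical helper: succeed only if one of the A records for name $1
# is exactly the expected address $2.
resolves_to() {
    resolve "$1" | grep -Fxq "$2"
}

# Example call (placeholder address, replace with the real one):
#   resolves_to gitlab.wikimedia.org 192.0.2.10 && echo "DNS points at new host"
```

Resolvers may keep serving the old record until its TTL expires, so a failed check shortly after the update does not necessarily mean the change was not applied.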
Falling back to manual steps:
If the cookbook cannot be used for some reason, fail over manually with the following steps:
- Announce downtime in #wikimedia-gitlab
- pause all GitLab Runners (gitlab-settings ./runners active | tee active.txt && ./runners pause < active.txt)
- downtime gitlab2002 with sudo cookbook sre.hosts.downtime -r "Running failover to gitlab1004 - T329931" -M 120 'gitlab2002.wikimedia.org'
- stop puppet on gitlab2002 with sudo disable-puppet "Running failover to gitlab1004 - T329931"
- stop GitLab on gitlab2002 with gitlab-ctl stop nginx
- stop ssh-gitlab daemon on gitlab2002 with systemctl stop ssh-gitlab
- create full backup on gitlab2002 with /usr/bin/gitlab-backup create CRON=1 GZIP_RSYNCABLE="true" GITLAB_BACKUP_MAX_CONCURRENCY="4" GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY="1"
- sync backup, on gitlab2002 run /usr/bin/rsync -avp /srv/gitlab-backup/ rsync://gitlab1004.wikimedia.org/data-backup
- merge change to set profile::gitlab::service_name: 'gitlab.wikimedia.org' on gitlab1004 and run puppet operations/puppet/+/912881
- trigger restore: on gitlab1004, run sudo systemctl start backup-restore.service (for logs, run journalctl -f -u backup-restore.service)
- merge change to point DNS entry for gitlab.wikimedia.org to gitlab1004 and gitlab-replica-old.wikimedia.org to gitlab2002 operations/dns/+/912972
- verify installation
- enable puppet on gitlab2002 with sudo run-puppet-agent -e "Running failover to gitlab1004 - T329931"
- start ssh-gitlab daemon on gitlab2002 with systemctl start ssh-gitlab
- unpause all GitLab Runners (gitlab-settings ./runners unpause < active.txt)
- announce end of downtime
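The manual steps that run on gitlab2002 (disable puppet, stop nginx and ssh-gitlab, create the backup, sync it) could be rehearsed before the downtime window with a dry-run wrapper. This is only a sketch: the run helper, the failover_source_steps function, and the DRY_RUN variable are hypothetical names introduced for illustration, and the commands themselves are copied from the list above.

```shell
# Sketch only: prints each step unless DRY_RUN=0 is set explicitly.
DRY_RUN="${DRY_RUN:-1}"
REASON="Running failover to gitlab1004 - T329931"

# Hypothetical helper: echo the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf 'DRY-RUN: %s\n' "$*"
    else
        "$@"
    fi
}

# Steps performed on gitlab2002, in the order given in the checklist.
# Each step aborts the sequence if it fails.
failover_source_steps() {
    run sudo disable-puppet "$REASON" || return 1
    run gitlab-ctl stop nginx || return 1
    run systemctl stop ssh-gitlab || return 1
    run /usr/bin/gitlab-backup create CRON=1 GZIP_RSYNCABLE="true" \
        GITLAB_BACKUP_MAX_CONCURRENCY="4" \
        GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY="1" || return 1
    run /usr/bin/rsync -avp /srv/gitlab-backup/ \
        rsync://gitlab1004.wikimedia.org/data-backup || return 1
}
```

Calling failover_source_steps with the default DRY_RUN=1 only prints the five commands, which makes it easy to review the sequence before running it for real with DRY_RUN=0. The steps on the cumin host and on gitlab1004 are intentionally left out of this sketch.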