GitLab will also be switched from eqiad to codfw as part of the March 2023 Datacenter Switchover (T327920), one day before the actual switchover so that it does not block dependencies. This task tracks the failover of the GitLab production instance from eqiad (gitlab1004) to codfw (gitlab2002).
Docs: https://wikitech.wikimedia.org/wiki/GitLab/Failover
Last task: T307142#7971192 (checklist can be adapted for this year's failover)
Time: 10:00 am UTC 27th of February
Checklist:
Preparations before downtime:
- check gitlab1004 and gitlab2002 use the same ssh host keys for ssh-gitlab daemon
- prepare change to set profile::gitlab::service_name: 'gitlab.wikimedia.org' on gitlab2002 and to set gitlab2002 as profile::gitlab::active_host: operations/puppet/+/891863
- prepare change to point the DNS entry for gitlab.wikimedia.org to gitlab2002 and gitlab-replica-old.wikimedia.org to gitlab1004: operations/dns/+/891888
- apply gitlab-settings to gitlab1004 and gitlab2002
- announce downtime some days ahead on ops/releng list/broadcast message
- lower TTL for gitlab DNS records to 300s (also PTR) operations/dns/+/891886
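The host key check in the first preparation step could be sketched as follows. This is a minimal sketch, not the procedure from the failover docs: the function names `normalize_keys` and `same_host_keys` are hypothetical, and it assumes the keys of each ssh-gitlab daemon have first been collected with `ssh-keyscan` into one file per host.

```shell
#!/bin/sh
# Sketch: compare the SSH host keys collected from two hosts.
# Assumption: each input file holds ssh-keyscan output for one
# ssh-gitlab daemon, e.g. (address and port are placeholders):
#   ssh-keyscan -p <port> <ssh-gitlab address in eqiad> > eqiad.keys
#   ssh-keyscan -p <port> <ssh-gitlab address in codfw> > codfw.keys

# Print the sorted "keytype key" pairs of a scan file, dropping the
# hostname column and any comment lines.
normalize_keys() {
    grep -v '^#' "$1" | awk 'NF >= 3 { print $2, $3 }' | sort -u
}

# Return 0 only if both scan files contain the same non-empty key set.
same_host_keys() {
    shk_a="$(normalize_keys "$1")"
    shk_b="$(normalize_keys "$2")"
    [ -n "$shk_a" ] && [ "$shk_a" = "$shk_b" ]
}
```

Usage, once both files exist: `same_host_keys eqiad.keys codfw.keys && echo "host keys match"`. Comparing key material rather than whole lines keeps the check independent of the hostname each key was scanned under.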
Scheduled downtime:
- Announce downtime in #wikimedia-gitlab
- pause all GitLab Runners (gitlab-settings ./runners active | tee active.txt && ./runners pause < active.txt)
- downtime gitlab1004 with sudo cookbook sre.hosts.downtime -r "Running failover to gitlab2002 - T329931" -M 120 'gitlab2002.wikimedia.org'
- stop puppet on gitlab1004 with sudo disable-puppet "Running failover to gitlab2002 - T329931"
- stop GitLab on gitlab1004 with gitlab-ctl stop nginx
- stop ssh-gitlab daemon on gitlab1004 with systemctl stop ssh-gitlab
- create full backup on gitlab1004 with /usr/bin/gitlab-backup create CRON=1 STRATEGY=copy GZIP_RSYNCABLE="true" GITLAB_BACKUP_MAX_CONCURRENCY="4" GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY="1"
- sync backup, on gitlab1004 run /usr/bin/rsync -avp /srv/gitlab-backup/ rsync://gitlab2002.wikimedia.org/data-backup
- merge change to set profile::gitlab::service_name: 'gitlab.wikimedia.org' on gitlab2002 and run puppet operations/puppet/+/891863
- trigger restore, on gitlab2002 run sudo systemctl start backup-restore.service (for logs, run journalctl -f -u backup-restore.service)
- merge change to point the DNS entry for gitlab.wikimedia.org to gitlab2002 and gitlab-replica-old.wikimedia.org to gitlab1004: operations/dns/+/891888
- verify installation
- enable puppet on gitlab1004 with sudo run-puppet-agent -e "Running failover to gitlab2002 - T329931"
- start ssh-gitlab daemon on gitlab1004 with systemctl start ssh-gitlab
- unpause all GitLab Runners (gitlab-settings ./runners unpause < active.txt)
- announce end of downtime
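The "verify installation" step is not spelled out in this checklist. One possible shape for it is sketched below, under stated assumptions: `dig` and `curl` are available, the operator knows the address the service name should now resolve to, and GitLab's `/-/health` endpoint is reachable from where the check runs (it is often restricted to an allowlist, so this may have to run on the host itself). The function names are hypothetical.

```shell
#!/bin/sh
# Sketch: post-failover checks for DNS and basic HTTP health.
# All function names here are illustrative, not part of any existing tooling.

# Return 0 if any line on stdin (for example `dig +short` output)
# equals the expected target, ignoring a trailing dot.
answer_matches() {
    am_expected="${1%.}"
    while IFS= read -r am_line; do
        [ "${am_line%.}" = "$am_expected" ] && return 0
    done
    return 1
}

# Return 0 if the A record of the service name ($1) includes the
# expected address ($2), which the operator supplies.
dns_points_to() {
    dig +short "$1" A | answer_matches "$2"
}

# Return 0 if the given URL answers with a successful HTTP status.
# Assumption: the URL is a GitLab health endpoint such as
# https://gitlab.wikimedia.org/-/health
gitlab_healthy() {
    curl -fsS --max-time 10 "$1" > /dev/null
}
```

Usage, with the expected address left as a placeholder: `dns_points_to gitlab.wikimedia.org <expected address> && gitlab_healthy https://gitlab.wikimedia.org/-/health`. Because the DNS TTL was lowered to 300s beforehand, resolvers may keep returning the old address for a few minutes after the DNS change is merged.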