GitLab will be switched from eqiad to codfw as part of the September 2023 Datacenter Switchover (one week after the actual switchover, so as not to block dependencies). This task tracks the failover of the GitLab production instance in eqiad (gitlab1004) to codfw (gitlab2002).
Docs: https://wikitech.wikimedia.org/wiki/GitLab/Failover
Time: Thursday, 5th October 2023, 09:00 UTC
Checklist:
Preparations before downtime:
- prepare change to set profile::gitlab::service_name: 'gitlab.wikimedia.org' on gitlab2002, set gitlab2002 as profile::gitlab::active_host, and set profile::gitlab::service_name: 'gitlab-replica-old.wikimedia.org' on gitlab1004 (operations/puppet/+/963160/)
- Prepare change to point DNS entry for gitlab.wikimedia.org to gitlab2002, and gitlab-replica-old.wikimedia.org to gitlab1004 (operations/dns/+/963161)
- apply gitlab-settings to gitlab2002 and gitlab1004 (releng/gitlab-settings/-/merge_requests/46/diffs)
- announce downtime some days ahead on ops/releng list/broadcast message
- run a failover backup on the source host one day in advance with sudo /srv/gitlab-backup/gitlab-backup.sh failover
Scheduled downtime:
- Announce downtime in #wikimedia-gitlab
- Start gitlab failover cookbook on the cumin host with cookbook sre.gitlab.failover --switch-from gitlab1004 --switch-to gitlab2002 -t T345531
- When prompted, merge the puppet change prepared above
- When prompted, merge the DNS change prepared above and run authdns-update on the DNS master, following the DNS update instructions
- Update https://wikitech.wikimedia.org/wiki/GitLab to reflect the new reality
- Announce end of downtime
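
Before announcing the end of downtime, it may help to confirm that the DNS change is visible from a resolver. The following is a minimal sketch, not part of the documented procedure: resolves_to is a hypothetical helper, and EXPECTED_IP is a placeholder that has to be taken from the merged DNS change.

```shell
# resolves_to EXPECTED: read resolver answers (one per line) on stdin and
# succeed only if EXPECTED appears as an exact, whole-line match.
# Hypothetical helper for illustration; not part of any existing tooling.
resolves_to() {
    grep -Fxq -- "$1"
}

# Assumed usage (EXPECTED_IP is a placeholder for the address in the DNS change):
#   dig +short gitlab.wikimedia.org A | resolves_to "$EXPECTED_IP" && echo "DNS updated"
```

Matching whole lines with fixed strings avoids false positives such as one address being a prefix of another.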
Falling back to manual steps:
If, for some reason, the cookbook cannot be used, fail over with the following manual steps:
- Announce downtime in #wikimedia-gitlab
- pause all GitLab Runners (gitlab-settings ./runners active | tee active.txt && ./runners pause < active.txt)
- downtime gitlab1004 with sudo cookbook sre.hosts.downtime -r "Running failover to gitlab2002 - T329931" -M 120 'gitlab1004.wikimedia.org'
- stop puppet on gitlab1004 with sudo disable-puppet "Running failover to gitlab2002 - T329931"
- stop GitLab on gitlab1004 with gitlab-ctl stop nginx
- stop ssh-gitlab daemon on gitlab1004 with systemctl stop ssh-gitlab
- create full backup on gitlab1004 with /usr/bin/gitlab-backup create CRON=1 GZIP_RSYNCABLE="true" GITLAB_BACKUP_MAX_CONCURRENCY="4" GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY="1"
- sync backup, on gitlab1004 run /usr/bin/rsync -avp /srv/gitlab-backup/ rsync://gitlab2002.wikimedia.org/data-backup
- merge change to set profile::gitlab::service_name: 'gitlab.wikimedia.org' on gitlab2002 and run puppet
- trigger restore on gitlab2002 with sudo systemctl start backup-restore.service (for logs, run journalctl -f -u backup-restore.service)
- merge change to point DNS entry for gitlab.wikimedia.org to gitlab2002, and gitlab-replica-old.wikimedia.org to gitlab1004
- verify installation
- enable puppet on gitlab1004 with sudo run-puppet-agent -e "Running failover to gitlab2002 - T329931"
- start ssh-gitlab daemon on gitlab1004 with systemctl start ssh-gitlab
- unpause all GitLab Runners (gitlab-settings ./runners unpause < active.txt)
- announce end of downtime
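
The steps that run on gitlab1004 itself (disable puppet, stop services, create and sync the backup) can be rehearsed with a dry-run wrapper. This is only a sketch: DRY_RUN, run and source_host_steps are hypothetical names introduced here, while the commands are copied from the list above. By default it only prints what would be executed.

```shell
#!/bin/sh
# Dry-run sketch of the manual steps that run on gitlab1004.
# DRY_RUN, run() and source_host_steps() are hypothetical helpers.
set -eu

DRY_RUN="${DRY_RUN:-1}"   # 1 = print only (default); anything else = execute
REASON="Running failover to gitlab2002 - T329931"

# Print the command in dry-run mode, otherwise execute it.
# Note: the printed form does not preserve the original quoting.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Commands in the same order as in the manual checklist.
source_host_steps() {
    run sudo disable-puppet "$REASON"
    run gitlab-ctl stop nginx
    run systemctl stop ssh-gitlab
    run /usr/bin/gitlab-backup create CRON=1 GZIP_RSYNCABLE="true" \
        GITLAB_BACKUP_MAX_CONCURRENCY="4" GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY="1"
    run /usr/bin/rsync -avp /srv/gitlab-backup/ rsync://gitlab2002.wikimedia.org/data-backup
}

source_host_steps
```

Because set -e is active, a real run (DRY_RUN=0) stops at the first failing command, so the backup is not synced if its creation fails.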