GitLab will be switched from codfw to eqiad during the April/May 2023 Datacenter Switchback, one week after the actual switchover so that it does not block dependencies. This task tracks the failover of the GitLab production instance in codfw (`gitlab2002`) to eqiad (`gitlab1004`).
Docs: https://wikitech.wikimedia.org/wiki/GitLab/Failover
Time: 09:00 UTC, May 2nd 2023
Checklist:
**Preparations before downtime:**
[x] Prepare a change to set `profile::gitlab::service_name: 'gitlab.wikimedia.org'` on `gitlab1004`, set `gitlab1004` as `profile::gitlab::active_host`, and set `profile::gitlab::service_name: 'gitlab-replica-old.wikimedia.org'` on `gitlab2002` [operations/puppet/+/912881](https://gerrit.wikimedia.org/r/c/operations/puppet/+/912881)
[x] Prepare a change to point the DNS entry for `gitlab.wikimedia.org` to `gitlab1004` and the entry for `gitlab-replica-old.wikimedia.org` to `gitlab2002` [operations/dns/+/912972](https://gerrit.wikimedia.org/r/c/operations/dns/+/912972)
[x] Apply [gitlab-settings](https://gitlab.wikimedia.org/repos/releng/gitlab-settings) to `gitlab1004` and `gitlab2002`
[x] Announce the downtime a few days ahead on the ops and releng mailing lists and in a broadcast message
**Scheduled downtime:**
[x] Announce downtime in `#wikimedia-gitlab`
[ ] Start the GitLab failover cookbook on the cumin host with `cookbook sre.gitlab.failover --switch-from gitlab2002 --switch-to gitlab1004 -t T335504`
[ ] When prompted, merge the Puppet change prepared above [operations/puppet/+/912881](https://gerrit.wikimedia.org/r/c/operations/puppet/+/912881)
[ ] When prompted, merge the DNS change prepared above and run `authdns-update` on the DNS master, following [the DNS update instructions](https://wikitech.wikimedia.org/wiki/DNS#Changing_records_in_a_zonefile): [operations/dns/+/912972](https://gerrit.wikimedia.org/r/c/operations/dns/+/912972)
**Falling back to manual steps:**
If the cookbook cannot be used for any reason, follow these manual failover steps instead:
[ ] Announce the downtime in `#wikimedia-gitlab`
[ ] Pause all GitLab Runners using [gitlab-settings](https://gitlab.wikimedia.org/repos/releng/gitlab-settings): `./runners active | tee active.txt && ./runners pause < active.txt`
[ ] Downtime `gitlab2002` with `sudo cookbook sre.hosts.downtime -r "Running failover to gitlab1004 - T329931" -M 120 'gitlab2002.wikimedia.org'`
[ ] Disable Puppet on `gitlab2002` with `sudo disable-puppet "Running failover to gitlab1004 - T329931"`
[ ] Stop GitLab's web frontend on `gitlab2002` with `gitlab-ctl stop nginx` (the other GitLab services must keep running for the backup)
[ ] Stop the ssh-gitlab daemon on `gitlab2002` with `systemctl stop ssh-gitlab`
[ ] Create a **full** backup on `gitlab2002` with `/usr/bin/gitlab-backup create CRON=1 GZIP_RSYNCABLE="true" GITLAB_BACKUP_MAX_CONCURRENCY="4" GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY="1"`
[ ] Sync the backup to `gitlab1004`: on `gitlab2002`, run `/usr/bin/rsync -avp /srv/gitlab-backup/ rsync://gitlab1004.wikimedia.org/data-backup`
[ ] Merge the change to set `profile::gitlab::service_name: 'gitlab.wikimedia.org'` on `gitlab1004` and run Puppet [operations/puppet/+/912881](https://gerrit.wikimedia.org/r/c/operations/puppet/+/912881)
[ ] Trigger the restore: on **`gitlab1004`**, run `sudo systemctl start backup-restore.service` (to follow the logs, run `journalctl -f -u backup-restore.service`)
[ ] Merge the change to point the DNS entry for `gitlab.wikimedia.org` to `gitlab1004` and the entry for `gitlab-replica-old.wikimedia.org` to `gitlab2002` [operations/dns/+/912972](https://gerrit.wikimedia.org/r/c/operations/dns/+/912972)
[ ] Verify the installation on `gitlab1004`
[ ] Re-enable and run Puppet on `gitlab2002` with `sudo run-puppet-agent -e "Running failover to gitlab1004 - T329931"`
[ ] Start the ssh-gitlab daemon on `gitlab2002` with `systemctl start ssh-gitlab`
[ ] Unpause all GitLab Runners using [gitlab-settings](https://gitlab.wikimedia.org/repos/releng/gitlab-settings): `./runners unpause < active.txt`
[ ] Announce the end of the downtime
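The `gitlab2002`-side part of the manual fallback (disabling Puppet through syncing the backup) could be chained into a small script so the steps run in order and stop at the first failure. This is a minimal, hypothetical sketch, not part of the documented procedure and no substitute for the `sre.gitlab.failover` cookbook. The `run` helper, the `DRY_RUN` switch and the `failover_source_steps` function are illustrative names. The sketch assumes it runs on `gitlab2002` from a shell with the privileges the checklist implies. By default it only prints the commands. Pausing runners, setting downtime, the restore on `gitlab1004`, the Puppet and DNS merges, and the post-failover steps stay manual.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: the gitlab2002-side manual failover steps from the
# checklist, run in order and aborting on the first failing command.
# This is NOT the sre.gitlab.failover cookbook; prefer the cookbook.
set -euo pipefail

# Reason string copied verbatim from the checklist.
REASON="Running failover to gitlab1004 - T329931"
# Assumption: dry run by default; set DRY_RUN=0 to actually execute.
DRY_RUN="${DRY_RUN:-1}"

# Print the shell-quoted command in dry-run mode, otherwise execute it.
run() {
  if [[ "$DRY_RUN" == "1" ]]; then
    printf '+'
    printf ' %q' "$@"
    printf '\n'
  else
    "$@"
  fi
}

failover_source_steps() {
  # Keep Puppet from reverting manual changes during the failover.
  run sudo disable-puppet "$REASON"
  # Stop only nginx: users are cut off, the rest of GitLab stays up for the backup.
  run gitlab-ctl stop nginx
  run systemctl stop ssh-gitlab
  # Full backup with the same options as the checklist.
  run /usr/bin/gitlab-backup create CRON=1 GZIP_RSYNCABLE="true" \
    GITLAB_BACKUP_MAX_CONCURRENCY="4" GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY="1"
  # Push the backup to the data-backup rsync module on the new primary.
  run /usr/bin/rsync -avp /srv/gitlab-backup/ rsync://gitlab1004.wikimedia.org/data-backup
}

# Run the steps only when executed directly, not when sourced.
if [[ "${BASH_SOURCE[0]}" == "$0" ]]; then
  failover_source_steps
fi
```

Running the script as-is only prints the planned commands. Running it with `DRY_RUN=0` executes each step and stops at the first command that fails, which is the main advantage over pasting the commands one by one.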