
Switchover gitlab (gitlab2002 -> gitlab1004)
Closed, Resolved (Public)

Description

GitLab will be switched from codfw to eqiad as part of the April/May 2023 Datacenter Switchback, one week after the actual switchover so that it does not block dependencies. This task tracks the failover of the GitLab production instance in codfw (gitlab2002) to eqiad (gitlab1004).

Docs: https://wikitech.wikimedia.org/wiki/GitLab/Failover
Time: 09:00 UTC, May 2nd 2023

Checklist:

Preparations before downtime:

  • Prepare change to set profile::gitlab::service_name: 'gitlab.wikimedia.org' on gitlab1004, set gitlab1004 as profile::gitlab::active_host, and set profile::gitlab::service_name: 'gitlab-replica-old.wikimedia.org' on gitlab2002: operations/puppet/+/912881
  • Prepare change to point DNS entry for gitlab.wikimedia.org to gitlab1004, and gitlab-replica-old.wikimedia.org to gitlab2002 operations/dns/+/912972
  • apply gitlab-settings to gitlab1004 and gitlab2002
  • announce downtime some days ahead on ops/releng list/broadcast message

Scheduled downtime:

  • Announce downtime in #wikimedia-gitlab
  • Start gitlab failover cookbook on the cumin host with cookbook sre.gitlab.failover --switch-from gitlab2002 --switch-to gitlab1004 -t T335504
  • When prompted, merge the puppet change prepared above operations/puppet/+/912881
  • When prompted, merge the DNS change prepared above and run authdns-update on the DNS master, following the DNS update instructions: operations/dns/+/912972
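After authdns-update has run, it can be useful to confirm that the service name resolves to the address you expect. The following is a minimal sketch, not part of the cookbook: the function name resolves_to and the RESOLVER override are hypothetical helpers, and the address in the usage note is a documentation placeholder, not the real service IP.

```shell
# resolves_to NAME EXPECTED_IP
# Succeeds if NAME currently resolves to EXPECTED_IP.
# RESOLVER is a hypothetical override so the check can be exercised offline;
# by default it uses `dig +short`, which prints one record per line.
resolves_to() {
    ${RESOLVER:-dig +short} "$1" | grep -qxF -- "$2"
}
```

Usage, assuming 192.0.2.10 stands in for the address published in the DNS change: resolves_to gitlab.wikimedia.org 192.0.2.10 && echo "DNS updated". Resolver caches may still hold the old record until its TTL expires, so a failure right after the update is not conclusive.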

Falling back to manual steps:

If the cookbook cannot be used for some reason, fail over manually with the following steps:

  • Announce downtime in #wikimedia-gitlab
  • pause all GitLab Runners (gitlab-settings ./runners active | tee active.txt && ./runners pause < active.txt)
  • downtime gitlab2002 with sudo cookbook sre.hosts.downtime -r "Running failover to gitlab1004 - T329931" -M 120 'gitlab2002.wikimedia.org'
  • stop puppet on gitlab2002 with sudo disable-puppet "Running failover to gitlab1004 - T329931"
  • stop GitLab on gitlab2002 with gitlab-ctl stop nginx
  • stop ssh-gitlab daemon on gitlab2002 with systemctl stop ssh-gitlab
  • create full backup on gitlab2002 with /usr/bin/gitlab-backup create CRON=1 GZIP_RSYNCABLE="true" GITLAB_BACKUP_MAX_CONCURRENCY="4" GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY="1"
  • sync backup, on gitlab2002 run /usr/bin/rsync -avp /srv/gitlab-backup/ rsync://gitlab1004.wikimedia.org/data-backup
  • merge change to set profile::gitlab::service_name: 'gitlab.wikimedia.org' on gitlab1004 and run puppet operations/puppet/+/912881
  • trigger restore: on gitlab1004, run sudo systemctl start backup-restore.service (for logs, run journalctl -f -u backup-restore.service)
  • merge change to point DNS entry for gitlab.wikimedia.org to gitlab1004 and gitlab-replica-old.wikimedia.org to gitlab2002 operations/dns/+/912972
  • verify installation
  • enable puppet on gitlab2002 with sudo run-puppet-agent -e "Running failover to gitlab1004 - T329931"
  • start ssh-gitlab daemon on gitlab2002 with systemctl start ssh-gitlab
  • unpause all GitLab Runners (gitlab-settings ./runners unpause < active.txt)
  • announce end of downtime
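The stop, backup, and sync steps that run on gitlab2002 can be rehearsed before the window. The following is a minimal sketch under stated assumptions: the run wrapper, the DRY_RUN variable, and the function name stop_and_backup_old_primary are hypothetical and not part of the documented procedure; the commands themselves are copied from the checklist above. By default nothing is executed and each command is only printed.

```shell
# run: print the command when DRY_RUN is 1 (the default), execute it otherwise.
# Hypothetical helper, not part of the failover tooling.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf 'DRY-RUN: %s\n' "$*"
    else
        "$@"
    fi
}

# Steps to run on gitlab2002, in the order given in the manual checklist.
stop_and_backup_old_primary() {
    run gitlab-ctl stop nginx
    run systemctl stop ssh-gitlab
    run /usr/bin/gitlab-backup create CRON=1 GZIP_RSYNCABLE=true \
        GITLAB_BACKUP_MAX_CONCURRENCY=4 GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY=1
    run /usr/bin/rsync -avp /srv/gitlab-backup/ rsync://gitlab1004.wikimedia.org/data-backup
}
```

Calling stop_and_backup_old_primary prints the four commands for review; setting DRY_RUN=0 would execute them, which should only be done during the announced downtime.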

Event Timeline

Change 912881 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] [gitlab/failover] Switch primary from codfw->eqiad

https://gerrit.wikimedia.org/r/912881

Change 912972 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/dns@master] [gitlab/failover] Swap DNS entries for gitlab

https://gerrit.wikimedia.org/r/912972

We abandoned the maintenance today because we hit this error when GitLab was put into maintenance mode: T333347

This was not an issue during the last maintenance window, since the extension which added this behaviour was deployed after the last DC switchover: T324149

My apologies for the roadblock. I think a down or missing GitLab endpoint should be handled gracefully with the above patch, which I can probably get deployed once reviewed.

Change 912881 merged by EoghanGaffney:

[operations/puppet@production] [gitlab/failover] Switch primary from codfw->eqiad

https://gerrit.wikimedia.org/r/912881

Change 912972 merged by EoghanGaffney:

[operations/dns@master] [gitlab/failover] Swap DNS entries for gitlab

https://gerrit.wikimedia.org/r/912972

The GitLab maintenance was completed successfully. We will keep this task open until tomorrow to monitor for issues.

I double-checked the timers on the old and new host, and we have the correct combination of backup and sync jobs on gitlab1004 and restores on gitlab1003 and gitlab2002:

prod:

gitlab1004:~$ systemctl list-timers | grep -E 'restore|backup|rsync'
Tue 2023-05-09 00:00:00 UTC 12h left            n/a                         n/a          config-backup.timer                                config-backup.service
Tue 2023-05-09 00:04:00 UTC 12h left            n/a                         n/a          full-backup.timer                                  full-backup.service
Tue 2023-05-09 01:00:00 UTC 13h left            n/a                         n/a          rsync-config-backup-gitlab1003.wikimedia.org.timer rsync-config-backup-gitlab1003.wikimedia.org.service
Tue 2023-05-09 01:00:00 UTC 13h left            n/a                         n/a          rsync-config-backup-gitlab2002.wikimedia.org.timer rsync-config-backup-gitlab2002.wikimedia.org.service
Tue 2023-05-09 01:00:00 UTC 13h left            n/a                         n/a          rsync-data-backup-gitlab1003.wikimedia.org.timer   rsync-data-backup-gitlab1003.wikimedia.org.service
Tue 2023-05-09 01:00:00 UTC 13h left            n/a                         n/a          rsync-data-backup-gitlab2002.wikimedia.org.timer   rsync-data-backup-gitlab2002.wikimedia.org.service

replicas:

gitlab1003:~$ systemctl list-timers | grep -E 'restore|backup|rsync'
Tue 2023-05-09 02:00:00 UTC 14h left            Mon 2023-05-08 02:00:13 UTC 9h ago       backup-restore.timer                            backup-restore.service

gitlab2002:~$ systemctl list-timers | grep -E 'restore|backup|rsync'
Tue 2023-05-09 02:00:00 UTC 14h left            n/a                         n/a          backup-restore.timer                            backup-restore.service
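
The manual check above could be scripted along these lines. This is a sketch only: check_timers is a hypothetical helper, and the timer names passed to it are the ones visible in the output above.

```shell
# check_timers UNIT...
# Reads `systemctl list-timers` output on stdin and prints every expected
# timer unit that is absent. Returns non-zero if any unit is missing.
check_timers() {
    listing=$(cat)
    missing=0
    for unit in "$@"; do
        # Match the unit name as a whole column (surrounded by spaces).
        if ! printf '%s\n' "$listing" | grep -qF -- " $unit "; then
            printf 'missing: %s\n' "$unit"
            missing=1
        fi
    done
    return "$missing"
}
```

On the production host this might be invoked as systemctl list-timers | check_timers config-backup.timer full-backup.timer, and on a replica as systemctl list-timers | check_timers backup-restore.timer.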