Switchover gitlab (gitlab1004 -> gitlab2002) - October 2023
Closed, Resolved (Public)

Description

GitLab will be switched over as part of the September 2023 Datacenter Switchover from eqiad to codfw (one week after the actual switchover, so that it does not block dependencies). This task tracks the failover of the GitLab production instance from gitlab1004 in eqiad to gitlab2002 in codfw.

Docs: https://wikitech.wikimedia.org/wiki/GitLab/Failover
Time: Thursday, 5th October 2023, 09:00 UTC

Checklist:

Preparations before downtime:

  • prepare change to set profile::gitlab::service_name: 'gitlab.wikimedia.org' on gitlab2002, set gitlab2002 as profile::gitlab::active_host, and set profile::gitlab::service_name: 'gitlab-replica-old.wikimedia.org' on gitlab1004 (operations/puppet/+/963160/)
  • Prepare change to point DNS entry for gitlab.wikimedia.org to gitlab2002, and gitlab-replica-old.wikimedia.org to gitlab1004 (operations/dns/+/963161)
  • apply gitlab-settings to gitlab2002 and gitlab1004 (releng/gitlab-settings/-/merge_requests/46/diffs)
  • announce downtime some days ahead on ops/releng list/broadcast message
  • run a failover backup on the source host one day in advance with sudo /srv/gitlab-backup/gitlab-backup.sh failover
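
Before the downtime window it can help to confirm that the advance backup actually produced a recent archive. The following is a minimal sketch, not part of the documented procedure: the helper name backup_is_fresh is hypothetical, and it assumes archives land in a single directory and follow GitLab's usual *_gitlab_backup.tar naming.

```shell
#!/bin/sh
# Sketch only: check that a backup archive newer than a given age exists.
# The *_gitlab_backup.tar pattern and the helper name are assumptions.

# backup_is_fresh DIR MAX_AGE_MINUTES
# Succeeds if DIR holds at least one matching archive modified within the window.
backup_is_fresh() {
    dir=$1
    max_age=$2
    # -mmin -N matches files modified less than N minutes ago.
    found=$(find "$dir" -maxdepth 1 -type f -name '*_gitlab_backup.tar' \
        -mmin "-$max_age" 2>/dev/null | head -n 1)
    [ -n "$found" ]
}
```

For example, backup_is_fresh /srv/gitlab-backup 1440 && echo "backup from the last 24h found" would report whether an archive from the last day is present in the directory used by the rsync step.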

Scheduled downtime:

  • Announce downtime in #wikimedia-gitlab
  • Start gitlab failover cookbook on the cumin host with cookbook sre.gitlab.failover --switch-from gitlab1004 --switch-to gitlab2002 -t T345531
  • When prompted, merge the puppet change prepared above
  • When prompted, merge the DNS change prepared above and run authdns-update on the DNS master, following the DNS update instructions
  • Update https://wikitech.wikimedia.org/wiki/GitLab to reflect the new reality
  • Announce end of downtime
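
The cookbook invocation above can be wrapped so that the host names and task ID are checked once before anything runs. This is a hedged sketch: failover_command, run_failover and the DRY_RUN variable are illustrative names and not part of the cookbook itself; only the command line is taken from the checklist.

```shell
#!/bin/sh
# Sketch: assemble the failover cookbook command from the checklist values.
# failover_command, run_failover and DRY_RUN are hypothetical helpers.

# failover_command FROM TO TASK
failover_command() {
    printf 'cookbook sre.gitlab.failover --switch-from %s --switch-to %s -t %s\n' \
        "$1" "$2" "$3"
}

# run_failover FROM TO TASK
# Prints the command by default; executes it only when DRY_RUN=0.
run_failover() {
    cmd=$(failover_command "$1" "$2" "$3")
    if [ "${DRY_RUN:-1}" = "0" ]; then
        # Word splitting is intended: the command contains no quoted arguments.
        $cmd
    else
        echo "DRY RUN: $cmd"
    fi
}
```

Calling run_failover gitlab1004 gitlab2002 T345531 only prints the command; setting DRY_RUN=0 on the cumin host would execute it.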

Falling back to manual steps:

If the cookbook cannot be used for some reason, fail over manually with the following steps:

  • Announce downtime in #wikimedia-gitlab
  • pause all GitLab Runners (gitlab-settings ./runners active | tee active.txt && ./runners pause < active.txt)
  • downtime gitlab1004 with sudo cookbook sre.hosts.downtime -r "Running failover to gitlab2002 - T329931" -M 120 'gitlab1004.wikimedia.org'
  • disable puppet on gitlab1004 with sudo disable-puppet "Running failover to gitlab2002 - T329931"
  • stop GitLab on gitlab1004 with gitlab-ctl stop nginx
  • stop ssh-gitlab daemon on gitlab1004 with systemctl stop ssh-gitlab
  • create full backup on gitlab1004 with /usr/bin/gitlab-backup create CRON=1 GZIP_RSYNCABLE="true" GITLAB_BACKUP_MAX_CONCURRENCY="4" GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY="1"
  • sync backup: on gitlab1004, run /usr/bin/rsync -avp /srv/gitlab-backup/ rsync://gitlab2002.wikimedia.org/data-backup
  • merge change to set profile::gitlab::service_name: 'gitlab.wikimedia.org' on gitlab2002 and run puppet
  • trigger restore: on gitlab2002, run sudo systemctl start backup-restore.service (for logs, run journalctl -f -u backup-restore.service)
  • merge change to point DNS entry for gitlab.wikimedia.org to gitlab2002, and gitlab-replica-old.wikimedia.org to gitlab1004
  • verify installation
  • enable puppet on gitlab1004 with sudo run-puppet-agent -e "Running failover to gitlab2002 - T329931"
  • start ssh-gitlab daemon on gitlab1004 with systemctl start ssh-gitlab
  • unpause all GitLab Runners (gitlab-settings ./runners unpause < active.txt)
  • announce end of downtime
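
The source-host part of the manual steps can be sketched as a small dry-run script. The commands are copied from the checklist; the step and source_host_steps helpers and the DRY_RUN variable are hypothetical, and the sketch assumes a root shell on gitlab1004. Runner pausing, the downtime cookbook, and the steps on gitlab2002 are left out.

```shell
#!/bin/sh
# Sketch of the source-host steps from the manual fallback, in checklist order.
# Defaults to a dry run that only prints each command.

DST=gitlab2002
REASON="Running failover to ${DST} - T329931"

# step CMD [ARGS...]
# Print the command; execute it only when DRY_RUN=0.
step() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

source_host_steps() {
    step sudo disable-puppet "$REASON"
    step gitlab-ctl stop nginx
    step systemctl stop ssh-gitlab
    step /usr/bin/gitlab-backup create CRON=1 GZIP_RSYNCABLE="true" \
        GITLAB_BACKUP_MAX_CONCURRENCY="4" GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY="1"
    step /usr/bin/rsync -avp /srv/gitlab-backup/ "rsync://${DST}.wikimedia.org/data-backup"
}
```

Running source_host_steps prints the five commands for review; DRY_RUN=0 would execute them in order, which is only appropriate during the announced downtime.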

Event Timeline

LSobanski triaged this task as Medium priority. Sep 4 2023, 3:15 PM
LSobanski moved this task from Incoming to Backlog on the collaboration-services board.
LSobanski renamed this task from Switchover gitlab (gitlab1004 -> gitlab2002) to Switchover gitlab (gitlab1004 -> gitlab2002) - September 2023. Sep 5 2023, 12:04 PM

Aside from announcing in IRC, we should probably also announce on Wikitech.

Please note that we moved the switchover to Thursday, 5th October 2023, 09:00 UTC. We'll do a bit more testing with the replica switchover (T345590) first.

eoghan renamed this task from Switchover gitlab (gitlab1004 -> gitlab2002) - September 2023 to Switchover gitlab (gitlab1004 -> gitlab2002) - October 2023. Oct 3 2023, 8:07 AM

This switchover wasn't part of T345265: CommRel support for September 2023 Datacenter Switchover, so it wasn't announced the same way the MediaWiki switchover was.

Aside from announcing in IRC, we should probably also announce on Wikitech.

I got the news from wikitech-l.
Next time, it should also be in Tech News. :)

Change 963160 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] [gitlab/switchover] Change profile::gitlab::service_name for switchover

https://gerrit.wikimedia.org/r/963160

Change 963161 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/dns@master] [gitlab/switchover] Update DNS for gitlab/gitlab-replica

https://gerrit.wikimedia.org/r/963161

Jelto updated the task description.

Cookbook cookbooks.sre.gitlab.failover (Failover of gitlab from gitlab1004.wikimedia.org to gitlab2002.wikimedia.org) started

Change 963160 merged by Jelto:

[operations/puppet@production] [gitlab/switchover] Change profile::gitlab::service_name for switchover

https://gerrit.wikimedia.org/r/963160

Change 963161 merged by Jelto:

[operations/dns@master] [gitlab/switchover] Update DNS for gitlab/gitlab-replica

https://gerrit.wikimedia.org/r/963161

Change 963706 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: make gitlab2002 the active host

https://gerrit.wikimedia.org/r/963706

Change 963706 merged by Jelto:

[operations/puppet@production] gitlab: make gitlab2002 the active host

https://gerrit.wikimedia.org/r/963706

Cookbook cookbooks.sre.gitlab.failover (Failover of gitlab from gitlab1004.wikimedia.org to gitlab2002.wikimedia.org) finished

Change 963739 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/cookbooks@master] gitlab/failover: remove deploy-page at the end of cookbook

https://gerrit.wikimedia.org/r/963739

Change 963739 merged by jenkins-bot:

[operations/cookbooks@master] gitlab/failover: remove deploy-page at the end of cookbook

https://gerrit.wikimedia.org/r/963739

Change 964003 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: install warning banner only on replicas when doing a restore

https://gerrit.wikimedia.org/r/964003

Change 964003 merged by Jelto:

[operations/puppet@production] gitlab: install warning banner only on replicas when doing a restore

https://gerrit.wikimedia.org/r/964003

Production GitLab was switched from gitlab1004 in eqiad to gitlab2002 in codfw. I'll close the task.