Switchover gitlab replica (gitlab1004 -> gitlab1003) - March 2024
Closed, Resolved, Public, Design

Description

The GitLab replica will be switched in preparation for the March 2024 Datacenter Switchover. This task tracks the failover of the GitLab replica from gitlab1004 to gitlab1003.

Docs: https://wikitech.wikimedia.org/wiki/GitLab/Failover
Time: 2024-03-11

Checklist:

Preparations before downtime:

  • prepare change to set profile::gitlab::service_name: 'gitlab.wikimedia.org' on gitlab1003, set gitlab1003 as profile::gitlab::active_host:, and set profile::gitlab::service_name: 'gitlab-replica-old.wikimedia.org' on gitlab1004 (<patch link>)
  • Prepare change to point DNS entry for gitlab.wikimedia.org to gitlab1003, and gitlab-replica-old.wikimedia.org to gitlab1004 (<patch link>)
  • apply gitlab-settings to gitlab1003 and gitlab1004 (<patch link>)
  • announce downtime some days ahead on ops/releng list/broadcast message
  • run a failover backup on the source host one day in advance with sudo /srv/gitlab-backup/gitlab-backup.sh failover

Scheduled downtime:

  • Announce downtime in #wikimedia-gitlab
  • Start gitlab failover cookbook on the cumin host with cookbook sre.gitlab.failover --switch-from gitlab1004 --switch-to gitlab1003 -t <ticket id>
  • When prompted, merge the puppet change prepared above
  • When prompted, merge the DNS change prepared above and run authdns-update on the DNS master, following the DNS update instructions
  • ~Update https://wikitech.wikimedia.org/wiki/GitLab to reflect the new reality~
  • ~Announce end of downtime~
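
The cookbook invocation in the checklist above can be assembled programmatically, which makes it harder to swap the two hosts by accident. This is a minimal sketch and not part of the cookbook itself: `build_failover_command` is a hypothetical helper, `T000000` is a placeholder for the real ticket ID, and only the flags that appear in the checklist are used.

```python
import shlex
from typing import List


def build_failover_command(switch_from: str, switch_to: str, ticket: str) -> List[str]:
    """Return the argv for the failover cookbook as written in the checklist.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    if switch_from == switch_to:
        raise ValueError("source and target host must differ")
    return [
        "cookbook", "sre.gitlab.failover",
        "--switch-from", switch_from,
        "--switch-to", switch_to,
        "-t", ticket,
    ]


if __name__ == "__main__":
    # T000000 is a placeholder, not the real ticket ID.
    argv = build_failover_command("gitlab1004", "gitlab1003", "T000000")
    print(shlex.join(argv))
```

Printing the command rather than executing it leaves the final copy-and-paste on the cumin host to the operator.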

Falling back to manual steps:

If, for some reason, the cookbook cannot be used, follow these manual failover steps:

  • Announce downtime in #wikimedia-gitlab
  • pause all GitLab Runners (gitlab-settings ./runners active | tee active.txt && ./runners pause < active.txt)
  • downtime gitlab1004 with sudo cookbook sre.hosts.downtime -r "Running failover to gitlab1003 - <ticket id>" -M 120 'gitlab1004.wikimedia.org'
  • stop puppet on gitlab1004 with sudo disable-puppet "Running failover to gitlab1003 - <ticket id>"
  • stop GitLab on gitlab1004 with gitlab-ctl stop nginx
  • stop ssh-gitlab daemon on gitlab1004 with systemctl stop ssh-gitlab
  • create full backup on gitlab1004 with /usr/bin/gitlab-backup create CRON=1 GZIP_RSYNCABLE="true" GITLAB_BACKUP_MAX_CONCURRENCY="4" GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY="1"
  • sync backup, on gitlab1004 run /usr/bin/rsync -avp /srv/gitlab-backup/ rsync://gitlab1003.wikimedia.org/data-backup
  • merge change to set profile::gitlab::service_name: 'gitlab.wikimedia.org' on gitlab1003 and run puppet
  • trigger restore: on gitlab1003, run sudo systemctl start backup-restore.service (for logs, run journalctl -f -u backup-restore.service)
  • merge change to point DNS entry for gitlab.wikimedia.org to gitlab1003 and gitlab-replica-old.wikimedia.org to gitlab1004
  • verify installation
  • enable puppet on gitlab1004 with sudo run-puppet-agent -e "Running failover to gitlab1003 - <ticket id>"
  • start ssh-gitlab daemon on gitlab1004 with systemctl start ssh-gitlab
  • unpause all GitLab Runners (gitlab-settings ./runners unpause < active.txt)
  • announce end of downtime
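
The manual steps above depend on their order, so it can help to hold them as data and print a numbered plan to follow or paste from. The following is a rough sketch under stated assumptions: `Step`, `manual_failover_plan` and `render` are hypothetical names, `T000000` is a placeholder ticket ID, and the commands are copied from the checklist and never executed.

```python
from typing import List, NamedTuple, Optional


class Step(NamedTuple):
    description: str
    command: Optional[str] = None  # None where the checklist gives no command


def manual_failover_plan(source: str, target: str, ticket: str) -> List[Step]:
    """Build the manual checklist in order; nothing is executed."""
    reason = f"Running failover to {target} - {ticket}"
    return [
        Step("announce downtime in #wikimedia-gitlab"),
        Step("pause all GitLab Runners (gitlab-settings)",
             "./runners active | tee active.txt && ./runners pause < active.txt"),
        Step(f"downtime {source}",
             f"sudo cookbook sre.hosts.downtime -r \"{reason}\" -M 120 '{source}.wikimedia.org'"),
        Step(f"stop puppet on {source}", f'sudo disable-puppet "{reason}"'),
        Step(f"stop GitLab on {source}", "gitlab-ctl stop nginx"),
        Step(f"stop ssh-gitlab daemon on {source}", "systemctl stop ssh-gitlab"),
        Step(f"create full backup on {source}",
             '/usr/bin/gitlab-backup create CRON=1 GZIP_RSYNCABLE="true" '
             'GITLAB_BACKUP_MAX_CONCURRENCY="4" GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY="1"'),
        Step(f"sync backup from {source}",
             f"/usr/bin/rsync -avp /srv/gitlab-backup/ rsync://{target}.wikimedia.org/data-backup"),
        Step(f"merge the service_name change for {target} and run puppet"),
        Step(f"trigger restore on {target}", "sudo systemctl start backup-restore.service"),
        Step(f"merge the DNS change pointing gitlab.wikimedia.org to {target}"),
        Step("verify installation"),
        Step(f"enable puppet on {source}", f'sudo run-puppet-agent -e "{reason}"'),
        Step(f"start ssh-gitlab daemon on {source}", "systemctl start ssh-gitlab"),
        Step("unpause all GitLab Runners (gitlab-settings)", "./runners unpause < active.txt"),
        Step("announce end of downtime"),
    ]


def render(plan: List[Step]) -> str:
    """Format the plan as numbered lines, with commands indented below each step."""
    lines = []
    for number, step in enumerate(plan, start=1):
        lines.append(f"{number:2d}. {step.description}")
        if step.command:
            lines.append(f"      $ {step.command}")
    return "\n".join(lines)


if __name__ == "__main__":
    # T000000 is a placeholder, not the real ticket ID.
    print(render(manual_failover_plan("gitlab1004", "gitlab1003", "T000000")))
```

Steps without a command (announcements, merges, verification) stay in the plan so that the numbering matches the checklist.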

Event Timeline

LSobanski renamed this task from Switchover gitlab (<old host> -> <new host>) - <month> <year> to Switchover gitlab replica (gitlab1004 -> gitlab1003) - March 2024. Feb 27 2024, 10:51 AM
LSobanski updated the task description.
LSobanski triaged this task as Medium priority.
LSobanski updated the task description.
LSobanski moved this task from Incoming to Backlog on the collaboration-services board.

Change 1009298 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] [gitlab] Failover test of gitlab replica hosts

https://gerrit.wikimedia.org/r/1009298

Change 1009300 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/dns@master] [gitlab] Failover test of gitlab replica hosts

https://gerrit.wikimedia.org/r/1009300

Cookbook cookbooks.sre.gitlab.failover (Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org) started

Change 1009298 merged by EoghanGaffney:

[operations/puppet@production] [gitlab] Failover test of gitlab replica hosts

https://gerrit.wikimedia.org/r/1009298

Cookbook cookbooks.sre.gitlab.failover (Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org) encountered errors. Rollback started

Change 1010559 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/cookbooks@master] [gitlab] Fix progress_bars parameter (should be print_progress_bars)

https://gerrit.wikimedia.org/r/1010559

Change 1010559 merged by jenkins-bot:

[operations/cookbooks@master] [gitlab] Fix progress_bars parameter (should be print_progress_bars)

https://gerrit.wikimedia.org/r/1010559

Change #1013339 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] [gitlab] Switch gitlab-replica from gitlab1004 to gitlab1003

https://gerrit.wikimedia.org/r/1013339

Cookbook cookbooks.sre.gitlab.failover (Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org) started

Cookbook cookbooks.sre.gitlab.failover (Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org) encountered errors. Rollback started

Cookbook cookbooks.sre.gitlab.failover (Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org) started

Cookbook cookbooks.sre.gitlab.failover (Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org) encountered errors. Rollback started

Cookbook cookbooks.sre.gitlab.failover (Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org) started

Cookbook cookbooks.sre.gitlab.failover (Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org) encountered errors. Rollback started

Cookbook cookbooks.sre.gitlab.failover (Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org) started

Change #1013339 merged by EoghanGaffney:

[operations/puppet@production] [gitlab] Switch gitlab-replica from gitlab1004 to gitlab1003

https://gerrit.wikimedia.org/r/1013339

Cookbook cookbooks.sre.gitlab.failover (Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org) encountered errors. Rollback started

Cookbook cookbooks.sre.gitlab.failover (Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org) encountered errors. Rollback completed

Change #1013585 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] [gitlab] Move backup script locking out of main script root

https://gerrit.wikimedia.org/r/1013585

Change #1013585 merged by EoghanGaffney:

[operations/puppet@production] [gitlab] Move backup script locking out of main script root

https://gerrit.wikimedia.org/r/1013585

Cookbook cookbooks.sre.gitlab.failover (Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org) started

Cookbook cookbooks.sre.gitlab.failover (Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org) encountered errors. Rollback started

Cookbook cookbooks.sre.gitlab.failover (Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org) started

Cookbook cookbooks.sre.gitlab.failover (Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org) encountered errors. Rollback started

Cookbook cookbooks.sre.gitlab.failover (Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org) started

Change #1009300 merged by EoghanGaffney:

[operations/dns@master] [gitlab] Failover test of gitlab replica hosts

https://gerrit.wikimedia.org/r/1009300

Cookbook cookbooks.sre.gitlab.failover (Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org) encountered errors. Rollback started

The cookbook completed successfully; however, it registered a failure because the final puppet run was blocked by a cron job. I've rectified this in another patch to the failover cookbook.

Cookbook cookbooks.sre.gitlab.failover (Failover of gitlab from gitlab1003.wikimedia.org to gitlab1004.wikimedia.org) started

Cookbook cookbooks.sre.gitlab.failover (Failover of gitlab from gitlab1003.wikimedia.org to gitlab1004.wikimedia.org) finished

We'll need a maintenance window of around 4 hours, and we'll use most of it.

We did another switchover of the replica instances today. Here are the rough timings:

  • Initial backup: 1 hour, 57 minutes
  • rsync transfer: 22 minutes
  • gitlab restore: 53 minutes

This brings us to approximately 3 hours 15 minutes, and that's before the puppet runs, waiting for DNS changes to propagate, etc. Conservatively, and if we're quick to interact at all steps, we'll need 3.5 hours. I don't think we'll need much more, but we certainly won't need much less.

I suggest a 4-hour window in two weeks' time for the transfer. If we're in agreement, I'll start sending notices on Thursday and perform the maintenance on Thursday 25th April.
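
The window estimate above can be checked by adding up the three measured phases. This short sketch is only a worked version of that arithmetic: the durations come from the timings listed earlier, and the names `phases`, `measured` and `slack` are illustrative.

```python
from datetime import timedelta

# Durations reported for the replica switchover.
phases = {
    "initial backup": timedelta(hours=1, minutes=57),
    "rsync transfer": timedelta(minutes=22),
    "gitlab restore": timedelta(minutes=53),
}

# Total time spent in the three measured phases.
measured = sum(phases.values(), timedelta())


def slack(window: timedelta) -> timedelta:
    """Time left in a window for puppet runs, DNS propagation and prompts."""
    return window - measured


if __name__ == "__main__":
    print("measured phases:", measured)                                # 3:12:00
    print("slack in 3.5 h: ", slack(timedelta(hours=3, minutes=30)))   # 0:18:00
    print("slack in 4 h:   ", slack(timedelta(hours=4)))               # 0:48:00
```

The measured phases sum to 3 hours 12 minutes, in line with the rounded figure of roughly 3 hours 15 minutes. A 3.5-hour window would leave about 18 minutes for everything else, and a 4-hour window about 48 minutes.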

The switchover of the replicas completed successfully.