Switch over gitlab-replica (gitlab1003 -> gitlab1004)
Closed, Resolved (Public)

Description

Current gitlab-replica: gitlab1003
Current gitlab-replica-old: gitlab1004

This is being done to test the sre.gitlab.failover cookbook.

Checklist (modified from the previous one in T329931 to reflect the new cookbook):

Preparations before downtime:

  • prepare change to set profile::gitlab::service_name: 'gitlab-replica.wikimedia.org' on gitlab1004 and profile::gitlab::service_name: 'gitlab-replica-old.wikimedia.org' on gitlab1003 (operations/puppet/+/909244)
  • prepare change to point the DNS entry for gitlab-replica.wikimedia.org to gitlab1004 and the entry for gitlab-replica-old.wikimedia.org to gitlab1003 (operations/dns/+/909248)
  • apply gitlab-settings to gitlab1004 and gitlab1003
  • announce downtime some days ahead on ops/releng list/broadcast message
  • lower TTL for gitlab DNS records to 300s (also PTR) (This was never changed back)
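
The TTL item above can be checked mechanically before the downtime window. The following is a minimal sketch, assuming the record is inspected with `dig +noall +answer` and that one answer line is passed in as text; the function names and the sample line are hypothetical and not part of any existing tooling.

```python
def parse_answer_ttl(answer_line: str) -> int:
    """Return the TTL in seconds from one line of `dig +noall +answer` output."""
    fields = answer_line.split()
    # Answer lines have the shape: <name> <ttl> <class> <type> <rdata>
    if len(fields) < 5:
        raise ValueError(f"not a DNS answer line: {answer_line!r}")
    return int(fields[1])


def ttl_is_lowered(answer_line: str, max_ttl: int = 300) -> bool:
    """True if the record's TTL is at or below the 300s target from the checklist."""
    # A caching resolver reports the remaining TTL, which counts down,
    # so the comparison is "at most" rather than "exactly".
    return parse_answer_ttl(answer_line) <= max_ttl


# Hypothetical sample line; 192.0.2.10 is a documentation-only address.
sample = "gitlab-replica.wikimedia.org. 300 IN A 192.0.2.10"
print(ttl_is_lowered(sample))
```

Querying an authoritative nameserver directly avoids the countdown effect of a cache and shows the configured TTL.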

Scheduled downtime:

  • Announce downtime in #wikimedia-gitlab
  • Start the GitLab failover cookbook: cookbook sre.gitlab.failover --current-primary gitlab1003 --new-primary gitlab1004 -t T334838
  • _when prompted_ merge change to set profile::gitlab::service_name: 'gitlab-replica.wikimedia.org' on gitlab1004 (operations/puppet/+/909244)
  • _when prompted_ merge change to point the DNS entry for gitlab-replica.wikimedia.org to gitlab1004 and the entry for gitlab-replica-old.wikimedia.org to gitlab1003 (operations/dns/+/909248)
  • verify installation
  • announce end of downtime
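
For repeat switchovers, the cookbook invocation in the checklist above could be assembled from its three inputs rather than edited by hand. This is only a sketch: the flags are the ones quoted in this task, while the helper name `build_failover_command` and the sanity check are assumptions added for illustration.

```python
import shlex


def build_failover_command(current_primary: str, new_primary: str, task_id: str) -> list:
    """Assemble the sre.gitlab.failover invocation as an argv list."""
    # Hypothetical guard: switching a host over to itself is almost certainly a typo.
    if current_primary == new_primary:
        raise ValueError("current and new primary must differ")
    return [
        "cookbook", "sre.gitlab.failover",
        "--current-primary", current_primary,
        "--new-primary", new_primary,
        "-t", task_id,
    ]


argv = build_failover_command("gitlab1003", "gitlab1004", "T334838")
# shlex.join quotes arguments where needed, so the result is safe to paste into a shell.
print(shlex.join(argv))
```

Keeping the command as an argv list, and joining it only for display, avoids quoting mistakes if a value ever contains spaces.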

Steps removed from original checklists (marked as complete for clarity):

  • check gitlab1004 and gitlab2002 use the same ssh host keys for ssh-gitlab daemon
  • pause all GitLab Runners (gitlab-settings ./runners active | tee active.txt && ./runners pause < active.txt)
  • downtime gitlab1004 with sudo cookbook sre.hosts.downtime -r "Running failover to gitlab2002- T329931" -M 120 'gitlab2002.wikimedia.org'
  • stop puppet on gitlab1004 with sudo disable-puppet "Running failover to gitlab2002 - T329931"
  • stop GitLab on gitlab1004 with gitlab-ctl stop nginx
  • stop ssh-gitlab daemon on gitlab1004 with systemctl stop ssh-gitlab
  • create full backup on gitlab1004 with /usr/bin/gitlab-backup create CRON=1 STRATEGY=copy GZIP_RSYNCABLE="true" GITLAB_BACKUP_MAX_CONCURRENCY="4" GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY="1"
  • sync backup: on gitlab1004, run /usr/bin/rsync -avp /srv/gitlab-backup/ rsync://gitlab2002.wikimedia.org/data-backup
  • trigger restore: on gitlab2002, run sudo systemctl start backup-restore.service (for logs, run journalctl -f -u backup-restore.service)
  • enable puppet on gitlab1004 with sudo run-puppet-agent -e "Running failover to gitlab2002 - T329931"
  • start ssh-gitlab daemon on gitlab1004 with systemctl start ssh-gitlab
  • unpause all GitLab Runners (gitlab-settings ./runners unpause < active.txt)

Event Timeline

Change 909244 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] Switch gitlab-replica and gitlab-replica-old hosts

https://gerrit.wikimedia.org/r/909244

Change 909248 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/dns@master] Move DNS names for gitlab-replica{,-old}

https://gerrit.wikimedia.org/r/909248

Change 909244 merged by EoghanGaffney:

[operations/puppet@production] Switch gitlab-replica and gitlab-replica-old hosts

https://gerrit.wikimedia.org/r/909244

Change 909248 merged by EoghanGaffney:

[operations/dns@master] Move DNS names for gitlab-replica{,-old}

https://gerrit.wikimedia.org/r/909248

This switchover was completed successfully. I'll keep this open until tomorrow to verify no issues, then close it.

Cookbook cookbooks.sre.gitlab.failover (Failover of gitlab from gitlab1003.wikimedia.org to gitlab2002.wikimedia.org) started

Cookbook cookbooks.sre.gitlab.failover (Failover of gitlab from gitlab1003.wikimedia.org to gitlab2002.wikimedia.org) finished