Current gitlab-replica: gitlab1003
Current gitlab-replica-old: gitlab1004
This is being done to test the sre.gitlab.failover cookbook.
Checklist (modified from the previous T329931 to reflect new cookbook):
Preparations before downtime:
- Prepare change to set profile::gitlab::service_name: 'gitlab-replica.wikimedia.org' on gitlab1004 and profile::gitlab::service_name: 'gitlab-replica-old.wikimedia.org' on gitlab1003 operations/puppet/+/909244
- Prepare change to point DNS entry for gitlab-replica.wikimedia.org to gitlab1004 and gitlab-replica-old.wikimedia.org to gitlab1003 operations/dns/+/909248
- apply gitlab-settings to gitlab1004 and gitlab1003
- announce downtime some days ahead on ops/releng list/broadcast message
- lower TTL for gitlab DNS records to 300s (also PTR); this was never changed back
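Since the TTL was never changed back, it may be worth confirming the live value before the downtime. A minimal sketch, assuming `dig` is available and that its `+noall +answer` output has the TTL in the second column; the helper name `max_ttl` is hypothetical and not part of any cookbook:

```shell
# Hypothetical helper: print the largest TTL found in
# `dig +noall +answer` output read from stdin (0 if there are no records).
max_ttl() {
    awk '$2 ~ /^[0-9]+$/ && $2 + 0 > max { max = $2 + 0 } END { print max + 0 }'
}

# Example usage (needs network access, so it is left commented out):
#   dig +noall +answer gitlab-replica.wikimedia.org | max_ttl
# A value above 300 would suggest the TTL still needs lowering.
```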
Scheduled downtime:
- Announce downtime in #wikimedia-gitlab
- Start gitlab failover cookbook with cookbook sre.gitlab.failover --current-primary gitlab1003 --new-primary gitlab1004 -t T334838
- _when prompted_ merge change to set profile::gitlab::service_name: 'gitlab-replica.wikimedia.org' on gitlab1004 operations/puppet/+/909244
- _when prompted_ merge change to point DNS entry for gitlab-replica.wikimedia.org to gitlab1004 and gitlab-replica-old.wikimedia.org to gitlab1003 operations/dns/+/909248
- verify installation
- announce end of downtime
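One possible part of the verification step is checking that the service name now resolves to the new host. A minimal sketch, assuming `dig` is available and that the host's FQDN is gitlab1004.wikimedia.org (an assumption, not stated above); the helper name `same_addresses` is hypothetical:

```shell
# Hypothetical helper: succeed when two newline-separated address lists
# contain the same entries, regardless of order or duplicates.
same_addresses() {
    [ "$(printf '%s\n' "$1" | sort -u)" = "$(printf '%s\n' "$2" | sort -u)" ]
}

# Example usage once the DNS change is merged (needs network access,
# so it is left commented out):
#   same_addresses "$(dig +short gitlab-replica.wikimedia.org)" \
#                  "$(dig +short gitlab1004.wikimedia.org)" \
#       && echo "gitlab-replica resolves to gitlab1004"
```

This only covers DNS; it does not check that GitLab itself is healthy on the new host.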
Steps removed from original checklists (marked as complete for clarity):
- check gitlab1004 and gitlab2002 use the same ssh host keys for ssh-gitlab daemon
- pause all GitLab Runners (gitlab-settings ./runners active | tee active.txt && ./runners pause < active.txt)
- downtime gitlab1004 with sudo cookbook sre.hosts.downtime -r "Running failover to gitlab2002- T329931" -M 120 'gitlab2002.wikimedia.org'
- stop puppet on gitlab1004 with sudo disable-puppet "Running failover to gitlab2002 - T329931"
- stop GitLab on gitlab1004 with gitlab-ctl stop nginx
- stop ssh-gitlab daemon on gitlab1004 with systemctl stop ssh-gitlab
- create full backup on gitlab1004 with /usr/bin/gitlab-backup create CRON=1 STRATEGY=copy GZIP_RSYNCABLE="true" GITLAB_BACKUP_MAX_CONCURRENCY="4" GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY="1"
- sync backup: on gitlab1004, run /usr/bin/rsync -avp /srv/gitlab-backup/ rsync://gitlab2002.wikimedia.org/data-backup
- trigger restore: on gitlab2002, run sudo systemctl start backup-restore.service (for logs, run journalctl -f -u backup-restore.service)
- enable puppet on gitlab1004 with sudo run-puppet-agent -e "Running failover to gitlab2002 - T329931"
- start ssh-gitlab daemon on gitlab1004 with systemctl start ssh-gitlab
- unpause all GitLab Runners (gitlab-settings ./runners unpause < active.txt)