
Switchover gitlab (gitlab1004 -> gitlab2002)
Closed, Resolved · Public

Description

GitLab will be switched from eqiad to codfw as part of the March 2023 Datacenter Switchover (T327920), one day before the actual switchover so it does not block dependencies. This task tracks the failover of the GitLab production instance in eqiad (gitlab1004) to codfw (gitlab2002).

Docs: https://wikitech.wikimedia.org/wiki/GitLab/Failover
Last task: T307142#7971192 (checklist can be adapted for this year's failover)
Time: 10:00 UTC, 27th of February

Checklist:

Preparations before downtime:

  • check that gitlab1004 and gitlab2002 use the same ssh host keys for the ssh-gitlab daemon (see the sketch after this list)
  • prepare change to set profile::gitlab::service_name: 'gitlab.wikimedia.org' on gitlab2002 and to make gitlab2002 the profile::gitlab::active_host operations/puppet/+/891863
  • prepare change to point the DNS entry for gitlab.wikimedia.org to gitlab2002 and gitlab-replica-old.wikimedia.org to gitlab1004 operations/dns/+/891888
  • apply gitlab-settings to gitlab1004 and gitlab2002
  • announce downtime some days ahead on the ops/releng lists and via a broadcast message
  • lower TTL for gitlab DNS records to 300s (also PTR) operations/dns/+/891886 (see the sketch after this list)
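
A minimal sketch of two of the checks above. The assumption that ssh-gitlab reuses the host keys under /etc/ssh/, and the cumin invocation, are mine and not taken from the failover docs:

  # Compare ssh host key fingerprints on both hosts.
  sudo cumin 'gitlab1004.wikimedia.org,gitlab2002.wikimedia.org' \
    'sha256sum /etc/ssh/ssh_host_*_key.pub'

  # Confirm the lowered TTL is live once operations/dns/+/891886 is deployed;
  # the TTL column of each answer should show 300 or less.
  dig +noall +answer gitlab.wikimedia.org A
  dig +noall +answer -x "$(dig +short gitlab.wikimedia.org A | head -1)"

If the fingerprints differ, the keys need to be synced before the failover; otherwise clients will see host key warnings after the DNS switch.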

Scheduled downtime:

  • Announce downtime in #wikimedia-gitlab
  • pause all GitLab Runners (in gitlab-settings: ./runners active | tee active.txt && ./runners pause < active.txt)
  • downtime gitlab1004 with sudo cookbook sre.hosts.downtime -r "Running failover to gitlab2002 - T329931" -M 120 'gitlab1004.wikimedia.org'
  • stop puppet on gitlab1004 with sudo disable-puppet "Running failover to gitlab2002 - T329931"
  • stop GitLab on gitlab1004 with gitlab-ctl stop nginx
  • stop ssh-gitlab daemon on gitlab1004 with systemctl stop ssh-gitlab
  • create full backup on gitlab1004 with /usr/bin/gitlab-backup create CRON=1 STRATEGY=copy GZIP_RSYNCABLE="true" GITLAB_BACKUP_MAX_CONCURRENCY="4" GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY="1"
  • sync backup, on gitlab1004 run /usr/bin/rsync -avp /srv/gitlab-backup/ rsync://gitlab2002.wikimedia.org/data-backup
  • merge change to set profile::gitlab::service_name: 'gitlab.wikimedia.org' on gitlab2002 and run puppet operations/puppet/+/891863
  • trigger restore on gitlab2002 with sudo systemctl start backup-restore.service (for logs, run journalctl -f -u backup-restore.service)
  • merge change to point the DNS entry for gitlab.wikimedia.org to gitlab2002 and gitlab-replica-old.wikimedia.org to gitlab1004 operations/dns/+/891888
  • verify installation (see the smoke-test sketch after this list)
  • enable puppet on gitlab1004 with sudo run-puppet-agent -e "Running failover to gitlab2002 - T329931"
  • start ssh-gitlab daemon on gitlab1004 with systemctl start ssh-gitlab
  • unpause all GitLab Runners (in gitlab-settings: ./runners unpause < active.txt)
  • announce end of downtime
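
A minimal smoke-test sketch for the verification step, run once the restore on gitlab2002 has finished. gitlab-ctl status and the /-/health and /-/readiness endpoints are standard GitLab Omnibus facilities; whether the health endpoints are reachable from an arbitrary client is an assumption (they are IP-restricted by default):

  # All omnibus-managed services should report "run:".
  sudo gitlab-ctl status

  # Basic application health: expect "GitLab OK" and a JSON "status": "ok".
  curl -fsS https://gitlab.wikimedia.org/-/health
  curl -fsS https://gitlab.wikimedia.org/-/readiness

A further check is an anonymous git ls-remote or clone of a known public project over both HTTPS and SSH.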

Event Timeline

LSobanski moved this task from Incoming to Backlog on the collaboration-services board.
Jelto updated the task description.

I scheduled a new broadcast message on GitLab for the upcoming switchover next Monday. The message will be displayed from tomorrow morning until after the maintenance window on Monday.
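
For reference, such a message can be scheduled through the standard GitLab broadcast messages API; a sketch with placeholder token and timestamps, not the values actually used:

  curl -fsS --request POST \
    --header "PRIVATE-TOKEN: <admin-token>" \
    --data-urlencode "message=GitLab maintenance: failover to codfw on Monday 2023-02-27 from 10:00 UTC" \
    --data "starts_at=2023-02-24T08:00:00Z" \
    --data "ends_at=2023-02-27T12:00:00Z" \
    "https://gitlab.wikimedia.org/api/v4/broadcast_messages"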

Change 891863 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] Switch active gitlab host from gitlab1004 (eqiad) -> gitlab2002 (codfw)

https://gerrit.wikimedia.org/r/891863

Change 891886 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/dns@master] Lower TTL on gitlab records to 300 seconds to facilitate failover

https://gerrit.wikimedia.org/r/891886

Change 891888 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/dns@master] Update records for gitlab from gitlab1004 -> gitlab2002

https://gerrit.wikimedia.org/r/891888

Change 891886 merged by Jelto:

[operations/dns@master] Lower TTL on gitlab records to 300 seconds to facilitate failover

https://gerrit.wikimedia.org/r/891886

Mentioned in SAL (#wikimedia-operations) [2023-02-27T10:17:50Z] <eoghan@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab1004.wikimedia.org with reason: Running failover to gitlab2002- T329931

Mentioned in SAL (#wikimedia-operations) [2023-02-27T10:18:03Z] <eoghan@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab1004.wikimedia.org with reason: Running failover to gitlab2002- T329931

Jelto updated the task description.

Change 891863 merged by EoghanGaffney:

[operations/puppet@production] Switch active gitlab host from gitlab1004 (eqiad) -> gitlab2002 (codfw)

https://gerrit.wikimedia.org/r/891863

Change 891888 merged by EoghanGaffney:

[operations/dns@master] Update records for gitlab from gitlab1004 -> gitlab2002

https://gerrit.wikimedia.org/r/891888

Maintenance finished, GitLab is back again. If you face any issues, feel free to post them here.

I'm leaving the task open in case any issues come up. Otherwise we can close this task tomorrow.

Thanks @eoghan for running the switchover!

A "Backup freshness" alert with summary:"No backups: 1 (gitlab2002), Fresh: 116 jobs"" triggered. This should resolve on the next bacula run this night. Bacula was disabled for the old instance gitlab1004 and enabled for gitlab2002.

Change 892892 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: enable restore for replicas, disable on active_host

https://gerrit.wikimedia.org/r/892892

As mentioned in T330717, the new production host gitlab2002 still had the restore enabled and executed a restore last night.

The restore timer is controlled manually via the hiera key profile::gitlab::enable_restore: true. This key was not updated during the switchover, so the timer was not removed and the restore was triggered at 2:00 UTC, replaying a backup taken at 0:04 UTC.

So we lost roughly two hours of data.
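
A sketch of how to confirm the timer state on both hosts after a failover; the unit name backup-restore.timer is an assumption, derived from the backup-restore.service used during the switchover:

  # On the production host (now gitlab2002): no restore timer should exist.
  sudo systemctl list-timers 'backup-restore*'

  # On the replica (now gitlab1004): the timer should be present and active.
  sudo systemctl status backup-restore.timer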

I posted some updates about this to wikitech-l and engineering-all. I'll also post a short message to Slack and IRC to make sure everyone is aware of it.

The above change should fix this and automatically enable and disable the restore, depending on the state of the host (replica or production host).

I started an incident report in 2023-02-28_GitLab_data_loss.

It seems we were quite lucky and only got one git connection over SSH during that time. I'm still grepping through the other logs to estimate the number of affected users/actions.
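
A sketch of the kind of query involved, against Gitaly's JSON logs on gitlab2002. The log path is the Omnibus default and grpc.method is a standard Gitaly log field, but the exact field layout in our logs is an assumption:

  # Count write RPCs (pushes) during the lost window, 00:04-02:00 UTC.
  sudo jq -r 'select(."grpc.method" == "SSHReceivePack" or ."grpc.method" == "PostReceivePack") | .time' \
    /var/log/gitlab/gitaly/current \
    | awk '$0 >= "2023-02-28T00:04" && $0 < "2023-02-28T02:00"' | wc -l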

Change 892892 merged by Jelto:

[operations/puppet@production] gitlab: enable restore for replicas, disable on active_host

https://gerrit.wikimedia.org/r/892892

The rsync jobs between the production host and the replicas are only created, never removed, when the list of replicas changes. On the former production instance gitlab1004 the jobs rsync-config-backup-gitlab2002.wikimedia.org.timer and rsync-data-backup-gitlab2002.wikimedia.org.timer are still present.

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/gitlab/manifests/rsync.pp#40

The jobs failed because the rsync server is disabled automatically on the new production instance. However, we should find a way to manage the rsync jobs properly; removing them by hand after each failover sounds error-prone.
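
Until Puppet handles removal, a sketch of the manual cleanup on the former production host (unit names taken from above; whether disabling the units conflicts with a later Puppet run is an open question):

  # List any leftover per-replica rsync timers.
  systemctl list-units 'rsync-*-backup-*.timer' --all

  # Stop and disable the stale units on gitlab1004.
  sudo systemctl disable --now rsync-config-backup-gitlab2002.wikimedia.org.timer
  sudo systemctl disable --now rsync-data-backup-gitlab2002.wikimedia.org.timer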

^ this is fixed now, subtask T330744 closed.

I've done some more research into what happened during the backup and restore incident on the new production instance gitlab2002.

According to the GitLab Gitaly and SSH logs, only one commit over SSH happened during the incident. In two other projects I found merges and some kind of upload:

https://logstash.wikimedia.org/goto/3e9666227f80f732b35548191d15e4f8

The API/webserver logs are quite verbose and are not fully parsed in Logstash, so it's a bit tricky to find other actions such as comments or activity on merge requests.

I've spoken with Jelto and we're happy to close this ticket.

The switchover process itself completed largely without issue; however, the data-loss incident did occur that night. We've identified a few safeguards and have either implemented them or opened tickets for them. There is also future work around the architecture of the GitLab service that will be added to the roadmap later.