Switchover gitlab-replica (gitlab2002 -> gitlab1003)
Closed, Resolved · Public

Description

GitLab will also be switched from eqiad to codfw during the March 2023 Datacenter Switchover (T327920), one day before the actual switchover so as not to block dependencies. This task tracks the dry run to fail over the replica from codfw (gitlab2002) to eqiad (gitlab1003).

Docs: https://wikitech.wikimedia.org/wiki/GitLab/Failover
Last task: T307142#7969993 (checklist can be adapted for this year's failover)
Time: TBD, somewhere next week

Checklist: (WIP)

Preparations before downtime:

  • check gitlab2002 and gitlab1003 use the same ssh host keys for the ssh-gitlab daemon (see the sketch after this list)
  • prepare change to set profile::gitlab::service_name: 'gitlab-replica.wikimedia.org' on gitlab1003 (/operations/puppet/+/890779/)
  • Prepare change to point the DNS entries: gitlab-replica.wikimedia.org to gitlab1003 and gitlab-replica-old.wikimedia.org to gitlab2002 (operations/dns/+/890785)
  • configure gitlab1004 as profile::gitlab::active_host (not needed on replica)
  • apply gitlab-settings to gitlab1003 and gitlab2002
  • announce downtime some days ahead on the ops/releng lists? (not needed on replica)
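
A minimal sketch of the host key comparison from the first item above; it assumes the ssh-gitlab daemon reuses the host keys under /etc/ssh/, which may not be where these hosts keep them:

# print the ECDSA host key fingerprint on both replicas; the two lines should match
ssh gitlab2002.wikimedia.org 'ssh-keygen -lf /etc/ssh/ssh_host_ecdsa_key.pub'
ssh gitlab1003.wikimedia.org 'ssh-keygen -lf /etc/ssh/ssh_host_ecdsa_key.pub'
# repeat for the ed25519 and RSA keys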

Scheduled downtime:

  • Announce downtime in #wikimedia-gitlab (not needed on replica)
  • pause all GitLab Runners (not needed on replica)
  • downtime gitlab2002 with sudo cookbook sre.hosts.downtime -r "Running failover to gitlab1003 - T329930" -M 60
  • stop puppet on gitlab2002 with sudo disable-puppet "Running failover to gitlab1003 - T329930"
  • stop GitLab on gitlab2002 with gitlab-ctl stop nginx
  • stop ssh-gitlab daemon on gitlab2002 with systemctl stop ssh-gitlab
  • create full backup on gitlab2002 with /usr/bin/gitlab-backup create CRON=1 STRATEGY=copy GZIP_RSYNCABLE="true" GITLAB_BACKUP_MAX_CONCURRENCY="4" GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY="1"
  • sync backup, on gitlab2002 run /usr/bin/rsync -avp /srv/gitlab-backup/ rsync://gitlab1003.wikimedia.org/data-backup
  • merge change to set profile::gitlab::service_name: 'gitlab-replica.wikimedia.org' on gitlab1003 (/operations/puppet/+/890779/) and run puppet
  • trigger restore on gitlab1003 with sudo systemctl start backup-restore.service (for logs, run journalctl -f -u backup-restore.service)
  • Merge change to point the DNS entries: gitlab-replica.wikimedia.org to gitlab1003 and gitlab-replica-old.wikimedia.org to gitlab2002 (operations/dns/+/890785)
  • verify installation (see the verification sketch after this list)
  • enable puppet on gitlab2002 with sudo run-puppet-agent -e "Running failover to gitlab1003 - T329930"
  • start ssh-gitlab daemon on gitlab2002 with systemctl start ssh-gitlab
  • unpause all GitLab Runners (not needed on replica)
  • announce end of downtime (not needed on replica)
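
A rough sketch of what the "verify installation" step could cover, assuming the restore has finished and the DNS change has propagated; the <group>/<project> path in the last command is a placeholder:

# on gitlab1003: all GitLab components should report as "run"
sudo gitlab-ctl status

# the web UI should answer on the replica name
curl -sI https://gitlab-replica.wikimedia.org/users/sign_in | head -n 1

# smoke-test git over the ssh-gitlab daemon against any existing repository
git ls-remote ssh://git@gitlab-replica.wikimedia.org/<group>/<project>.git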

Once the switchover is successful, we will proceed with the GitLab production switchover in T329931.

Event Timeline

Jelto updated the task description.

Change 890434 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: allow rsync between replicas

https://gerrit.wikimedia.org/r/890434

Change 890434 merged by Jelto:

[operations/puppet@production] gitlab: allow rsync between replicas

https://gerrit.wikimedia.org/r/890434

Change 890779 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] Change the active gitlab replica host to be the eqiad instance

https://gerrit.wikimedia.org/r/890779

Change 890785 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/dns@master] Update DNS to switch gitlab-replica

https://gerrit.wikimedia.org/r/890785

Jelto updated the task description.
Jelto updated the task description.

Mentioned in SAL (#wikimedia-operations) [2023-02-22T11:13:02Z] <eoghan@cumin2002> START - Cookbook sre.hosts.downtime for 1:00:00 on gitlab2002.wikimedia.org with reason: Running failover to gitlab1003 - T329930

Mentioned in SAL (#wikimedia-operations) [2023-02-22T11:13:18Z] <eoghan@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gitlab2002.wikimedia.org with reason: Running failover to gitlab1003 - T329930

Jelto updated the task description.

Change 890779 merged by EoghanGaffney:

[operations/puppet@production] Change the active gitlab replica host to be the eqiad instance

https://gerrit.wikimedia.org/r/890779

Change 890785 merged by EoghanGaffney:

[operations/dns@master] Update DNS to switch gitlab-replica

https://gerrit.wikimedia.org/r/890785

After switching gitlab2002 and gitlab1003, I get an SSH host key warning, although the hosts use the same host key:

$ git push
Warning: the ECDSA host key for 'gitlab-replica.wikimedia.org' differs from the key for the IP address '2620:0:861:1:208:80:154:15'
Offending key for IP in /home/jelto-wmf/.ssh/known_hosts.d/wmf-cloud:63
Matching host key in /home/jelto-wmf/.ssh/known_hosts.d/wmf-cloud:82
Exiting, you have requested strict checking.
Host key verification failed.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

I'm not sure how we handled that last time; I think we announced that the warning would appear after the switchover.

I'm not 100% sure we can fix this short term for the switchover on Monday. Long term, we could try to use a static IP and map it to the current production instance. Maybe we can add the keys to wmf-known-hosts? I'll do some more research.

> I'm not sure how we handled that last time; I think we announced that the warning would appear after the switchover.

I think T296944#8496308 is related here.

> I think T296944#8496308 is related here.

Thanks for the link! We made sure we use the same SSH host key on all GitLab instances, so the reported host keys should not change and should match the ones on the wikitech page.
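
For users who hit the warning, a small sketch of how to check the presented key against the published fingerprints, assuming ssh-gitlab answers on the standard port 22 of the service name:

# fetch the host key the replica currently presents and print its fingerprint
ssh-keyscan -t ecdsa gitlab-replica.wikimedia.org 2>/dev/null | ssh-keygen -lf -
# compare the output with the fingerprint documented on wikitech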

I've done some more troubleshooting, and it seems to be an issue with my local SSH config and known hosts. In preparation for this switchover we made sure all GitLab instances use the same host key; gitlab1003 was using a different pair of host keys and I fixed that. However, I had already connected to that host before and accepted its old host key, so from my point of view the key changed because of that preparation work.

I removed the old host key from my known hosts file and repeated the test by editing my local hosts file. I no longer see the error about a differing key for the IP address.
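
For reference, a minimal sketch of that local cleanup; the file path matches the known_hosts.d/wmf-cloud file from the warning above, adjust it if your entries live in the default ~/.ssh/known_hosts:

# drop the stale entries for the service name and the offending IPv6 address
ssh-keygen -R gitlab-replica.wikimedia.org -f ~/.ssh/known_hosts.d/wmf-cloud
ssh-keygen -R 2620:0:861:1:208:80:154:15 -f ~/.ssh/known_hosts.d/wmf-cloud
# the next connection will prompt once to accept the (now consistent) host key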

This was completed successfully yesterday and everything remains stable 24 hours later.

The initial concern about the SSH keys being incorrect doesn't seem to be a larger issue; after further testing, we don't expect this to be a problem when we do the failover of the live service on Monday.

Ah, cool, this is a nice explanation and outcome. Thanks!