
Document and test failover for GitLab and GitLab Replica
Closed, Resolved (Public)

Description

This is a followup task of T285867.
Currently the GitLab replica is only a passive instance. We should work out what is needed to promote the replica to an active instance in an emergency (DC loss, networking issues, hardware issues in Ganeti, ...).

So we need some kind of cookbook or documentation listing the required steps. Some topics that come to mind:

  • restore backup (either from bacula or from production GitLab, if reachable)
  • make sure SSH host keys match so users don't get an error
  • switch DNS entries (here we might need a CNAME, which we can switch easily, instead of A/AAAA records)
  • re-assign Runners(?)
  • change CAS/SSO settings
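
The SSH host key item in the list above can be checked mechanically before DNS is switched. The following is only a minimal sketch, assuming the public host key lines of both hosts have already been collected in the usual `<type> <base64> [comment]` form of a `.pub` file. The function names and the sample key material are hypothetical and are not part of the documented failover procedure.

```python
import base64
import hashlib


def fingerprint(pubkey_line: str) -> str:
    """Return the OpenSSH-style SHA256 fingerprint of one public key line.

    Expects "<type> <base64-blob> [comment]", as stored in *.pub files.
    """
    fields = pubkey_line.split()
    if len(fields) < 2:
        raise ValueError("expected '<type> <base64-blob> [comment]'")
    blob = base64.b64decode(fields[1])
    digest = hashlib.sha256(blob).digest()
    # OpenSSH prints the digest base64-encoded without '=' padding.
    return "SHA256:" + base64.b64encode(digest).decode().rstrip("=")


def host_keys_match(old_keys, new_keys) -> bool:
    """True if both hosts present exactly the same set of host keys."""
    old = {fingerprint(line) for line in old_keys}
    new = {fingerprint(line) for line in new_keys}
    return old == new


if __name__ == "__main__":
    # Hypothetical key material, for illustration only.
    blob = base64.b64encode(b"hypothetical-key-material").decode()
    old_host = ["ssh-ed25519 " + blob + " root@old-host"]
    new_host = ["ssh-ed25519 " + blob + " root@new-host"]
    print(host_keys_match(old_host, new_host))
```

Comparing fingerprints rather than whole lines ignores the trailing comment field, which normally differs between hosts even when the key itself is identical.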

Related Wikitech page: https://wikitech.wikimedia.org/wiki/GitLab/Failover

Event Timeline

Jelto changed the task status from Open to In Progress. Jun 8 2022, 2:33 PM
Jelto claimed this task.
Jelto triaged this task as Medium priority.

We gathered some experience regarding failover when migrating GitLab to the new physical hosts in T307142.

I used the preparation for the migration in T307142 and started documenting the process in https://wikitech.wikimedia.org/wiki/GitLab/Failover. @Arnoldokoth, maybe you can take a look.

I'd like to use this documentation to fail over gitlab-replica from gitlab1003 to gitlab2002. This is a good test of whether we covered all steps, and it leaves the production instance and the replica separated across the two data centers again.

Change 818505 had a related patch set uploaded (by AOkoth; author: AOkoth):

[operations/puppet@production] gitlab: add gitlab role to gitlab2002

https://gerrit.wikimedia.org/r/818505

Change 818505 merged by AOkoth:

[operations/puppet@production] gitlab: add gitlab role to gitlab2002

https://gerrit.wikimedia.org/r/818505

Change 819589 had a related patch set uploaded (by AOkoth; author: AOkoth):

[operations/puppet@production] gitlab: enable restore on gitlab2002

https://gerrit.wikimedia.org/r/819589

Change 819672 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: switch gerrit2002 from gerrit:migration to gerrit role

https://gerrit.wikimedia.org/r/819672

Change 819672 merged by Dzahn:

[operations/puppet@production] gerrit: switch gerrit2002 from gerrit:migration to gerrit role

https://gerrit.wikimedia.org/r/819672

Change 819589 merged by AOkoth:

[operations/puppet@production] gitlab: enable restore on gitlab2002

https://gerrit.wikimedia.org/r/819589

Change 820163 had a related patch set uploaded (by AOkoth; author: AOkoth):

[operations/puppet@production] gitlab: copy ssh host keys for failover

https://gerrit.wikimedia.org/r/820163

Change 820540 had a related patch set uploaded (by AOkoth; author: AOkoth):

[operations/dns@master] gitlab: lower gitlab-replica TTL

https://gerrit.wikimedia.org/r/820540

Change 820163 abandoned by AOkoth:

[operations/puppet@production] gitlab: copy ssh host keys for failover

Reason:

found a workaround

https://gerrit.wikimedia.org/r/820163

Change 820163 restored by AOkoth:

[operations/puppet@production] gitlab: copy ssh host keys for failover

https://gerrit.wikimedia.org/r/820163

Change 820545 had a related patch set uploaded (by AOkoth; author: AOkoth):

[operations/puppet@production] gitlab: change service_name for gitlab-replica

https://gerrit.wikimedia.org/r/820545

Change 820540 merged by AOkoth:

[operations/dns@master] gitlab: lower gitlab-replica TTL

https://gerrit.wikimedia.org/r/820540

Change 820548 had a related patch set uploaded (by AOkoth; author: AOkoth):

[operations/dns@master] gitlab: update gitlab-replica records

https://gerrit.wikimedia.org/r/820548

Change 820545 merged by AOkoth:

[operations/puppet@production] gitlab: change service_name for gitlab-replica

https://gerrit.wikimedia.org/r/820545

Change 820548 merged by AOkoth:

[operations/dns@master] gitlab: update gitlab-replica records

https://gerrit.wikimedia.org/r/820548

# host gitlab-replica-old.wikimedia.org
gitlab-replica-old.wikimedia.org has address 208.80.154.15
gitlab-replica-old.wikimedia.org has IPv6 address 2620:0:861:1:208:80:154:15
# host gitlab-replica.wikimedia.org
gitlab-replica.wikimedia.org has address 208.80.153.8
gitlab-replica.wikimedia.org has IPv6 address 2620:0:860:1:208:80:153:8
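
A check like the one above can also be scripted. This is a rough sketch, assuming the output of `host` has been captured as text; the helper names are hypothetical, and the sample text is simply the output quoted in this comment.

```python
import ipaddress
import re

# Matches lines such as "name has address 1.2.3.4" and
# "name has IPv6 address 2001:db8::1" as printed by host.
_HOST_LINE = re.compile(r"^(\S+) has (?:IPv6 )?address (\S+)$")

SAMPLE = """\
# host gitlab-replica-old.wikimedia.org
gitlab-replica-old.wikimedia.org has address 208.80.154.15
gitlab-replica-old.wikimedia.org has IPv6 address 2620:0:861:1:208:80:154:15
# host gitlab-replica.wikimedia.org
gitlab-replica.wikimedia.org has address 208.80.153.8
gitlab-replica.wikimedia.org has IPv6 address 2620:0:860:1:208:80:153:8
"""


def parse_host_output(text):
    """Map each name to the set of addresses found in host output."""
    records = {}
    for line in text.splitlines():
        match = _HOST_LINE.match(line.strip())
        if match:
            name, addr = match.groups()
            records.setdefault(name, set()).add(ipaddress.ip_address(addr))
    return records


def points_to(records, name, expected):
    """True if name resolves to exactly the expected addresses."""
    wanted = {ipaddress.ip_address(addr) for addr in expected}
    return records.get(name, set()) == wanted


if __name__ == "__main__":
    records = parse_host_output(SAMPLE)
    ok = points_to(
        records,
        "gitlab-replica.wikimedia.org",
        ["208.80.153.8", "2620:0:860:1:208:80:153:8"],
    )
    print(ok)
```

Parsing addresses with `ipaddress` instead of comparing strings means differently abbreviated IPv6 notations still compare as equal.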

@Jelto I managed to switch over the hosts, i.e. gitlab1003 and gitlab2002. The current replica is now gitlab2002. However, the old replica is currently experiencing some SSL issues and I'm not entirely sure why. Perhaps I missed a configuration step, but I didn't do much on the old host other than disabling and re-enabling Puppet.

Thanks to @Dzahn the SSL issue is now resolved.

I did have a question though: what happens to the old replica (gitlab1003)? I left it as a passive host in Puppet, and the DNS entry gitlab-replica-old currently points to it.

@Arnoldokoth For the sake of completeness, could you mention what the SSL issue and fix were?

https://gerrit.wikimedia.org/r/c/operations/puppet/+/820563 @LSobanski This was the fix for the SSL issue. The record gitlab-replica-old had been used before for this purpose, but it was removed when testing was complete.

Change 824244 had a related patch set uploaded (by AOkoth; author: AOkoth):

[operations/dns@master] gitlab: revert gitlab-replica TTL to 600s

https://gerrit.wikimedia.org/r/824244

Change 824244 merged by AOkoth:

[operations/dns@master] gitlab: revert gitlab-replica TTL to 600s

https://gerrit.wikimedia.org/r/824244

Change 820163 abandoned by AOkoth:

[operations/puppet@production] gitlab: copy ssh host keys for failover

Reason:

not needed anymore

https://gerrit.wikimedia.org/r/820163