Gitlab switchover (gitlab2002 → gitlab1004)
Closed, Resolved, Public

Description

Docs: https://wikitech.wikimedia.org/wiki/GitLab/Failover

Checklist:

Preparations before downtime:

  • Prepare the required Puppet changes
  • Prepare the required DNS changes
  • Apply gitlab-settings to gitlab1004 and gitlab2002
  • Announce the downtime a few days ahead on the ops and releng lists and via a broadcast message
  • Make sure the daily backup and restore finished successfully on gitlab2002 and gitlab1004:
    • systemctl status full-backup.service
    • systemctl status rsync-data-backup-gitlab1003.wikimedia.org.service
    • systemctl status rsync-data-backup-gitlab1004.wikimedia.org.service
    • systemctl status backup-restore.service

Scheduled downtime:

  • Announce downtime in #wikimedia-gitlab
  • Start the GitLab failover cookbook on the cumin host: cookbook sre.gitlab.failover --switch-from gitlab2002 --switch-to gitlab1004 -t T400252
  • When prompted, merge the Puppet change prepared above
  • When prompted, merge the DNS change prepared above
  • Run authdns-update on the DNS master, following the DNS update instructions
  • Update https://wikitech.wikimedia.org/wiki/GitLab to reflect the new reality
  • Announce end of downtime
  • Copy missing packages to the new host
  • Disable the restore on the old host gitlab2002 in case anything is missing

A fallback checklist for the manual steps is available in T358567 or at https://wikitech.wikimedia.org/wiki/GitLab/Failover#During_failover_(manual_steps).

Event Timeline

Change #1172026 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] Gitlab: switchover from gitlab2002 to gitlab1004

https://gerrit.wikimedia.org/r/1172026

Change #1172029 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/dns@master] Gitlab: switchover from gitlab2002 to gitlab1004

https://gerrit.wikimedia.org/r/1172029

Change #1173331 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: adjust backup and restore schedules for failover

https://gerrit.wikimedia.org/r/1173331

Change #1173331 merged by Jelto:

[operations/puppet@production] gitlab: adjust backup and restore schedules for failover

https://gerrit.wikimedia.org/r/1173331

Jelto triaged this task as High priority.
Jelto updated the task description.
Jelto moved this task from Incoming to Work in Progress on the collaboration-services board.

Change #1173614 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: make sure config backup is scheduled before data backup

https://gerrit.wikimedia.org/r/1173614

Change #1173614 merged by Jelto:

[operations/puppet@production] gitlab: make sure config backup is scheduled before data backup

https://gerrit.wikimedia.org/r/1173614

Today's replica restore was affected by the bug mentioned in T399306#11020503. The restore finished at 06:31 UTC, which would have delayed the planned failover between gitlab2002 and gitlab1004 by 30 minutes. I think this is acceptable, and 06:30 UTC is still the best slot for maintenance. So before running the cookbook tomorrow, we should make sure the restore has already finished:

on gitlab2002:

systemctl status full-backup.service
systemctl status rsync-data-backup-gitlab1003.wikimedia.org.service
systemctl status rsync-data-backup-gitlab1004.wikimedia.org.service

on gitlab1004:

systemctl status backup-restore.service

This logic could be added to a pre-flight check in the cookbook, but for tomorrow's failover it is done manually.
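As a rough illustration of what such a pre-flight check might look like, here is a minimal Python sketch. It is not the cookbook's actual code: the function names, the use of plain SSH, and the assumption that a finished backup or restore unit reports the state "inactive" (as a oneshot service started by a timer normally does) are all assumptions made for this example.

```python
import subprocess
from typing import Callable, Dict, List

# Units named in this task, grouped by the host they are checked on.
UNITS_BY_HOST: Dict[str, List[str]] = {
    "gitlab2002.wikimedia.org": [
        "full-backup.service",
        "rsync-data-backup-gitlab1003.wikimedia.org.service",
        "rsync-data-backup-gitlab1004.wikimedia.org.service",
    ],
    "gitlab1004.wikimedia.org": ["backup-restore.service"],
}


def unit_state(host: str, unit: str) -> str:
    """Return the systemd ActiveState of a unit (assumes plain SSH access)."""
    result = subprocess.run(
        ["ssh", host, "systemctl", "show", "--property=ActiveState", "--value", unit],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip()


def units_not_finished(get_state: Callable[[str, str], str] = unit_state) -> List[str]:
    """List units that are still running, have failed, or could not be queried."""
    blocking = []
    for host, units in UNITS_BY_HOST.items():
        for unit in units:
            state = get_state(host, unit)
            # Assumption: a finished run reports "inactive"; anything else
            # ("activating", "active", "failed", empty) should block the failover.
            if state != "inactive":
                blocking.append(f"{host}:{unit} ({state})")
    return blocking
```

An empty list would mean the failover can start. The state lookup is passed in as a parameter so the logic can be exercised without touching the real hosts.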

Cookbook cookbooks.sre.gitlab.failover (Failover of gitlab from gitlab2002.wikimedia.org to gitlab1004.wikimedia.org) started

Change #1172026 merged by Jelto:

[operations/puppet@production] Gitlab: switchover from gitlab2002 to gitlab1004

https://gerrit.wikimedia.org/r/1172026

Jelto updated the task description.

Change #1172029 merged by Jelto:

[operations/dns@master] Gitlab: switchover from gitlab2002 to gitlab1004

https://gerrit.wikimedia.org/r/1172029

Cookbook cookbooks.sre.gitlab.failover (Failover of gitlab from gitlab2002.wikimedia.org to gitlab1004.wikimedia.org) finished

The switchover to gitlab1004 was successful.

There are two missing packages:

/var/opt/gitlab/gitlab-rails/shared/packages/ae/aa/aeaa6370266a3650553410b0d9f8f3e02aa6bdfe68a2380a118fb3cf4a7d832f/packages/1551/files/12177/research-datasets-content_diff_index.conda.tgz
/var/opt/gitlab/gitlab-rails/shared/packages/ae/aa/aeaa6370266a3650553410b0d9f8f3e02aa6bdfe68a2380a118fb3cf4a7d832f/packages/1551/files/12176/research-datasets-content_diff_index.conda.tgz

which also causes 5xx errors when downloading these packages: https://gitlab.wikimedia.org/groups/repos/-/packages/1551

I'll check if the files can be copied or re-created.

I copied the two files from gitlab2002 to gitlab1004 and the download links in https://gitlab.wikimedia.org/groups/repos/-/packages/1551 work again.
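For future switchovers, a gap like this could be spotted by comparing file listings from both hosts. The sketch below is only an illustration, not part of the cookbook: it assumes each listing holds one path per line (for example the output of a find command run on each host), and the function name is hypothetical.

```python
from typing import Iterable, List


def missing_on_new_host(old_listing: Iterable[str], new_listing: Iterable[str]) -> List[str]:
    """Return paths present in the old host's listing but absent from the new one."""
    # Strip whitespace and ignore blank lines so raw command output can be passed in.
    old_paths = {line.strip() for line in old_listing if line.strip()}
    new_paths = {line.strip() for line in new_listing if line.strip()}
    return sorted(old_paths - new_paths)
```

Any path returned is a candidate for copying from the old host to the new one.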

Change #1174409 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: pause restore on gitlab2002

https://gerrit.wikimedia.org/r/1174409

Change #1174409 merged by Jelto:

[operations/puppet@production] gitlab: pause restore on gitlab2002

https://gerrit.wikimedia.org/r/1174409

Change #1174417 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/cookbooks@master] sre.gitlab.failover: use hostname in wipe-cache

https://gerrit.wikimedia.org/r/1174417

Just for the record, I lost IPv4 connectivity to gitlab.wikimedia.org twice today, for a few minutes each time; the packets were being dropped at Equinix (or so it seemed). The first time was right after the window (no logs, sorry). For the second time I have an mtr report:

03:02 PM (13:02 UTC) ~
dcaro@hephaestus$ mtr --port 443 --report-wide gitlab.wikimedia.org   # ip6 works
Start: 2025-07-30T15:02:31+0200
HOST: hephaestus                            Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 2a04:ee41:89:a0e6:6e99:61ff:fe08:4af7  0.0%    10    5.1   3.7   1.8   5.3   1.2
  2.|-- 2a04:ee41:89:a000::                    0.0%    10    3.8   3.9   2.4   5.4   1.0
  3.|-- ???                                   100.0    10    0.0   0.0   0.0   0.0   0.0
  4.|-- cixp-salt.cern.ch                      0.0%    10    5.2   4.9   3.3   6.6   1.1
  5.|-- cixp-he.ipv6.cern.ch                   0.0%    10    7.7   5.6   3.9   7.7   1.1
  6.|-- ???                                   100.0    10    0.0   0.0   0.0   0.0   0.0
  7.|-- ???                                   100.0    10    0.0   0.0   0.0   0.0   0.0
  8.|-- ???                                   100.0    10    0.0   0.0   0.0   0.0   0.0
  9.|-- ???                                   100.0    10    0.0   0.0   0.0   0.0   0.0
 10.|-- ???                                   100.0    10    0.0   0.0   0.0   0.0   0.0
 11.|-- xe-5-3-3-500.cr1-eqiad.wikimedia.org   0.0%    10   89.6  90.4  88.1  99.4   3.2
 12.|-- gitlab.wikimedia.org                   0.0%    10   88.3  89.2  88.1  91.3   1.0


03:02 PM ~
dcaro@hephaestus$ mtr -4 --port 443 --report-wide gitlab.wikimedia.org   # ip4 fails
Start: 2025-07-30T15:02:54+0200
HOST: hephaestus                          Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- _gateway                             0.0%    10    3.7   3.1   1.6   5.5   1.2
  2.|-- 10.124.92.1                          0.0%    10    4.0   4.9   3.0   7.5   1.5
  3.|-- 10.96.127.165                        0.0%    10    6.4   5.3   4.1   6.4   0.8
  4.|-- 10.96.127.161                        0.0%    10    4.6   4.6   3.6   6.0   0.8
  5.|-- 213.55.155.6                        80.0%    10    3.9   3.8   3.7   3.9   0.1
  6.|-- i68gem-015-ae10.bb.ip-plus.net       0.0%    10    7.2   7.9   4.9  12.7   2.8
  7.|-- i68geb-005-ae11.bb.ip-plus.net       0.0%    10    5.5   5.9   4.9   8.5   1.0
  8.|-- i62bsw-015-hun1-3-1.bb.ip-plus.net   0.0%    10    9.8  13.0   8.3  20.6   4.9
  9.|-- i00iad-005-xe0-0-0x0.bb.ip-plus.net  0.0%    10   90.5  90.6  89.0  92.1   0.9
 10.|-- 14907-dc6-ix.equinix.com             0.0%    10   92.2  91.0  89.3  93.4   1.2
 11.|-- ???                                 100.0    10    0.0   0.0   0.0   0.0   0.0
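To make the reading of the two reports explicit, the following Python sketch extracts the last hop that still answered from an mtr report. It is an illustrative helper only: the names are hypothetical and the regular expression is written against the --report-wide layout shown above.

```python
import re
from typing import List, NamedTuple, Optional


class Hop(NamedTuple):
    number: int
    host: str
    loss: float


# Matches report lines such as " 10.|-- 14907-dc6-ix.equinix.com   0.0%    10 ..."
# The loss column is printed with or without a trailing "%".
HOP_LINE = re.compile(r"^\s*(\d+)\.\|--\s+(\S+)\s+(\d+(?:\.\d+)?)%?\s")


def parse_hops(report: str) -> List[Hop]:
    """Parse the hop lines of an mtr report, skipping the Start and HOST lines."""
    hops = []
    for line in report.splitlines():
        match = HOP_LINE.match(line)
        if match:
            hops.append(Hop(int(match.group(1)), match.group(2), float(match.group(3))))
    return hops


def last_responding_hop(report: str) -> Optional[Hop]:
    """Return the last hop with less than 100% loss, or None if no hop answered."""
    responding = [hop for hop in parse_hops(report) if hop.loss < 100.0]
    return responding[-1] if responding else None
```

Applied to the IPv4 report, this returns hop 10 (14907-dc6-ix.equinix.com), while the IPv6 report ends at gitlab.wikimedia.org. The last responding hop only bounds where the loss starts: packets are dropped at or beyond it, which matches the "or so it seemed" caveat.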

Change #1174464 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: change nftables rate-limiting policy to accept

https://gerrit.wikimedia.org/r/1174464

Change #1174464 merged by Jelto:

[operations/puppet@production] gitlab: disable nftables rate-limiting temporarily

https://gerrit.wikimedia.org/r/1174464

Change #1174476 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: disable nftables rate-limiting monitoring

https://gerrit.wikimedia.org/r/1174476

Change #1174476 merged by Jelto:

[operations/puppet@production] gitlab: disable nftables rate-limiting monitoring

https://gerrit.wikimedia.org/r/1174476

Change #1174479 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gitlab: binding nft throttling and its monitoring

https://gerrit.wikimedia.org/r/1174479

Change #1174479 merged by Arnaudb:

[operations/puppet@production] gitlab: binding nft throttling and its monitoring

https://gerrit.wikimedia.org/r/1174479

Change #1174417 merged by jenkins-bot:

[operations/cookbooks@master] sre.gitlab.failover: use hostname in wipe-cache

https://gerrit.wikimedia.org/r/1174417

Change #1174976 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] Revert "gitlab: pause restore on gitlab2002"

https://gerrit.wikimedia.org/r/1174976

Change #1175043 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: enable nftables throttling again in monitoring mode

https://gerrit.wikimedia.org/r/1175043

Change #1175043 merged by Jelto:

[operations/puppet@production] gitlab: enable nftables throttling again in monitoring mode

https://gerrit.wikimedia.org/r/1175043

Change #1174976 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] Revert "gitlab: pause restore on gitlab2002"

https://gerrit.wikimedia.org/r/1174976

I'll resolve this task on Monday when restore is enabled again.

Change #1174976 merged by Jelto:

[operations/puppet@production] Revert "gitlab: pause restore on gitlab2002"

https://gerrit.wikimedia.org/r/1174976

The switchover from gitlab2002 to gitlab1004 happened last week, and restore is enabled on gitlab2002 again. I also opened tasks for the follow-ups and improvements (see subtasks), so I'll resolve this task.