upgrade gitlab hosts to bookworm
Closed, ResolvedPublic

Description

The hosts gitlab1001 (wmcs), gitlab1003, gitlab1004, gitlab2002 and gitlab2003 should be upgraded to bookworm. The package gitlab-ce is available in bookworm now.

The backup, reimage and restore cycle takes several hours, so the replicas should be upgraded first. A freshly upgraded replica can then be failed over to become the production host, after which the old production host can be upgraded. This requires some testing of the sre.gitlab.failover cookbook.

Test instance:

  • gitlab1001 (wmcs) -> gitlab1002

In-Setup:

  • gitlab2003

Replicas:

  • gitlab1003
  • gitlab1004

Production (no longer, after T400252):

  • gitlab2002

Failover:

  • Verify failover works between replicas
  • reduce backup and restore time
    • maybe store packages in object storage temporarily? - T378922
    • skip backup and sync of packages and restore later manually?
    • skip sync of packages (I'm currently measuring how many packages would be affected)
  • do a failover for gitlab2002 T400252

Event Timeline

Restricted Application added a subscriber: Aklapper.
Jelto triaged this task as High priority.
Jelto moved this task from Incoming to Work in Progress on the collaboration-services board.

Cookbook cookbooks.sre.hosts.reimage was started by jelto@cumin1003 for host gitlab2003.wikimedia.org with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jelto@cumin1003 for host gitlab2003.wikimedia.org with OS bookworm completed:

  • gitlab2003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507140928_jelto_1658147_gitlab2003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1169078 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: install correct gitlab-ce package on bookworm

https://gerrit.wikimedia.org/r/1169078

Change #1169078 merged by Jelto:

[operations/puppet@production] gitlab: install correct gitlab-ce package on bookworm

https://gerrit.wikimedia.org/r/1169078

The new test instance gitlab-1002 is running bookworm. I'll do a few more tests but it looks promising so far.

Cookbook cookbooks.sre.hosts.reimage was started by jelto@cumin1003 for host gitlab1003.wikimedia.org with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jelto@cumin1003 for host gitlab1003.wikimedia.org with OS bookworm completed:

  • gitlab1003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507150848_jelto_1796834_gitlab1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jelto@cumin1003 for host gitlab1004.wikimedia.org with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jelto@cumin1003 for host gitlab1004.wikimedia.org with OS bookworm completed:

  • gitlab1004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507160750_jelto_1931446_gitlab1004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jelto@cumin1003 for host gitlab1004.wikimedia.org with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jelto@cumin1003 for host gitlab1004.wikimedia.org with OS bookworm executed with errors:

  • gitlab1004 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console gitlab1004.wikimedia.org" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jelto@cumin1003 for host gitlab1004.wikimedia.org with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jelto@cumin1003 for host gitlab1004.wikimedia.org with OS bookworm executed with errors:

  • gitlab1004 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console gitlab1004.wikimedia.org" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jelto@cumin1003 for host gitlab1004.wikimedia.org with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jelto@cumin1003 for host gitlab1004.wikimedia.org with OS bookworm completed:

  • gitlab1004 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507160938_jelto_1944197_gitlab1004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Host gitlab1004.wikimedia.org rebooted by jelto@cumin1003 with reason: None

Host gitlab1004.wikimedia.org rebooted by jelto@cumin1003 with reason: None

gitlab1004 was reimaged (multiple times, because of issues with the partman config, T399714). The RAID is currently syncing and the first array is 50% done. I'll leave the sync running overnight.

Host gitlab1004.wikimedia.org rebooted by jelto@cumin1003 with reason: None

Host gitlab1004.wikimedia.org rebooted by jelto@cumin1003 with reason: None

Both replicas have been upgraded to bookworm. The reimage of gitlab1004 was a bit tricky because of problems with the partman config (T399714: GitLab partman config is unreliable). After the third reimage the RAID arrays were configured properly (mostly).

The next step is to verify the failover cookbook with a failover between the replicas. I'd like to try a failover without packages to reduce the runtime significantly. New packages are created only occasionally (roughly every few hours). Losing a few packages is a better trade-off than 5h+ of downtime for all users. The packages could be re-created or copied manually later on.

In the graph below you can see that the rate of new package creations is quite low and that there are long stretches with zero new packages. Maybe there is a good slot (besides weekends and nights) to do the failover and reduce the chance of missing packages:

package_additions_deletions.jpg (980×3 px, 233 KB)
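As a rough illustration of how a quiet slot could be picked from such data, here is a minimal sketch that finds the longest run of hours with zero new packages. The hourly counts are made-up sample values, not the real measurements from the graph above, and the function name is hypothetical.

```python
from typing import Sequence, Tuple


def longest_quiet_window(counts: Sequence[int]) -> Tuple[int, int]:
    """Return (start_index, length) of the longest run of zero counts."""
    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    for i, count in enumerate(counts):
        if count == 0:
            if run_len == 0:
                run_start = i  # a new quiet run begins here
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0  # a package was created, the quiet run ends
    return best_start, best_len


# Hypothetical sample: new packages created per hour (NOT real data).
hourly_new_packages = [0, 0, 1, 0, 0, 0, 0, 2, 0, 1, 0, 0]
start, length = longest_quiet_window(hourly_new_packages)
print(f"longest quiet window starts at hour {start} and lasts {length}h")
```

A window at least as long as the planned maintenance would be the safest choice. Past activity does not guarantee that no package is created during the failover, so this only reduces the risk.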

I've done some more benchmarks. A backup without artifacts and packages currently takes roughly 15 minutes (compared to 3+ hours for a full backup). So my proposal is to skip the backup and transfer of packages (artifacts are already on object storage) and accept the risk of losing a few packages in exchange for a shorter maintenance window.

One problem is a recent bug in GitLab which increases the restore time by roughly 15-30 minutes. Restores typically take around 45 minutes but can take more than 90 minutes when the bug occurs. I also still have to check whether the restore is faster with a backup that does not contain packages.

The estimated downtime for the switchover without packages is around 60-120 minutes, so I think it's reasonable to announce a 2-hour downtime for the failover.

  • the backup takes around 15 minutes
  • the sync of the ~35 GB backup takes 5 minutes
  • the restore takes 40 minutes, worst case 90 minutes
  • a few minutes for merging DNS and puppet
  • some buffer
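The arithmetic behind this estimate can be sketched as follows. The backup, sync and restore durations are the ones listed above. The 5 minutes for DNS and puppet is an assumed value for "a few minutes", and the buffer is simply computed as whatever remains of the announced 2-hour window.

```python
# Durations in minutes, taken from the list above.
BACKUP = 15
SYNC = 5
RESTORE_TYPICAL = 40
RESTORE_WORST = 90

# Assumption: "a few minutes" for merging DNS and puppet is taken as 5.
DNS_PUPPET = 5

# Announced maintenance window (2 hours).
WINDOW = 120


def total(restore_minutes: int) -> int:
    """Sum of all failover steps for a given restore duration."""
    return BACKUP + SYNC + restore_minutes + DNS_PUPPET


for label, restore in (("typical", RESTORE_TYPICAL), ("worst case", RESTORE_WORST)):
    minutes = total(restore)
    print(f"{label}: {minutes} min, remaining buffer: {WINDOW - minutes} min")
```

Under these assumptions the typical case comes to 65 minutes and the worst case to 115 minutes, which is consistent with the 60-120 minute range and leaves little buffer if the restore bug occurs.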

I think this option is better than a full backup and restore cycle, which takes over 5 hours.

I also verified old packages are still available on the replica, for example here: https://gitlab-replica-a.wikimedia.org/groups/repos/-/packages

Change #1171554 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: exclude packages from failover backup

https://gerrit.wikimedia.org/r/1171554

Change #1171554 merged by Arnaudb:

[operations/puppet@production] gitlab: exclude packages from failover backup

https://gerrit.wikimedia.org/r/1171554

Cookbook cookbooks.sre.hosts.reimage was started by jelto@cumin1003 for host gitlab2002.wikimedia.org with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jelto@cumin1003 for host gitlab2002.wikimedia.org with OS bookworm executed with errors:

  • gitlab2002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console gitlab2002.wikimedia.org" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jelto@cumin1003 for host gitlab2002.wikimedia.org with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jelto@cumin1003 for host gitlab2002.wikimedia.org with OS bookworm completed:

  • gitlab2002 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202508050959_jelto_670673_gitlab2002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

All gitlab hosts are on bookworm now, so I'll close the task.

The test instance gitlab-1001, which was running debian-11.0-bullseye, has been shut off for 2 months. I'll delete the instance from devtools as a last cleanup step.