
upgrade contint servers to bullseye
Open, In Progress, High, Public

Description

The CI servers (Puppet role(ci)) are running Debian Buster and must be upgraded to Bullseye. As of April 2024 the hosts are:

  • contint2002.wikimedia.org | Zuul, Zuul merger, Jenkins, Jenkins agent
  • contint1002.wikimedia.org | Zuul merger, Jenkins agent

There are dependencies to resolve before the upgrade can happen. Notably, Zuul requires python2.7, which is no longer officially supported on Bullseye and is only included for the purpose of building Chromium.

Runbook

The reimaging of the two hosts is done in three phases:

  1. Reimage contint1002
  2. Switch over services from contint1002 to contint2002
  3. Reimage contint2002

1) reimage contint1002.wikimedia.org

We need to bring down the two services on the host, reimage it and bring back the two services. The host is:

  • attached as a Jenkins agent to the Jenkins controller which runs on the other host
  • running the secondary zuul-merger daemon (and its companion git-daemon)

Disable services

  • Disable the zuul-merger on contint1002 by setting profile::zuul::merger::enable: false. That should stop and mask the service. There is another Zuul merger system running on contint2002.wikimedia.org.
  • Run the host downtime cookbook (sre.hosts.downtime) to disable monitoring and alerts

Reimage

  • Reimage contint1002 to Bullseye. Data in /srv can be wiped; it is merely used for caching (git repos, docker images and build layers).
  • While the cookbook is still running, once the host is back up and SSH access has been restored, manually run "sudo a2dismod mpm_event" and run Puppet again. The cookbook should then detect a successful Puppet run and finish cleanly.

Enable services

After host is back and provisioned, verify:

  • /srv is a standalone partition!
  • Docker daemon is started.
  • Zuul has been deployed (not by Puppet): /srv/deployment/zuul/venv/bin/zuul-merger. - FAILED
  • git-daemon is up (systemctl status git-daemon).
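
The checks above could be scripted. The following is a minimal sketch, not part of the existing tooling: the helper names check and verify_host are made up for illustration, and it assumes mountpoint (util-linux) and systemctl are available on the host. Note that mountpoint -q only tests that /srv is a mount point, which approximates "standalone partition".

```shell
#!/bin/sh
# Hypothetical helper: run a command quietly and report PASS/FAIL for it.
failures=0
check() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS: $label"
    else
        echo "FAIL: $label"
        failures=$((failures + 1))
    fi
}

# Hypothetical wrapper around the four checks listed in the runbook.
verify_host() {
    check "/srv is a standalone partition" mountpoint -q /srv
    check "Docker daemon is started" systemctl is-active --quiet docker
    check "Zuul is deployed" test -x /srv/deployment/zuul/venv/bin/zuul-merger
    check "git-daemon is up" systemctl is-active --quiet git-daemon
}
```

Calling verify_host on the reimaged host prints one line per check; afterwards $failures holds the number of failed checks.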

Enable the services:

  • Enable the Jenkins agent via https://integration.wikimedia.org/ci/computer/contint1002/ and verify the SSH host key again, since reimaging changes the host key.
  • Set profile::zuul::merger::enable: true. Running Puppet will unmask it and start the service. It logs in /var/log/zuul/merger.log.

2) Switch over services

Before reimaging contint2002, we need to move its services to the reimaged contint1002.

Before the maintenance

  • Disable the zuul-merger on contint2002 by setting profile::zuul::merger::enable: false. That should stop and mask the service. There is another Zuul merger system running on contint1002.wikimedia.org.
  • Clean up some of the Jenkins artifacts to reduce the amount of data that will be transferred

Rsync data and states

Synchronize data and states to pre warm the other host:

  • sudo rsync -ap --whole-file --delete-delay --info=progress2 /srv/jenkins/ rsync://contint1002.wikimedia.org/ci--srv-jenkins-
  • sudo rsync -ap --whole-file --delete-delay --info=progress2 /var/lib/jenkins/ rsync://contint1002.wikimedia.org/ci--var-lib-jenkins-
  • sudo rsync -ap --whole-file --delete-delay --info=progress2 /var/lib/zuul/ rsync://contint1002.wikimedia.org/ci--var-lib-zuul-
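
The three commands follow one pattern: the rsync module name appears to be the source path with slashes replaced by dashes, prefixed with ci-. Below is a sketch that prints (does not run) the commands, assuming that naming pattern holds; rsync_cmd and print_sync_plan are hypothetical helper names, not existing tooling.

```shell
#!/bin/sh
# Hypothetical helper: print the rsync command for one source directory.
# Assumes the module naming seen in the runbook: "ci-" + path with "/" -> "-".
rsync_cmd() {
    src=$1
    dest_host=$2
    module="ci-$(printf '%s' "$src" | tr '/' '-')"
    printf 'sudo rsync -ap --whole-file --delete-delay --info=progress2 %s rsync://%s/%s\n' \
        "$src" "$dest_host" "$module"
}

# Print the commands for the three directories named in the runbook.
print_sync_plan() {
    for dir in /srv/jenkins/ /var/lib/jenkins/ /var/lib/zuul/; do
        rsync_cmd "$dir" "$1"
    done
}
```

Running print_sync_plan contint1002.wikimedia.org prints the three commands for review before anything is copied.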

Switch over

  • Downtime both contint2002 and contint1002
  • Disable Puppet
  • Stop the services: sudo systemctl stop jenkins and sudo systemctl stop zuul

Rsync data and states

Now that services are stopped, resynchronize all artifacts and states:

  • sudo rsync -ap --whole-file --delete-delay --info=progress2 /srv/jenkins/ rsync://contint1002.wikimedia.org/ci--srv-jenkins-
  • sudo rsync -ap --whole-file --delete-delay --info=progress2 /var/lib/jenkins/ rsync://contint1002.wikimedia.org/ci--var-lib-jenkins-
  • sudo rsync -ap --whole-file --delete-delay --info=progress2 /var/lib/zuul/ rsync://contint1002.wikimedia.org/ci--var-lib-zuul-

Change DNS

  • Change contint.wikimedia.org CNAME from contint2002.wikimedia.org to contint1002.wikimedia.org
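
After the DNS change is deployed, the CNAME can be checked from a shell. This is a sketch only: cname_is is a hypothetical helper that parses the output of dig +short, and resolver caches may keep returning the old target until the record's TTL expires.

```shell
#!/bin/sh
# Hypothetical helper: read `dig +short NAME CNAME` output on stdin and
# succeed only if the first answer equals the expected target.
cname_is() {
    expected=$1
    IFS= read -r answer || true
    # dig prints fully qualified names with a trailing dot; strip it.
    answer=${answer%.}
    [ "$answer" = "$expected" ]
}
```

Example: dig +short contint.wikimedia.org CNAME | cname_is contint1002.wikimedia.org && echo "DNS switched"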

Change primary host in Puppet/Hiera/CI config

  • profile::ci::manager_host: contint1002.wikimedia.org
  • In profile::zuul::merger::conf change gearman_server to the IP of contint1002.wikimedia.org: 208.80.153.39
  • Run Puppet on contint1002 to point the zuul-merger to the new host

Start services

  • Update Zuul config: from integration/config: ./fab deploy_zuul
  • Enable and run Puppet on contint1002 which should bring up both Jenkins and Zuul

Verify:

3) reimage contint2002.wikimedia.org

TODO: copy and paste the "1) reimage contint1002.wikimedia.org" checklist here.


Event Timeline

LSobanski moved this task from Incoming to Backlog on the collaboration-services board.

Change 914731 had a related patch set uploaded (by Dzahn; author: Hashar):

[operations/puppet@production] Use same php version for doc and integration websites

https://gerrit.wikimedia.org/r/914731

Change 914731 merged by Dzahn:

[operations/puppet@production] Use same php version for doc and integration websites

https://gerrit.wikimedia.org/r/914731

Today we upgraded the PHP version from 7.3 to 7.4 on all contint hosts. More details are at T294276#8904580.

Hey @hashar @thcipriani @LSobanski @Jdforrester-WMF Let me summarize the situation now. We have 3 machines and they are:

  • 3 of 3 are using PHP 7.4 (since today) (T334954)
  • 3 of 3 are on buster (somehow I thought the new unused one was bullseye already, but no and I forgot why but there was some reason) (T324659)
  • 2 of 3 are hardware under warranty; one is old and needs to go. Unfortunately that is also the main prod one (T294276)
  • 3 of 3 should be upgraded to bullseye asap (T327068)

Given this, I think we should next:

  • reimage contint2002 with bullseye (new but unused hardware, no server name change)
  • reimage contint1002 with bullseye (current cold-standby / failover server in prod but not really used? but is it really not used at all or similar to gerrit-replica)
  • test if we like everything
  • make contint2002 the new prod server
  • shut down contint2001
  • be happy having solved both "hardware refresh" and got rid of buster

I hope today's 7.3 to 7.4 PHP upgrade is helpful to let us do this soon with (more) confidence.

Thoughts?

LGTM. One question to @hashar is whether we still want the primary to be in codfw or would it be better to move to eqiad as part of this?

One question to @hashar is whether we still want the primary to be in codfw or would it be better to move to eqiad as part of this?

In short, I don't know. There is a long tail of checks that need to happen for the Bullseye upgrade and I haven't checked any of them yet. Off the top of my head:

  • Java 11 and Jenkins (hopefully straightforward)
  • The impact of the git upgrade (T335354) and changes to git-daemon
  • Python2.7 for Zuul and Zuul-merger, given python2.7 is "no more" in Bullseye (Debian #975014). Moritz and I already talked about it. I need to:
    • check the Python modules required by Zuul and vendor them in the deploy repo
    • overhaul how it is deployed (the current system is based on some Makefile from 2017) and the deploy repo hasn't been touched in 3 years.

reimage contint1002 with bullseye (current cold-standby / failover server in prod but not really used? but is it really not used at all or similar to gerrit-replica)

Same as the Gerrit replica, contint1002 is a production server and in active use. It supports the Zuul merger and is a Jenkins agent. Those services are active on both hosts.

test if we like everything
make contint2002 the new prod server

That one would additionally run the Jenkins controller and Zuul scheduler. So it is rather risky to switch them from one host to another AND from Buster to Bullseye at the same time.

It sounds handy to keep contint2001 around for the purpose of the OS upgrade.

I'd need to think about it.

There is also T324659 to make it possible to switch over the services from host to host since last time that caused major havoc and is the reason the services are still on codfw.

I have simplified the task since there was a lot of overlap with the hardware refresh. Notably:

It can or can not be mixed with the hardware replacement but we should not wait too long for either of these things.

From past experience, we should not mix up a hardware replacement with an OS upgrade. That causes too many overlapping problems. Anyway, both servers have been replaced, so we can now do the Bullseye upgrade once all blockers have been investigated / addressed. I identified some above at T334517#8905950 and would certainly file them as subtasks to cut the spam here :)

Change 987458 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] contint: use php7.4 on bullseye just like on buster

https://gerrit.wikimedia.org/r/987458

Change 987458 abandoned by Dzahn:

[operations/puppet@production] contint: use the same PHP packages on contint before and after distro upgrade

Reason:

discussed in meeting - we are going to use the Debian packages on bullseye

https://gerrit.wikimedia.org/r/987458

Mentioned in SAL (#wikimedia-cloud) [2024-04-11T19:03:05Z] <mutante> - deleting instance contint-bullseye which was only used by me for a test before we created contint1003 in prod T334517 T361224

Dzahn changed the task status from Open to In Progress. Tue, Apr 16, 5:26 PM
Dzahn claimed this task.

We had another meeting about this today and we said to:

  • reimage contint1002 any time
  • enable rsyncing of data in /srv/jenkins from contint2002 to contint1002 in puppet code
  • pre-sync files with manual rsync after puppet changes allow it
  • prepare patches needed for switch-over of the active server from contint2002 to contint1002
  • schedule switch-over for next week
  • do the switch-over, merge all the patches prepared before maintenance window
  • if everything works on bullseye reimage contint2002
  • .. switch back or not?

Change #1020296 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] contint: disable zuul merger on contint1002, preparing for reimage

https://gerrit.wikimedia.org/r/1020296

Change #1020296 merged by Dzahn:

[operations/puppet@production] contint: disable zuul merger on contint1002, preparing for reimage

https://gerrit.wikimedia.org/r/1020296

zuul-merger on contint1002 has been masked.

Loaded: masked (Reason: Unit zuul-merger.service is masked.)
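
For reference, the masked state can also be checked non-interactively. A minimal sketch: is_masked is a hypothetical helper that looks for the "Loaded: masked" line shown above in systemctl status output. As an alternative, systemctl is-enabled prints "masked" for a masked unit.

```shell
#!/bin/sh
# Hypothetical helper: read `systemctl status UNIT` output on stdin and
# succeed only if the "Loaded:" line reports the unit as masked.
is_masked() {
    grep -Eq '^[[:space:]]*Loaded:[[:space:]]*masked'
}
```

Example: systemctl status zuul-merger | is_masked && echo "zuul-merger is masked"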

James marked the Jenkins agent as "temp offline" on contint1002.

Downtiming and reimaging now.

Mentioned in SAL (#wikimedia-operations) [2024-04-16T17:48:44Z] <dzahn@cumin2002> START - Cookbook sre.hosts.downtime for 2:00:00 on contint1002.wikimedia.org with reason: reimage https://phabricator.wikmedia.org/T334517

Mentioned in SAL (#wikimedia-operations) [2024-04-16T17:48:59Z] <dzahn@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on contint1002.wikimedia.org with reason: reimage https://phabricator.wikmedia.org/T334517

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host contint1002.wikimedia.org with OS bullseye

Error: Could not enable docker: 
Error: /Stage[main]/Profile::Ci::Docker/Service[docker]/enable: change from 'false' to 'true' failed: Could not enable docker: 

...

Error: '/usr/sbin/a2enmod php7.4' returned 1 instead of one of [0]
Error: /Stage[main]/Httpd/Httpd::Mod_conf[php7.4]/Exec[ensure_present_mod_php7.4]/returns: change from 'notrun' to ['0'] failed: '/usr/sbin/a2enmod php7.4' returned 1 instead of one of [0] (corrective)
Error: Could not update: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold --force-yes install docker-ce=5:20.10.12~3-0~debian-bullseye' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
W: --force-yes is deprecated, use one of the options starting with --allow instead.
E: Version '5:20.10.12~3-0~debian-bullseye' for 'docker-ce' was not found
Error: /Stage[main]/Profile::Ci::Docker/Package[docker-ce]/ensure: change from 'purged' to '5:20.10.12~3-0~debian-bullseye' failed: Could not update: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold --force-yes install docker-ce=5:20.10.12~3-0~debian-bullseye' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
W: --force-yes is deprecated, use one of the options starting with --allow instead.
E: Version '5:20.10.12~3-0~debian-bullseye' for 'docker-ce' was not found

Change #1020316 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] contint: set bullseye docker version just for host contint1002

https://gerrit.wikimedia.org/r/1020316

Change #1020316 merged by Dzahn:

[operations/puppet@production] contint: set bullseye docker version just for host contint1002

https://gerrit.wikimedia.org/r/1020316

Mentioned in SAL (#wikimedia-operations) [2024-04-16T18:40:01Z] <mutante> contint1002 - sudo a2dismod mpm_event to work around known race condition and fix failed initial puppet run - T334517

Error: '/usr/sbin/a2enmod php7.4' returned 1 instead of one of [0]
Error: /Stage[main]/Httpd/Httpd::Mod_conf[php7.4]/Exec[ensure_present_mod_php7.4]/returns: change from 'notrun' to ['0'] failed: '/usr/sbin/a2enmod php7.4' returned 1 instead of one of [0] (corrective)

Known race condition that keeps happening on many hosts with puppetized apache.

Fixed with manual sudo a2dismod mpm_event followed by another manual puppet run.

E: Version '5:20.10.12~3-0~debian-bullseye' for 'docker-ce' was not found

Fixed by setting the correct version just for this host name without changing the default.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/1020316
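
A pinned version like this could be checked against the configured apt repositories before a reimage. The sketch below is an illustration, not what was run here: version_available is a hypothetical helper that parses the pipe-separated output of apt-cache madison.

```shell
#!/bin/sh
# Hypothetical helper: read `apt-cache madison PACKAGE` output on stdin and
# succeed only if the wanted version appears in the version column.
version_available() {
    awk -F'|' -v want="$1" '
        { v = $2; gsub(/^[ \t]+|[ \t]+$/, "", v) }
        v == want { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}
```

Example: apt-cache madison docker-ce | version_available '5:20.10.12~3-0~debian-bullseye' || echo "pinned version not found in any configured repository"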

Now a puppet run shows no more errors, and since I was able to apply these fixes while the cookbook was still running and trying to detect a successful puppet run, it SHOULD not fail the cookbook now.

[9/60, retrying in 270.00s] Attempt to run 'spicerack.puppet.PuppetHosts.wait_since' raised: Unable to find a successful Puppet run
Caused by: Cumin execution failed (exit_code=2)

Waiting for 10/60 to pick up the fix.
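
The "[9/60, retrying in 270.00s]" line shows the cookbook polling for a successful Puppet run. As an illustration only (this is not spicerack's code, and the real delay may not be fixed), a bounded retry loop in shell could look like the following; retry is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: retry MAX DELAY CMD [ARGS...]
# Runs CMD up to MAX times, sleeping DELAY seconds between attempts.
# Returns 0 as soon as CMD succeeds, 1 if every attempt fails.
retry() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -lt "$max" ]; then
            echo "[$attempt/$max, retrying in ${delay}s]" >&2
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}
```

This is why a fix applied between attempts can still let the run finish cleanly: each attempt re-evaluates the condition from scratch.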

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host contint1002.wikimedia.org with OS bullseye completed:

  • contint1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404161807_dzahn_1496931_contint1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

verifying that:

  • /srv is a standalone partition! - /dev/mapper/vg0-srv on /srv type ext4 (rw,relatime) ✅
  • Docker daemon is started. - Active: active (running) since Tue 2024-04-16 18:38:19 UTC; 14min ago ✅
  • Zuul has been deployed by Puppet: /srv/deployment/zuul/venv/bin/zuul-merger - /srv/deployment/zuul/venv/bin/zuul-merger' (No such file or directory) ✗
  • git-daemon is up - Active: active (running) since Tue 2024-04-16 18:21:13 UTC; 36min ago ✅

Mentioned in SAL (#wikimedia-operations) [2024-04-16T19:12:27Z] <hashar@deploy1002> Started deploy [zuul/deploy@efce3ee]: Redeploy Zuul following host reimaging - T334517

Mentioned in SAL (#wikimedia-operations) [2024-04-16T19:12:32Z] <hashar@deploy1002> Finished deploy [zuul/deploy@efce3ee]: Redeploy Zuul following host reimaging - T334517 (duration: 00m 03s)

Mentioned in SAL (#wikimedia-operations) [2024-04-16T19:14:35Z] <hashar@deploy1002> Started deploy [zuul/deploy@efce3ee]: Redeploy Zuul following host reimaging - T334517

Mentioned in SAL (#wikimedia-operations) [2024-04-16T19:14:49Z] <hashar@deploy1002> Finished deploy [zuul/deploy@efce3ee]: Redeploy Zuul following host reimaging - T334517 (duration: 00m 13s)

Change #1020329 had a related patch set uploaded (by Dzahn; author: Hashar):

[operations/puppet@production] zuul: require python2.7

https://gerrit.wikimedia.org/r/1020329

Change #1020329 merged by Dzahn:

[operations/puppet@production] zuul: require python2.7

https://gerrit.wikimedia.org/r/1020329

Mentioned in SAL (#wikimedia-operations) [2024-04-16T19:47:44Z] <hashar@deploy1002> Started deploy [zuul/deploy@efce3ee]: Redeploy Zuul following host reimaging - T334517

Mentioned in SAL (#wikimedia-operations) [2024-04-16T19:47:52Z] <hashar@deploy1002> Finished deploy [zuul/deploy@efce3ee]: Redeploy Zuul following host reimaging - T334517 (duration: 00m 08s)

Change #1020344 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] contint: set new default docker version for bullseye

https://gerrit.wikimedia.org/r/1020344

Mentioned in SAL (#wikimedia-operations) [2024-04-16T20:30:21Z] <mutante> CI - jenkins and zuul-merger are re-enabled on contint1002 after distro upgrade to bullseye - T334517

kind of weird: contint1002 and contint2002 both have rsyncd running and rsync snippets once created by puppet but I can't find the code in puppet and there are references to old server contint2001 that are nowhere in the repo either.

it's almost like puppetized rsync existed and was removed and just the remnants weren't deleted by puppet.

data paths and sizes:

existing primary server:

root@contint2002:/# du -hs /var/lib/jenkins
2.2G	/var/lib/jenkins
root@contint2002:/# du -hs /var/lib/zuul/
6.0M	/var/lib/zuul/
root@contint2002:/# du -hs /srv/jenkins
291G	/srv/jenkins

reimaged server:

root@contint1002:/# du -hs /var/lib/jenkins
3.0M	/var/lib/jenkins
root@contint1002:/# du -hs /var/lib/zuul/
44K	/var/lib/zuul/
root@contint1002:/# du -hs /srv/jenkins
4.0K	/srv/jenkins

Mentioned in SAL (#wikimedia-operations) [2024-04-17T23:14:01Z] <mutante> rsyncing jenkins data from contint2002 to contint1002, pre-sync in preparation for migration next week - /srv/jenkins (291G) and much smaller zuul and jenkins data dirs T334517

Change #1020950 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci: disable zuul merger on contint2002 for migration

https://gerrit.wikimedia.org/r/1020950

Change #1020951 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] switch contint.wikimedia.org from contint2002 to contint1002

https://gerrit.wikimedia.org/r/1020951

Change #1020954 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci: switch contint manager_host from 2002 to 1002

https://gerrit.wikimedia.org/r/1020954

bugs in runbook:

  • there is no contint.discovery.wmnet name - it's contint.wikimedia.org
  • there is also a "gearman_server" setting which is an IP address hardcoded in Hiera and it's not mentioned in the runbook (profile::zuul::merger::conf)
  • doesn't mention rsync source and dest host settings in Hiera, which need to be switched with the failover
  • at the end, delete host-name based special settings like the docker version that are now globally the same again (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1020316)
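
On the hardcoded gearman_server IP: the address can be resolved from the host name at run time instead of being copied by hand. A sketch from the shell side only (how the Puppet code should do the lookup is a separate question); first_ipv4 is a hypothetical helper that parses getent ahostsv4 output.

```shell
#!/bin/sh
# Hypothetical helper: read `getent ahostsv4 NAME` output on stdin and
# print the address from the first line.
first_ipv4() {
    awk 'NR == 1 { print $1 }'
}
```

Example: getent ahostsv4 contint1002.wikimedia.org | first_ipv4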

Change #1020955 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci: switch gearman_server IP from contint2002 to contint1002

https://gerrit.wikimedia.org/r/1020955

Change #1020957 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci: switch source and destination server for data rsync

https://gerrit.wikimedia.org/r/1020957

Change #1020958 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci: avoid hardcoded IP in Hiera, lookup contint.wikimedia.org

https://gerrit.wikimedia.org/r/1020958

  • confirmed rsync commands are working and executed them as noted in the runbook above - /srv/jenkins , /var/lib/zuul and /var/lib/jenkins have been pre-synced to contint1002