The CI servers (Puppet {nav role(ci)}) are running Debian Buster and must be upgraded to Bullseye. As of April 2024 the hosts are:
| contint2002.wikimedia.org | Zuul, Zuul merger, Jenkins, Jenkins agent
| contint1002.wikimedia.org | Zuul merger, Jenkins agent
{icon check color=green} The upgrade has prerequisites. Notably, Zuul requires Python 2.7, which is no longer officially supported on Bullseye and is only included there for the purpose of building Chromium.
= Runbook =
The reimaging of the two hosts is done in three phases:
# Reimage contint1002
# Switch over services from contint1002 to contint2002
# Reimage contint2002
== 1) reimage contint1002.wikimedia.org ==
We need to bring down the two services on the host, reimage it and bring back the two services. The host is:
* attached as a Jenkins agent to the Jenkins controller which runs on the other host
* running the secondary zuul-merger daemon (and its companion git-daemon)
=== Disable services ===
[x] Disable the zuul-merger on contint1002 by setting `profile::zuul::merger::enable: false`. That should stop and mask the service. There is another Zuul merger system running on contint2002.wikimedia.org.
[x] Disable the Jenkins agent https://integration.wikimedia.org/ci/computer/contint1002/
[x] Run the host down cookbook to disable monitoring and alarms
[x] Set the correct docker version for just the host to be reimaged (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1020316)
=== Reimage ===
[x] Reimage contint1002 to Bullseye. Data in `/srv` can be wiped; it is only used for caching (git repos, Docker images and build layers).
[x] While the reimage cookbook is still running, once the host is back up and SSH access has been restored, manually run `sudo a2dismod mpm_event` and run Puppet again. The cookbook should then detect a successful Puppet run and finish cleanly.
=== Enable services ===
After host is back and provisioned, verify:
- [ ] `/srv` is a standalone partition!
- [ ] Docker daemon is started.
- [ ] Zuul has been deployed by Puppet: `/srv/deployment/zuul/venv/bin/zuul-merger`.
- [ ] git-daemon is up (`systemctl status git-daemon`).
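The four checks above could be scripted roughly as follows. This is a minimal sketch to run on the reimaged host, not an existing tool: the function names `check` and `verify_reimage` are made up for illustration, and the `docker` unit name is an assumption (the checklist only says "Docker daemon").

```shell
#!/bin/sh
# Post-reimage checks for contint1002 (sketch; run on the host itself).
# Prints one line per check; verify_reimage returns non-zero if any check fails.

check() {
    # $1 is a human-readable label, the remaining arguments are the command to run.
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "OK   $label"
    else
        echo "FAIL $label"
        return 1
    fi
}

verify_reimage() {
    rc=0
    check "/srv is a standalone partition" mountpoint -q /srv || rc=1
    # ASSUMPTION: the Docker daemon runs as the systemd unit "docker".
    check "Docker daemon is started" systemctl is-active --quiet docker || rc=1
    check "zuul-merger is deployed" test -x /srv/deployment/zuul/venv/bin/zuul-merger || rc=1
    check "git-daemon is up" systemctl is-active --quiet git-daemon || rc=1
    return "$rc"
}
```

Calling `verify_reimage` prints an `OK` or `FAIL` line per item, so the output can be pasted into the task next to the checklist.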
Enable the services:
[ ] Enable the Jenkins agent via https://integration.wikimedia.org/ci/computer/contint1002/ and verify the SSH host key again, since reimaging changes the host key.
[ ] Set `profile::zuul::merger::enable: true`. Running Puppet will unmask it and start the service. It logs in `/var/log/zuul/merger.log`.
== 2) Switch over services ==
Before reimaging contint2002, its services need to be moved to the reimaged contint1002.
=== Before the maintenance ===
[ ] Disable the zuul-merger on contint2002 by setting `profile::zuul::merger::enable: false`. That should stop and mask the service. There is another Zuul merger system running on contint1002.wikimedia.org.
[ ] Clean up some of the Jenkins artifacts to reduce the amount of data that will be transferred
==== Rsync data and states ====
Synchronize data and state to pre-warm the other host:
[ ] `sudo rsync -ap --whole-file --delete-delay --info=progress2 /srv/jenkins/ rsync://contint1002.wikimedia.org/ci--srv-jenkins-`
[ ] `sudo rsync -ap --whole-file --delete-delay --info=progress2 /var/lib/jenkins/ rsync://contint1002.wikimedia.org/ci--var-lib-jenkins-`
[ ] `sudo rsync -ap --whole-file --delete-delay --info=progress2 /var/lib/zuul/ rsync://contint1002.wikimedia.org/ci--var-lib-zuul-`
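Since the same three commands are run again after the services are stopped, a small dry-run helper may help avoid copy-paste mistakes. The sketch below only prints the commands. The helper names are hypothetical, and the module naming rule (`ci-` followed by the path with `/` replaced by `-`) is inferred from the three commands in this runbook rather than taken from the rsync server configuration.

```shell
#!/bin/sh
# Dry-run helper: prints the rsync commands for the three state directories.
# ASSUMPTION: the rsync module name is "ci-" plus the path with "/" replaced
# by "-". This matches the commands in this runbook but is not a documented rule.

DEST_HOST="contint1002.wikimedia.org"

rsync_module() {
    # Example: /srv/jenkins/ becomes ci--srv-jenkins-
    printf 'ci-%s\n' "$(printf '%s' "$1" | tr '/' '-')"
}

sync_command() {
    # Build the full command line for one source directory.
    printf 'sudo rsync -ap --whole-file --delete-delay --info=progress2 %s rsync://%s/%s\n' \
        "$1" "$DEST_HOST" "$(rsync_module "$1")"
}

print_sync_commands() {
    for dir in /srv/jenkins/ /var/lib/jenkins/ /var/lib/zuul/; do
        sync_command "$dir"
    done
}

print_sync_commands
```

Review the printed lines before running them; to execute them directly, piping the output to `sh -x` would be one option.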
=== Switch over ===
[ ] Downtime both contint2002 and contint1002
[ ] Disable Puppet
[ ] Stop the services `sudo systemctl stop jenkins` and `sudo systemctl stop zuul`
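Before the final rsync it is worth confirming that both services are really down, since files changing during the copy would leave the two hosts out of sync. A minimal sketch, where `stop_and_confirm` is a made-up helper name and the unit names are the ones used in the step above:

```shell
#!/bin/sh
# Stop each given systemd unit, then confirm it is no longer active.

stop_and_confirm() {
    for unit in "$@"; do
        sudo systemctl stop "$unit" || return 1
        # "is-active --quiet" exits 0 only while the unit is still active.
        if systemctl is-active --quiet "$unit"; then
            echo "$unit is still active" >&2
            return 1
        fi
        echo "$unit stopped"
    done
}

# Usage on contint2002:
#   stop_and_confirm jenkins zuul
```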
==== Rsync data and states ====
Now that services are stopped, resynchronize all artifacts and states:
[ ] `sudo rsync -ap --whole-file --delete-delay --info=progress2 /srv/jenkins/ rsync://contint1002.wikimedia.org/ci--srv-jenkins-`
[ ] `sudo rsync -ap --whole-file --delete-delay --info=progress2 /var/lib/jenkins/ rsync://contint1002.wikimedia.org/ci--var-lib-jenkins-`
[ ] `sudo rsync -ap --whole-file --delete-delay --info=progress2 /var/lib/zuul/ rsync://contint1002.wikimedia.org/ci--var-lib-zuul-`
==== change DNS ====
[ ] Change `contint.discovery.wmnet` CNAME from `contint2002.wikimedia.org` to `contint1002.wikimedia.org`
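Once the DNS change is deployed, the record can be checked from a host inside the production network. A rough sketch using `dig`; the `cname_matches` helper is hypothetical, and resolvers may keep returning the old target until the record's TTL expires.

```shell
#!/bin/sh
# Check that a CNAME points at the expected target (sketch, relies on dig).

cname_matches() {
    # $1: record name, $2: expected target (without trailing dot)
    actual=$(dig +short "$1" CNAME | sed 's/\.$//')
    if [ "$actual" = "$2" ]; then
        echo "OK $1 -> $actual"
    else
        echo "MISMATCH $1 -> ${actual:-<no answer>} (expected $2)"
        return 1
    fi
}

# Usage:
#   cname_matches contint.discovery.wmnet contint1002.wikimedia.org
```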
==== change primary host in Puppet/Hiera/CI config ====
[ ] `profile::ci::manager_host: contint1002.wikimedia.org`
[ ] In `profile::zuul::merger::conf` change `gearman_server` to the IP of contint1002.wikimedia.org: `208.80.153.39`
[ ] Run Puppet on contint1002 to point the zuul-merger to the new host
==== Start services ====
[ ] Update the Zuul config from the integration/config repository: `./fab deploy_zuul`
[ ] Enable and run Puppet on contint1002 which should bring up both Jenkins and Zuul
Verify:
- [ ] Jenkins
- [ ] Zuul
- [ ] https://integration.wikimedia.org/
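A quick HTTP smoke test can complement the manual verification. The sketch below only checks that a URL answers with a 2xx or 3xx status; it says nothing about whether Jenkins or Zuul are actually processing jobs. The helper names are made up, and the `/ci/` URL is taken from the Jenkins agent link earlier in this runbook.

```shell
#!/bin/sh
# HTTP smoke test (sketch): reports whether a URL answers with 2xx or 3xx.

http_status() {
    # Prints only the HTTP status code; curl prints 000 when the request fails.
    curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}

check_url() {
    status=$(http_status "$1")
    case "$status" in
        2??|3??) echo "OK $1 ($status)" ;;
        *) echo "FAIL $1 ($status)"; return 1 ;;
    esac
}

# Usage:
#   check_url https://integration.wikimedia.org/
#   check_url https://integration.wikimedia.org/ci/
```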
== 3) reimage contint2002.wikimedia.org ==
//TODO: copy the {nav 1) reimage contint1002.wikimedia.org} checklist here.//
== References ==
* {T324659}