The CI servers (Puppet role(ci)) are running Debian Buster and must be upgraded to Bullseye. As of April 2024 the hosts are:
- contint2002.wikimedia.org | Zuul, Zuul merger, Jenkins, Jenkins agent
- contint1002.wikimedia.org | Zuul merger, Jenkins agent
The upgrade has prerequisites. Notably, Zuul requires python2.7, which is no longer officially supported on Bullseye and is only included for the purpose of building Chromium.
Runbook
The reimaging of the two hosts is done in three phases:
- Reimage contint1002
- Switch over services from contint1002 to contint2002
- Reimage contint2002
1) Reimage contint1002.wikimedia.org
We need to bring down the two services on the host, reimage it and bring back the two services. The host is:
- attached as a Jenkins agent to the Jenkins controller which runs on the other host
- running the secondary zuul-merger daemon (and its companion git-daemon)
Disable services
- Disable the zuul-merger on contint1002 by setting profile::zuul::merger::enable: false. That should stop and mask the service. There is another Zuul merger system running on contint2002.wikimedia.org.
- Disable the Jenkins agent https://integration.wikimedia.org/ci/computer/contint1002/
- Run the host down cookbook to disable monitoring and alarms
- Set the correct docker version for just the host to be reimaged (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1020316)
Reimage
- Reimage contint1002 to Bullseye. Data in /srv can be wiped; it is only used for caching (git repos, docker images and build layers).
- While the cookbook is still running, but once the host is back up and ssh access has been restored, manually run "sudo a2dismod mpm_event" and run Puppet again. The cookbook should then detect a successful Puppet run and finish cleanly.
Enable services
After the host is back and provisioned, verify:
- /srv is a standalone partition!
- Docker daemon is started.
- Zuul has been deployed (not by Puppet): /srv/deployment/zuul/venv/bin/zuul-merger. - FAILED
- git-daemon is up (systemctl status git-daemon).
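The verification list above could be scripted roughly as follows. This is a sketch rather than part of the original procedure: the `docker` unit name is an assumption, and `run_check` and `post_reimage_checks` are hypothetical helper names. The zuul-merger path and the git-daemon unit are taken from the list itself.

```shell
# Post-reimage sanity checks for a CI host (sketch; run on the reimaged host).

# run_check LABEL COMMAND...: run COMMAND quietly and print OK/FAIL with LABEL.
run_check() {
    check_label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        printf 'OK   %s\n' "$check_label"
    else
        printf 'FAIL %s\n' "$check_label"
        return 1
    fi
}

# post_reimage_checks: run the four checks; return non-zero if any of them fails.
post_reimage_checks() {
    checks_failed=0
    run_check "/srv is a standalone partition" mountpoint -q /srv || checks_failed=1
    # Assumption: the Docker daemon runs as the "docker" systemd unit.
    run_check "Docker daemon is started" systemctl is-active --quiet docker || checks_failed=1
    run_check "zuul-merger is deployed" test -x /srv/deployment/zuul/venv/bin/zuul-merger || checks_failed=1
    run_check "git-daemon is up" systemctl is-active --quiet git-daemon || checks_failed=1
    return "$checks_failed"
}
```

Calling `post_reimage_checks` on the host prints one line per check, so a failing item is visible before the services are re-enabled.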
Enable the services:
- Enable the Jenkins agent via https://integration.wikimedia.org/ci/computer/contint1002/ (the ssh host key needs to be verified again, since reimaging changes the host key).
- Set profile::zuul::merger::enable: true. Running Puppet will unmask and start the service. It logs to /var/log/zuul/merger.log.
2) Switch over services
Before reimaging contint2002, its services need to be moved to the reimaged contint1002.
Before the maintenance
- Disable the zuul-merger on contint2002 by setting profile::zuul::merger::enable: false. That should stop and mask the service. There is another Zuul merger system running on contint1002.wikimedia.org.
- Clean up some of the Jenkins artifacts to reduce the amount of data that will be transferred
Rsync data and states
Synchronize data and states to pre-warm the other host:
- sudo rsync -ap --whole-file --delete-delay --info=progress2 /srv/jenkins/ rsync://contint1002.wikimedia.org/ci--srv-jenkins-
- sudo rsync -ap --whole-file --delete-delay --info=progress2 /var/lib/jenkins/ rsync://contint1002.wikimedia.org/ci--var-lib-jenkins-
- sudo rsync -ap --whole-file --delete-delay --info=progress2 /var/lib/zuul/ rsync://contint1002.wikimedia.org/ci--var-lib-zuul-
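Since the same three transfers are run twice (once to pre-warm, once after the services are stopped), they could be wrapped in a small helper. The following is a sketch: `sync_all` and `TARGET_HOST` are hypothetical names, while the rsync flags, paths and module names are copied from the commands above. By default it only prints the commands.

```shell
# Sketch: run (or just print) the three rsync transfers from one list.

TARGET_HOST="contint1002.wikimedia.org"

# sync_all [RUNNER]: RUNNER defaults to "echo", which only prints the commands.
# Pass "sudo" to actually execute them.
sync_all() {
    sync_runner=${1:-echo}
    while read -r sync_src sync_module; do
        # </dev/null keeps the command from consuming the list fed to the loop.
        "$sync_runner" rsync -ap --whole-file --delete-delay --info=progress2 \
            "$sync_src" "rsync://${TARGET_HOST}/${sync_module}" </dev/null || return 1
    done <<EOF
/srv/jenkins/ ci--srv-jenkins-
/var/lib/jenkins/ ci--var-lib-jenkins-
/var/lib/zuul/ ci--var-lib-zuul-
EOF
}
```

Running `sync_all` with no argument is a way to review the commands first; `sync_all sudo` would then perform the transfers and stop at the first failure.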
Switch over
- Downtime both contint2002 and contint1002
- Disable Puppet
- Stop the services: sudo systemctl stop jenkins and sudo systemctl stop zuul
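Because the final sync is only consistent if nothing is still writing to the data directories, it may be worth confirming that both units are really stopped before continuing. A minimal sketch, assuming the unit names `jenkins` and `zuul` from the stop commands above; `unit_is_active` and `all_stopped` are hypothetical helpers.

```shell
# unit_is_active UNIT: true when systemd reports UNIT as active.
unit_is_active() {
    systemctl is-active --quiet "$1"
}

# all_stopped CHECK UNIT...: succeed only if CHECK fails for every UNIT,
# i.e. none of the units is reported as running.
all_stopped() {
    stopped_check=$1
    shift
    for stopped_unit in "$@"; do
        if "$stopped_check" "$stopped_unit"; then
            echo "still running: $stopped_unit" >&2
            return 1
        fi
    done
    return 0
}
```

On the host this would be used as `all_stopped unit_is_active jenkins zuul && echo "safe to resync"`. Passing the check as an argument keeps the helper usable with a stub when trying it out away from the CI hosts.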
Rsync data and states
Now that services are stopped, resynchronize all artifacts and states:
- sudo rsync -ap --whole-file --delete-delay --info=progress2 /srv/jenkins/ rsync://contint1002.wikimedia.org/ci--srv-jenkins-
- sudo rsync -ap --whole-file --delete-delay --info=progress2 /var/lib/jenkins/ rsync://contint1002.wikimedia.org/ci--var-lib-jenkins-
- sudo rsync -ap --whole-file --delete-delay --info=progress2 /var/lib/zuul/ rsync://contint1002.wikimedia.org/ci--var-lib-zuul-
Change DNS
- Change contint.wikimedia.org CNAME from contint2002.wikimedia.org to contint1002.wikimedia.org
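Once the DNS change is deployed, the CNAME can be checked from any host with `dig` installed. The sketch below is an assumption about how one might do that; `cname_points_to` is a hypothetical helper, and cached answers may still show the old target until the record's TTL expires.

```shell
# cname_points_to ANSWER EXPECTED: compare a DNS answer with the expected
# target, ignoring the trailing dot that dig prints for fully qualified names.
cname_points_to() {
    [ "${1%.}" = "${2%.}" ]
}

# Example use (requires dig and network access, so left commented out here):
# answer=$(dig +short contint.wikimedia.org CNAME)
# cname_points_to "$answer" contint1002.wikimedia.org && echo "CNAME switched"
```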
Change primary host in Puppet/Hiera/CI config
- profile::ci::manager_host: contint1002.wikimedia.org
- In profile::zuul::merger::conf change gearman_server to the IP of contint1002.wikimedia.org: 208.80.153.39
- Run Puppet on contint1002 to point the zuul-merger to the new host
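After the Puppet run, one way to confirm the merger now points at the new Gearman server is to read the value back from the rendered configuration. This is a sketch under assumptions: it expects an INI-style file with a `[gearman]` section containing a `server` key (the Zuul 2.x layout), the file path must be supplied by the operator, and `gearman_server_of` is a hypothetical helper.

```shell
# gearman_server_of FILE: print the "server" value from the [gearman] section
# of an INI-style Zuul configuration file.
gearman_server_of() {
    awk -F '[ \t]*=[ \t]*' '
        /^\[/ { in_gearman = ($0 == "[gearman]") }   # track the current section
        in_gearman && $1 == "server" { print $2; exit }
    ' "$1"
}

# Example use (the path is an assumption; adjust to the file Puppet manages):
# [ "$(gearman_server_of /etc/zuul/zuul-merger.conf)" = "208.80.153.39" ] \
#     && echo "merger points at contint1002"
```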
Start services
- Update the Zuul config by running ./fab deploy_zuul from integration/config
- Enable and run Puppet on contint1002 which should bring up both Jenkins and Zuul
Verify:
- Jenkins
- Zuul
- https://integration.wikimedia.org/
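A quick HTTP probe can complement the manual checks above. The following is a sketch: the helper names are hypothetical, and the `/zuul/` path in the usage note is an assumption (only `/` and `/ci/` appear in this runbook).

```shell
# http_status URL: print the HTTP status code for URL (000 if the request fails).
http_status() {
    curl --silent --output /dev/null --max-time 10 --write-out '%{http_code}' "$1"
}

# is_success CODE: true for 2xx and 3xx status codes.
is_success() {
    case $1 in
        2[0-9][0-9]|3[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# check_urls URL...: print OK or FAIL for each URL.
check_urls() {
    for check_url in "$@"; do
        if is_success "$(http_status "$check_url")"; then
            echo "OK   $check_url"
        else
            echo "FAIL $check_url"
        fi
    done
}
```

For example: `check_urls https://integration.wikimedia.org/ https://integration.wikimedia.org/ci/ https://integration.wikimedia.org/zuul/`. A successful status code only shows that the web frontends respond; it does not replace checking that Zuul is actually scheduling jobs and that Jenkins agents are connected.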
3) Reimage contint2002.wikimedia.org
TODO: copy the checklist from section 1 (contint1002.wikimedia.org) here, adapted for contint2002.