This task tracks the service implementation of the serviceops host contint2002, which is the primary CI server (Jenkins, Zuul, etc.). It is handled independently from contint1002, which is simpler (T313832).
Topic branch: https://gerrit.wikimedia.org/r/q/topic:contint2002
Changes for this task: https://gerrit.wikimedia.org/r/q/bug:T324659
migration checklist
ahead of maintenance
Jenkins:
- Allow contint2002 to ssh/port 22 and rsync/port 873 to integration instances via Horizon security group
- Allow contint2002 to ssh/port 22 to puppet-diffs
- Allow contint2002 to ssh/port 22 to deployment-prep
- Add contint2002 as a Jenkins agent and put it offline
Move zuul-merger from old to new host
- Merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/936266
- Run puppet on old host (contint2001) to disable zuul-merger
- Run puppet on new host (contint2002) to enable zuul-merger
- Verify service
- Result: zuul-merger is stopped on contint2001 and started on contint2002. This is reflected on the Zuul server at the Gearman level (zuul-gearman.py workers | grep merger), and CI builds are already using the new instance :)
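The Gearman-level check can be wrapped in a small helper so that it yields a pass/fail exit status instead of requiring a visual scan. This is only a sketch: the function name merger_registered_on is hypothetical, and it assumes that each worker is listed on its own line containing both the word "merger" and the host name. The actual output format of zuul-gearman.py workers may differ.

```shell
# Succeeds (exit 0) if stdin contains a merger worker line mentioning the host.
# Assumption: one worker per line, with "merger" and the host name on that line.
merger_registered_on() {
    host="$1"
    # First grep keeps merger lines, second one looks for the literal host name.
    grep merger | grep -q -F -- "$host"
}

# Intended use on the Zuul server (not executed here):
#   zuul-gearman.py workers | merger_registered_on contint2002 && echo "merger is on contint2002"
```

Running the same helper with contint2001 as the argument should fail once the old merger has been disabled.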
synchronize build artifacts
- sudo rsync -ap --whole-file --delete-delay --info=progress2 /srv/jenkins/ rsync://contint2002.wikimedia.org/ci--srv-jenkins-
- sudo rsync -ap --whole-file --delete-delay --info=progress2 /var/lib/jenkins/ rsync://contint2002.wikimedia.org/ci--var-lib-jenkins-
- sudo rsync -ap --whole-file --delete-delay --info=progress2 /var/lib/zuul/ rsync://contint2002.wikimedia.org/ci--var-lib-zuul-
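The three rsync invocations above differ only in the source directory and the rsync module name, so they can be generated from a single list, which also makes it easier to repeat the exact same sync later. A minimal sketch follows; SYNC_PAIRS and print_sync_commands are illustrative names rather than existing tooling, and the function only prints the commands instead of running them.

```shell
# Source directory and rsync module pairs, copied from the checklist.
SYNC_PAIRS='/srv/jenkins/ ci--srv-jenkins-
/var/lib/jenkins/ ci--var-lib-jenkins-
/var/lib/zuul/ ci--var-lib-zuul-'

# Print one rsync command per pair (dry run: nothing is executed).
# Optional first argument: destination host, defaulting to the new CI host.
print_sync_commands() {
    dest_host="${1:-contint2002.wikimedia.org}"
    printf '%s\n' "$SYNC_PAIRS" | while read -r src module; do
        printf 'sudo rsync -ap --whole-file --delete-delay --info=progress2 %s rsync://%s/%s\n' \
            "$src" "$dest_host" "$module"
    done
}

print_sync_commands
```

After reviewing the printed commands on contint2001, they could be executed by piping the output to sh.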
Stop all services
- downtime both hosts, contint2001 and contint2002:
- sudo cookbook sre.hosts.downtime -r "Switch contint hosts for hardware replacement" -t T324659 -M 60 contint2001.wikimedia.org
- sudo cookbook sre.hosts.downtime -r "Switch contint hosts for hardware replacement" -t T324659 -M 60 contint2002.wikimedia.org
- disable Puppet on both hosts: sudo disable-puppet "Switch contint hosts for hardware replacement - T324659"
- sudo systemctl stop jenkins and sudo systemctl stop zuul on contint2001
- Verify Jenkins and Zuul are stopped/masked on contint2002
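The stopped/masked verification can be sketched with systemctl is-active and systemctl is-enabled, which print a single state word each. The helper names is_safely_stopped and check_unit are hypothetical, and the unit names jenkins and zuul are taken from the stop commands above. Treating "inactive" or "failed" combined with "masked" or "disabled" as safe is an assumption that may need adjusting.

```shell
# Succeeds when a unit is not running and cannot be started by accident.
# $1: output of `systemctl is-active UNIT`, $2: output of `systemctl is-enabled UNIT`.
is_safely_stopped() {
    active_state="$1"
    enabled_state="$2"
    case "$active_state" in
        inactive|failed) ;;      # not running
        *) return 1 ;;           # active, activating, etc.
    esac
    case "$enabled_state" in
        masked|disabled) return 0 ;;
        *) return 1 ;;           # enabled, static, etc.
    esac
}

# Query systemd for one unit and report the result.
check_unit() {
    unit="$1"
    if is_safely_stopped "$(systemctl is-active "$unit")" "$(systemctl is-enabled "$unit")"; then
        echo "$unit: stopped"
    else
        echo "$unit: STILL ACTIVE OR ENABLED"
    fi
}

# Intended use on contint2002 (not executed here):
#   check_unit jenkins; check_unit zuul
```

The same check applies to contint2001 at the end of the switch, once Puppet has stopped and masked the services there.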
rsync data and states
Now that services are stopped, resynchronize all artifacts and states:
- sudo rsync -ap --whole-file --delete-delay --info=progress2 /srv/jenkins/ rsync://contint2002.wikimedia.org/ci--srv-jenkins-
- sudo rsync -ap --whole-file --delete-delay --info=progress2 /var/lib/jenkins/ rsync://contint2002.wikimedia.org/ci--var-lib-jenkins-
- sudo rsync -ap --whole-file --delete-delay --info=progress2 /var/lib/zuul/ rsync://contint2002.wikimedia.org/ci--var-lib-zuul-
change DNS
- merge https://gerrit.wikimedia.org/r/c/operations/dns/+/933196 and run authdns-update
change primary in Puppet / Hiera
- merge "ci/zuul: switch gearman server from contint2001 to contint2002"
- merge "ci/zuul: set contint2002 as the active ci::manager_host"
- Run Puppet on contint1002 to point the zuul-merger to the new host
Start services
- (step we missed) update the Zuul configuration in /etc/zuul/wikimedia: ./fab deploy_zuul
- enable Puppet on the new host: sudo enable-puppet "Switch contint hosts for hardware replacement - T324659"
- run Puppet agent on new host (which should apply config changes and bring up both Jenkins and Zuul)
- verify Zuul
- verify Jenkins
Reflect stopped services on old host
- enable Puppet again on the old host: sudo enable-puppet "Switch contint hosts for hardware replacement - T324659"
- run Puppet agent on old host (stop/disable/mask Jenkins, Zuul)
After maintenance
- bring the contint2002 Jenkins agent back online: https://integration.wikimedia.org/ci/computer/contint2002/
- merge "ci: make contint2002 the new rsync source, remove contint2001"
- remove contint2001 208.80.153.15 from WMCS security groups:
