One hour before (13:00 UTC)
- Add a scap lock on deploy1002.eqiad.wmnet: echo "Deployment lock for service switchover - T330651" > /var/lock/scap-global-lock
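The lock step above can be sketched as a pair of small shell helpers. This is only a minimal sketch, not part of the documented procedure: the function names set_deploy_lock and show_deploy_lock are hypothetical, and it assumes the lock is nothing more than a file whose content is the reason text, as the echo command suggests. The lock path is an argument so the helpers can be tried against a scratch file first.

```shell
#!/bin/sh
# Hypothetical helpers; only the lock path and the message come from the checklist.

# set_deploy_lock LOCK_FILE REASON
# Writes REASON into LOCK_FILE, refusing to overwrite an existing lock.
set_deploy_lock() {
    lock_file=$1
    reason=$2
    if [ -e "$lock_file" ]; then
        echo "lock already present: $(cat "$lock_file")" >&2
        return 1
    fi
    echo "$reason" > "$lock_file"
}

# show_deploy_lock LOCK_FILE
# Prints the lock reason, or returns non-zero if no lock is set.
show_deploy_lock() {
    lock_file=$1
    [ -e "$lock_file" ] || return 1
    cat "$lock_file"
}
```

On the deployment server this would be called as set_deploy_lock /var/lock/scap-global-lock "Deployment lock for service switchover - T330651", with whatever privileges writing to /var/lock requires.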
All services
- Run sudo cookbook sre.discovery.datacenter depool eqiad --all --reason "Datacenter Switchover" --task-id T330651
Deployment server
- Log to SAL: !log Switch deployment server - T330651
- Run sudo cumin 'R:class = role::deployment_server' 'disable-puppet "Switchover of the deployment server"'
- Merge https://gerrit.wikimedia.org/r/c/operations/dns/+/892372
- Merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/892373
- Run puppet on deploy2002.codfw.wmnet: sudo cumin deploy2002.codfw.wmnet 'run-puppet-agent --enable "Switchover of the deployment server"'
- Run puppet on all other deployment servers: sudo cumin 'R:class = role::deployment_server' 'run-puppet-agent --enable "Switchover of the deployment server"'
- Run puppet on alert*
- Cronjob check: sudo cumin deploy2002.codfw.wmnet 'systemctl list-units | grep -A1 sync_deployment_dir'
- Remove lock: sudo cumin deploy2002.codfw.wmnet 'rm -v /var/lock/scap-global-lock'
- Test scap deployment: cd /srv/mediawiki-staging; scap sync-file README "check the deployment server after switchover"
- Test that scap3 deployments work (restbase?)
- Test helmfile deployments
- Email ops@ about the switch
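Two of the checks in this list (the cronjob check and the lock removal) can be expressed as small predicates, which may be convenient when repeating the switchover. The following is a rough sketch under stated assumptions: lock_removed and has_sync_unit are hypothetical names, and the unit check only looks for the string sync_deployment_dir in the listing, without inspecting whether the unit is active.

```shell
#!/bin/sh
# Hypothetical post-switchover checks; names are illustrative only.

# lock_removed LOCK_FILE
# Succeeds only if LOCK_FILE no longer exists.
lock_removed() {
    [ ! -e "$1" ]
}

# has_sync_unit
# Reads `systemctl list-units` output on stdin and succeeds if a
# sync_deployment_dir unit appears in it. Does not check unit state.
has_sync_unit() {
    grep -q 'sync_deployment_dir'
}
```

Run on the new deployment server, this could look like: systemctl list-units | has_sync_unit && lock_removed /var/lock/scap-global-lock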
restbase-async
A week later (08 March 2023), restore restbase to its normal state:
- Run sudo cookbook sre.discovery.service-route --reason T330651 pool --wipe-cache eqiad restbase-async
- Run sudo cookbook sre.discovery.service-route --reason T330651 depool --wipe-cache codfw restbase-async
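Since the order matters here (pool eqiad before depooling codfw), the two steps above can be kept together in a helper that only prints the commands for review. This is a sketch: restbase_async_restore_cmds is a hypothetical name, the task ID is passed in as an argument, and the cookbook invocations are copied from the two steps above without adding any flags.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the commands, does not execute them.

# restbase_async_restore_cmds TASK
# Emits the pool command for eqiad first, then the depool command for codfw.
restbase_async_restore_cmds() {
    task=$1
    printf 'sudo cookbook sre.discovery.service-route --reason %s pool --wipe-cache eqiad restbase-async\n' "$task"
    printf 'sudo cookbook sre.discovery.service-route --reason %s depool --wipe-cache codfw restbase-async\n' "$task"
}
```

Calling restbase_async_restore_cmds T330651 prints both lines in the order they should be run; each can then be pasted and executed by hand.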