Description
Following up on T387833 (Gerrit switchover process), we need to reimage gerrit1003 on Debian 12.
Details
| Status | Assigned | Task |
|---|---|---|
| Open | None | T407557 OpenSSH 10.1+ warns that Wikimedia SSH does not use post-quantum key exchange algorithm |
| Open | None | T407844 Gerrit ssh daemon does not offer post-quantum kex leading to a warning with OpenSSH 10 |
| Restricted Task | | |
| Open | None | T392448 Upgrade to Gerrit 3.12 |
| Open | None | T379714 Upgrade to Gerrit 3.11 |
| Open | None | T392465 Switch Gerrit from Java 17 to Java 21 |
| Open | None | T384595 Upgrade Collab hosts to Bookworm |
| Resolved | ABran-WMF | T392464 Upgrade Gerrit hosts from Bullseye to Bookworm |
| Open | None | T387831 Standardize failover procedures for Collab services |
| Resolved | None | T393239 ProbeDown |
| Resolved | ABran-WMF | T387833 Gerrit switchover process |
| Resolved | ABran-WMF | T417246 Reimage gerrit1003 |
| Resolved | dancy | T417767 scap error on Gerrit first setup |
| Resolved | ABran-WMF | T417777 SystemdUnitFailed - apache2.service on gerrit1003:9100 |
| Duplicate | None | T417968 ProbeDown |
| Duplicate | ABran-WMF | T417966 GerritHAProxyBackendUnavailable |
Event Timeline
Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1003 for host gerrit1003.wikimedia.org with OS bookworm
Change #1240219 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] gerrit: change daemon user for gerrit1003
Change #1240219 merged by Arnaudb:
[operations/puppet@production] gerrit: change daemon user for gerrit1003
I missed a spot in puppet:
$ grep gerrit /etc/passwd
27:gerrit2:x:925:925::/var/lib/gerrit2:/bin/bash
Change #1240230 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] gerrit: change system user for gerrit1003
Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1003 for host gerrit1003.wikimedia.org with OS bookworm executed with errors:
- gerrit1003 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202602180953_arnaudb_2362553_gerrit1003.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console gerrit1003.wikimedia.org" to get a root shell, but depending on the failure this may not work.
Change #1240230 merged by Arnaudb:
[operations/puppet@production] gerrit: change system user for gerrit1003
Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1003 for host gerrit1003.wikimedia.org with OS bookworm
Missed spot cleaned up:
arnaudb@gerrit1003:~ $ grep gerrit /etc/passwd
27:gerrit:x:925:925::/var/lib/gerrit:/bin/bash
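The same check can be scripted. The following is a minimal sketch, assuming the expected account is `gerrit` with home `/var/lib/gerrit`, as in the output above. The function names `parse_passwd_line` and `is_expected_account` are hypothetical, and a leading `N:` prefix such as the `27:` shown above is stripped before parsing.

```python
import re


def parse_passwd_line(line: str) -> dict:
    """Parse one passwd(5) entry, tolerating a leading 'N:' line-number prefix."""
    line = re.sub(r"^\d+:", "", line.strip())
    # passwd(5) entries have seven colon-separated fields.
    name, _password, uid, gid, _gecos, home, shell = line.split(":")
    return {"name": name, "uid": int(uid), "gid": int(gid), "home": home, "shell": shell}


def is_expected_account(entry: dict, name: str = "gerrit", home: str = "/var/lib/gerrit") -> bool:
    """True if the entry uses the expected (assumed) account name and home directory."""
    return entry["name"] == name and entry["home"] == home


if __name__ == "__main__":
    before = parse_passwd_line("27:gerrit2:x:925:925::/var/lib/gerrit2:/bin/bash")
    after = parse_passwd_line("27:gerrit:x:925:925::/var/lib/gerrit:/bin/bash")
    print(is_expected_account(before), is_expected_account(after))
```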
Mentioned in SAL (#wikimedia-operations) [2026-02-18T13:24:56Z] <arnaudb@cumin1003> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on gerrit1003.wikimedia.org with reason: T417246
I would not expect a reimage to work while the Puppet role is applied. I recommend moving the host to the "insetup" role, reimaging, and then applying the Gerrit prod role again.
Errors like the scap error are expected. Making it work is one thing, but trying to "make it work on the first Puppet run" is another level. That would take a lot of effort, and at the end of the day it only matters once every few years.
Change #1240306 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] gerrit: move gerrit1003 to insetup
Change #1240306 merged by Arnaudb:
[operations/puppet@production] gerrit: move gerrit1003 to insetup
Change #1240314 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] gerrit: move gerrit1003 to insetup
Change #1240314 merged by Arnaudb:
[operations/puppet@production] gerrit: move gerrit1003 to insetup
Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1003 for host gerrit1003.wikimedia.org with OS bookworm executed with errors:
- gerrit1003 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202602181119_arnaudb_2374336_gerrit1003.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console gerrit1003.wikimedia.org" to get a root shell, but depending on the failure this may not work.
Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1003 for host gerrit1003.wikimedia.org with OS bookworm
Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1003 for host gerrit1003.wikimedia.org with OS bookworm completed:
- gerrit1003 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202602181634_arnaudb_2665920_gerrit1003.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
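Comparing the step lists of the failed and successful runs shows where the failures stopped: both failed summaries end at "Rebooted", while the successful run continues with "Automatic Puppet run was successful". The following is a minimal sketch of that comparison, using abbreviated step lists copied from the summaries above. The helper name `first_missing_step` is hypothetical and is not part of the cookbook tooling.

```python
def first_missing_step(failed_steps, passing_steps):
    """Return the first step of the passing run that the failed run never reached."""
    reached = set(failed_steps)
    for step in passing_steps:
        if step not in reached:
            return step
    return None


# Abbreviated tails of the step lists from the cookbook summaries.
FAILED_RUN = [
    "Downtimed the new host on Icinga/Alertmanager",
    "Removed previous downtime on Alertmanager (old OS)",
    "Rebooted",
]
PASSING_RUN = FAILED_RUN + [
    "Automatic Puppet run was successful",
    "Forced a re-check of all Icinga services for the host",
    "Icinga status is optimal",
]

if __name__ == "__main__":
    print(first_missing_step(FAILED_RUN, PASSING_RUN))
```

This only narrows down where to look; the cookbook log file named in each summary remains the place to find the actual failure.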
Change #1240471 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] gerrit: change gerrit1003 role
Change #1240471 merged by Arnaudb:
[operations/puppet@production] gerrit: change gerrit1003 role
Change #1240477 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] java: update puppet certificate
Change #1240648 had a related patch set uploaded (by Jcrespo; author: Jcrespo):
[operations/puppet@production] backup: Temporarily ignore backup job failures from gerrit1003
Change #1240648 merged by Jcrespo:
[operations/puppet@production] backup: Temporarily ignore backup job failures from gerrit1003
Change #1240477 merged by Arnaudb:
[operations/puppet@production] java: update puppet certificate
Change #1240689 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] gerrit: resume replication on gerrit-spare
Change #1240689 merged by Arnaudb:
[operations/puppet@production] gerrit: resume replication on gerrit-spare
Once gerrit1003 was repooled, replication_log reported the following error:
[2026-02-20 07:38:34,958] Cannot replicate to gerrit@gerrit1003.wikimedia.org:/srv/gerrit/git/design/codex-php.git [CONTEXT pushOneId="b439c1e0" ] org.eclipse.jgit.errors.TransportException: gerrit@gerrit1003.wikimedia.org:/srv/gerrit/git/design/codex-php.git: [ssh-connection]: Failed (UnsupportedCredentialItem) to execute: ssh://gerrit@gerrit1003.wikimedia.org:22: org.eclipse.jgit.transport.CredentialItem$YesNoType:Accept this key and continue connecting all the same?
Change #1240846 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] gerrit: gerrit-spare lfs-sync enable
After an ad-hoc fix:
gerrit@gerrit2003:~$ diff ~/.ssh/known_hosts ~/.ssh/known_hosts.bak
3a4
> gerrit1003.wikimedia.org ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDbcAskAyVZGB+o358NsGwsiiSc/Dd2nZm1pPNAXOFpXkiOSQ7K6eRzThJRV82VUE/ypNXAGhgwIpW3HOxpgb03FUTetZHKseA2Q3oYJ1MzVLCj+C9QUWPRA5FsT3R8f/1+fDNL00v+X3unFmQ2huxSdeL4jKHpO9GxMSF50cx6qseJGmdv+Ry75vbi+CNte9kpsUbllknBGNH4/yhxXjIypPiZQWPT1tq/YZsCtvO4zn+4NXk7016jeZTEn/SLXiqeNDiTD01HNkb8S+8EZym/hD0tUTjW8i7xYcqPBIKEekGJvdbi8nFl41Vy/jpXZkjgPUL1y9124fGSS8jyMhC/
5,7d5
< gerrit1003.wikimedia.org ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCKt0oPPdZcPbx/aTRCBbBEtSUuXXU1Vgbq7tOcZlSWhcqEJ7SDrXl8lduYh+7ivcQdwYo/NszXv0jyO6n+mObd6Wu3/Pzkw5/aB9xnWmmNut+xDn21MqyaGuJ6kJjJsrvm4VG+2GJ5DiTigwSDMLwRoKUNCPAV+RITM7tCIAb/7Kf1PkCPQDAP+Zhvbe0ISEojJ4/Sw/aMDCyeP42R+EIknaFg610DP0sN/Imy2LmMJf+o8xzWg48NWnWjOMRE9q8ktxiawXNksZaeDCMJdnrHNKLWl3vgSqVrIqspIP45OuVuN/3NSrRZ4KbR3ctB87pjU/64CySXOAFbG7c8uHmL
< gerrit1003.wikimedia.org ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBFIw3OVtnjX1xBTN7JWXxbr79TJuopjxQ1vtca1uqTP5Qjjphr5SHxk70OUVfmnEWj+cfvLp/e82XUj/2vsUv9c=
< gerrit1003.wikimedia.org ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFHUgsuuPPnGS/9zs9UzLxput0iASwIt8/x9TZDj/4Zf
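Assuming `known_hosts.bak` is the pre-fix copy, the diff shows the old `ssh-rsa` entry for gerrit1003 replaced by three new entries, which suggests the replication error came from a host key that changed with the reimage. This is an inference from the diff and the "fix gerrit1003 ssh fingerprint" patch title, not something stated in the task. The following is a minimal sketch of detecting such stale entries, assuming plain (unhashed, marker-free) `known_hosts` lines of the form `host keytype key`. The function names and the placeholder key strings are hypothetical.

```python
def parse_known_hosts(text: str) -> dict:
    """Map (host, keytype) -> key for plain, unhashed known_hosts lines."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 3:
            continue  # not a 'host keytype key' line
        hosts, keytype, key = fields[0], fields[1], fields[2]
        for host in hosts.split(","):  # one line may list several host names
            entries[(host, keytype)] = key
    return entries


def stale_key_types(known_text: str, current_text: str, host: str) -> list:
    """Key types recorded for `host` whose key differs from (or is absent in) the current keys."""
    known = parse_known_hosts(known_text)
    current = parse_known_hosts(current_text)
    return sorted(
        keytype
        for (entry_host, keytype), key in known.items()
        if entry_host == host and current.get((entry_host, keytype)) != key
    )


if __name__ == "__main__":
    # Placeholder keys, not the real ones from the diff above.
    known = "gerrit1003.wikimedia.org ssh-rsa OLDKEY\nother.example.org ssh-ed25519 SAMEKEY\n"
    current = (
        "gerrit1003.wikimedia.org ssh-rsa NEWKEY\n"
        "gerrit1003.wikimedia.org ssh-ed25519 NEWED\n"
        "other.example.org ssh-ed25519 SAMEKEY\n"
    )
    print(stale_key_types(known, current, "gerrit1003.wikimedia.org"))
```

The `current_text` input uses the same line format that `ssh-keyscan` prints, but the current keys should come from a trusted source rather than an unauthenticated scan.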
Change #1240848 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] gerrit: fix gerrit1003 ssh fingerprint
Change #1240846 merged by Arnaudb:
[operations/puppet@production] gerrit: gerrit-spare lfs-sync enable
Change #1240848 merged by Arnaudb:
[operations/puppet@production] gerrit: fix gerrit1003 ssh fingerprint