
Reimage gerrit1003
Closed, ResolvedPublic

Description

Following up on T387833 (Gerrit switchover process), we'll need to reimage gerrit1003 to Debian 12 (bookworm).

Event Timeline

ABran-WMF moved this task from Incoming to Backlog on the collaboration-services board.

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1003 for host gerrit1003.wikimedia.org with OS bookworm

Change #1240219 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: change daemon user for gerrit1003

https://gerrit.wikimedia.org/r/1240219

Change #1240219 merged by Arnaudb:

[operations/puppet@production] gerrit: change daemon user for gerrit1003

https://gerrit.wikimedia.org/r/1240219

I missed a spot in puppet:

$ grep gerrit /etc/passwd
27:gerrit2:x:925:925::/var/lib/gerrit2:/bin/bash

Change #1240230 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: change system user for gerrit1003

https://gerrit.wikimedia.org/r/1240230

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1003 for host gerrit1003.wikimedia.org with OS bookworm executed with errors:

  • gerrit1003 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202602180953_arnaudb_2362553_gerrit1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console gerrit1003.wikimedia.org" to get a root shell, but depending on the failure this may not work.

Change #1240230 merged by Arnaudb:

[operations/puppet@production] gerrit: change system user for gerrit1003

https://gerrit.wikimedia.org/r/1240230

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1003 for host gerrit1003.wikimedia.org with OS bookworm

missed spot cleaned up:

arnaudb@gerrit1003:~ $ grep gerrit /etc/passwd
27:gerrit:x:925:925::/var/lib/gerrit:/bin/bash
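
For future reimages, the manual grep above could be turned into a small check. This is only a sketch: the function names are made up for illustration, and the expected user name and home directory are simply taken from the output above. The leading `27:` looks like a line number added by grep, so the parser tolerates it.

```python
from typing import NamedTuple, Optional


class PasswdEntry(NamedTuple):
    name: str
    uid: int
    gid: int
    home: str
    shell: str


def parse_passwd_line(line: str) -> Optional[PasswdEntry]:
    """Parse one passwd(5) line, tolerating a leading grep line number."""
    fields = line.strip().split(":")
    # A plain passwd line has 7 fields; drop a numeric prefix such as "27:".
    if len(fields) == 8 and fields[0].isdigit():
        fields = fields[1:]
    if len(fields) != 7:
        return None
    name, _password, uid, gid, _gecos, home, shell = fields
    return PasswdEntry(name, int(uid), int(gid), home, shell)


def daemon_user_ok(line: str,
                   expected_name: str = "gerrit",
                   expected_home: str = "/var/lib/gerrit") -> bool:
    """True if the line describes the expected daemon account (assumed values)."""
    entry = parse_passwd_line(line)
    return (entry is not None
            and entry.name == expected_name
            and entry.home == expected_home)


before = "27:gerrit2:x:925:925::/var/lib/gerrit2:/bin/bash"
after = "27:gerrit:x:925:925::/var/lib/gerrit:/bin/bash"
print(daemon_user_ok(before))  # False
print(daemon_user_ok(after))   # True
```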

Mentioned in SAL (#wikimedia-operations) [2026-02-18T13:24:56Z] <arnaudb@cumin1003> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on gerrit1003.wikimedia.org with reason: T417246

I would not expect a reimage to work while the production puppet role is applied. I recommend moving the host to the "insetup" role, reimaging, and then applying the gerrit prod role again.

Errors like the scap error are expected. Making the role work is one thing, but trying to "make it work on 1st puppet run" is another level. That would take a lot of effort, and at the end of the day it only matters once every few years.

Change #1240306 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: move gerrit1003 to insetup

https://gerrit.wikimedia.org/r/1240306

Change #1240306 merged by Arnaudb:

[operations/puppet@production] gerrit: move gerrit1003 to insetup

https://gerrit.wikimedia.org/r/1240306

Change #1240314 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: move gerrit1003 to insetup

https://gerrit.wikimedia.org/r/1240314

Change #1240314 merged by Arnaudb:

[operations/puppet@production] gerrit: move gerrit1003 to insetup

https://gerrit.wikimedia.org/r/1240314

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1003 for host gerrit1003.wikimedia.org with OS bookworm executed with errors:

  • gerrit1003 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202602181119_arnaudb_2374336_gerrit1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console gerrit1003.wikimedia.org" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1003 for host gerrit1003.wikimedia.org with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1003 for host gerrit1003.wikimedia.org with OS bookworm completed:

  • gerrit1003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202602181634_arnaudb_2665920_gerrit1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1240471 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: change gerrit1003 role

https://gerrit.wikimedia.org/r/1240471

Change #1240471 merged by Arnaudb:

[operations/puppet@production] gerrit: change gerrit1003 role

https://gerrit.wikimedia.org/r/1240471

Change #1240477 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] java: update puppet certificate

https://gerrit.wikimedia.org/r/1240477

Change #1240648 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] backup: Temporarily ignore backup job failures from gerrit1003

https://gerrit.wikimedia.org/r/1240648

Change #1240648 merged by Jcrespo:

[operations/puppet@production] backup: Temporarily ignore backup job failures from gerrit1003

https://gerrit.wikimedia.org/r/1240648

Change #1240477 merged by Arnaudb:

[operations/puppet@production] java: update puppet certificate

https://gerrit.wikimedia.org/r/1240477

Change #1240689 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: resume replication on gerrit-spare

https://gerrit.wikimedia.org/r/1240689

ABran-WMF changed the task status from Open to Stalled. (Feb 19 2026, 1:57 PM)
ABran-WMF moved this task from Work in Progress to Awaiting Input on the collaboration-services board.

moving to stalled until https://gerrit.wikimedia.org/r/c/1240689 is merged

Change #1240689 merged by Arnaudb:

[operations/puppet@production] gerrit: resume replication on gerrit-spare

https://gerrit.wikimedia.org/r/1240689

ABran-WMF changed the task status from Stalled to In Progress. (Feb 20 2026, 7:41 AM)

Once gerrit1003 was repooled, replication_log reported the following error:

[2026-02-20 07:38:34,958] Cannot replicate to gerrit@gerrit1003.wikimedia.org:/srv/gerrit/git/design/codex-php.git [CONTEXT pushOneId="b439c1e0" ]
org.eclipse.jgit.errors.TransportException: gerrit@gerrit1003.wikimedia.org:/srv/gerrit/git/design/codex-php.git: [ssh-connection]: Failed (UnsupportedCredentialItem) to execute: ssh://gerrit@gerrit1003.wikimedia.org:22: org.eclipse.jgit.transport.CredentialItem$YesNoType:Accept this key and continue connecting all the same?
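
The `YesNoType` prompt in the trace suggests JGit was asked to confirm a host key interactively, which a daemon cannot answer. That would be consistent with the host key changing after the reimage. A rough sketch for spotting this class of failure in a replication log follows. The function name and regex are illustrative assumptions, and the sample is shortened from the lines above.

```python
import re
from typing import Set

# Matches the ssh:// URL that appears in the TransportException line.
SSH_URL_RE = re.compile(r"ssh://[^@\s]+@([^:/\s]+)")


def host_key_prompt_failures(log_text: str) -> Set[str]:
    """Return hosts whose replication failed on an unanswered yes/no prompt."""
    hosts = set()
    for line in log_text.splitlines():
        # Both markers appear together in the error quoted above.
        if "UnsupportedCredentialItem" in line and "YesNoType" in line:
            match = SSH_URL_RE.search(line)
            if match:
                hosts.add(match.group(1))
    return hosts


sample = """\
[2026-02-20 07:38:34,958] Cannot replicate to gerrit@gerrit1003.wikimedia.org:/srv/gerrit/git/design/codex-php.git [CONTEXT pushOneId="b439c1e0" ]
org.eclipse.jgit.errors.TransportException: gerrit@gerrit1003.wikimedia.org:/srv/gerrit/git/design/codex-php.git: [ssh-connection]: Failed (UnsupportedCredentialItem) to execute: ssh://gerrit@gerrit1003.wikimedia.org:22: org.eclipse.jgit.transport.CredentialItem$YesNoType:Accept this key and continue connecting all the same?
"""
print(sorted(host_key_prompt_failures(sample)))  # ['gerrit1003.wikimedia.org']
```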

Change #1240846 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: gerrit-spare lfs-sync enable

https://gerrit.wikimedia.org/r/1240846

After an ad-hoc fix:

gerrit@gerrit2003:~$ diff ~/.ssh/known_hosts ~/.ssh/known_hosts.bak
3a4
> gerrit1003.wikimedia.org ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDbcAskAyVZGB+o358NsGwsiiSc/Dd2nZm1pPNAXOFpXkiOSQ7K6eRzThJRV82VUE/ypNXAGhgwIpW3HOxpgb03FUTetZHKseA2Q3oYJ1MzVLCj+C9QUWPRA5FsT3R8f/1+fDNL00v+X3unFmQ2huxSdeL4jKHpO9GxMSF50cx6qseJGmdv+Ry75vbi+CNte9kpsUbllknBGNH4/yhxXjIypPiZQWPT1tq/YZsCtvO4zn+4NXk7016jeZTEn/SLXiqeNDiTD01HNkb8S+8EZym/hD0tUTjW8i7xYcqPBIKEekGJvdbi8nFl41Vy/jpXZkjgPUL1y9124fGSS8jyMhC/
5,7d5
< gerrit1003.wikimedia.org ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCKt0oPPdZcPbx/aTRCBbBEtSUuXXU1Vgbq7tOcZlSWhcqEJ7SDrXl8lduYh+7ivcQdwYo/NszXv0jyO6n+mObd6Wu3/Pzkw5/aB9xnWmmNut+xDn21MqyaGuJ6kJjJsrvm4VG+2GJ5DiTigwSDMLwRoKUNCPAV+RITM7tCIAb/7Kf1PkCPQDAP+Zhvbe0ISEojJ4/Sw/aMDCyeP42R+EIknaFg610DP0sN/Imy2LmMJf+o8xzWg48NWnWjOMRE9q8ktxiawXNksZaeDCMJdnrHNKLWl3vgSqVrIqspIP45OuVuN/3NSrRZ4KbR3ctB87pjU/64CySXOAFbG7c8uHmL
< gerrit1003.wikimedia.org ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBFIw3OVtnjX1xBTN7JWXxbr79TJuopjxQ1vtca1uqTP5Qjjphr5SHxk70OUVfmnEWj+cfvLp/e82XUj/2vsUv9c=
< gerrit1003.wikimedia.org ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFHUgsuuPPnGS/9zs9UzLxput0iASwIt8/x9TZDj/4Zf
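
Reading the diff, the ad-hoc fix appears to have dropped the single old ssh-rsa entry for gerrit1003 and added the new ssh-rsa, ecdsa and ed25519 keys. The same replacement can be sketched as below. This is an illustration only: the helper name is hypothetical, the key material is placeholder text, and hashed known_hosts entries are not handled.

```python
from typing import List


def replace_host_keys(known_hosts: str, host: str, new_entries: List[str]) -> str:
    """Drop every entry for `host` and append `new_entries` (plain-text hosts only)."""
    kept = []
    for line in known_hosts.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            fields = stripped.split()
            # Lines starting with a marker (e.g. @cert-authority) carry the
            # host list in the second field instead of the first.
            if fields[0].startswith("@") and len(fields) > 1:
                host_field = fields[1]
            else:
                host_field = fields[0]
            if host in host_field.split(","):
                continue  # stale entry for the reimaged host
        kept.append(line)
    kept.extend(new_entries)
    return "\n".join(kept) + "\n"


# Placeholder keys, not real key material.
current = (
    "other.example.org ssh-ed25519 KEEPKEY\n"
    "gerrit1003.wikimedia.org ssh-rsa OLDKEY\n"
)
updated = replace_host_keys(
    current,
    "gerrit1003.wikimedia.org",
    ["gerrit1003.wikimedia.org ssh-ed25519 NEWKEY"],
)
print(updated)
```

Since this file was later fixed through puppet (the ssh fingerprint change below), a one-off edit like this would only be a stopgap.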

Change #1240848 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: fix gerrit1003 ssh fingerprint

https://gerrit.wikimedia.org/r/1240848

Change #1240846 merged by Arnaudb:

[operations/puppet@production] gerrit: gerrit-spare lfs-sync enable

https://gerrit.wikimedia.org/r/1240846

Change #1240848 merged by Arnaudb:

[operations/puppet@production] gerrit: fix gerrit1003 ssh fingerprint

https://gerrit.wikimedia.org/r/1240848

This can be considered resolved: replication is running.