Following up on T387833 (Gerrit switchover process), we'll need to reimage gerrit2002 on Debian 12.
Description
Event Timeline
Change #1240689 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] gerrit: resume replication on gerrit-spare
Change #1240689 merged by Arnaudb:
[operations/puppet@production] gerrit: resume replication on gerrit-spare
Change #1242268 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/dns@master] gerrit: swap gerrit-replica and gerrit-spare
Change #1242269 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] gerrit: swap gerrit-spare and gerrit-replica
Change #1242272 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] gerrit: disable service on gerrit2002 to reimage
Change #1242275 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] gerrit: prepare replication resume for gerrit2002
Change #1242279 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] gerrit: resume replication on gerrit-spare
Change #1242268 merged by Arnaudb:
[operations/dns@master] gerrit: swap gerrit-replica and gerrit-spare
Change #1242269 merged by Arnaudb:
[operations/puppet@production] gerrit: swap gerrit-spare and gerrit-replica
Change #1242272 merged by Arnaudb:
[operations/puppet@production] gerrit: disable service on gerrit2002 to reimage
Change #1243119 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/dns@master] gerrit: fix discovery record
Change #1243119 merged by Arnaudb:
[operations/dns@master] gerrit: fix discovery record
Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1003 for host gerrit2002.wikimedia.org with OS bookworm
Change #1243131 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] gerrit: resume replication on gerrit-spare
Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1003 for host gerrit2002.wikimedia.org with OS bookworm completed:
- gerrit2002 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202602241354_arnaudb_766456_gerrit2002.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
Change #1242275 merged by Arnaudb:
[operations/puppet@production] gerrit: migrate gerrit2 system user to gerrit
Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1003 for host gerrit2002.wikimedia.org with OS bookworm
I've updated the relation chain here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1242275
We were putting gerrit2002 back in the role(gerrit) Puppet group too soon for the migration to happen properly.
There is one more step in that chain to handle that.
Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1003 for host gerrit2002.wikimedia.org with OS bookworm completed:
- gerrit2002 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202602241443_arnaudb_816182_gerrit2002.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
Change #1242279 merged by Arnaudb:
[operations/puppet@production] gerrit: install gerrit and sync-instances
Change #1243131 merged by Arnaudb:
[operations/puppet@production] gerrit: resume replication on gerrit-spare
Replication has been warmed up; @Dzahn will handle the last step:
sudo cookbook sre.gerrit.restart-gerrit -t T417247 --host gerrit2003
Gerrit replication has been configured, and the primary instance needs to be restarted to pick up the new config change.
gerrit2002 is now gerrit-spare (https://grafana.wikimedia.org/goto/Zoho3OOvg?orgId=1); replication is catching up.
Backups from gerrit2002 are failing with:
Could not stat "/var/lib/gerrit": ERR=No such file or directory
Same for gerrit2003.wikimedia.org-Hourly-Tue-ReposEqiad-gerrit-repo-data. Is this expected?
It would make sense if the error were "/var/lib/gerrit2: No such file or directory", since we got rid of the "2" in "gerrit2":
[gerrit2002:/var/lib] $ cd /var/lib/gerrit2
-bash: cd: /var/lib/gerrit2: No such file or directory
[gerrit2002:/var/lib] $ cd /var/lib/gerrit
[gerrit2002:/var/lib/gerrit] $
Our backup::set definition contains both the "gerrit" and "gerrit2" paths because both existed during a transitional period.
Same on gerrit2003: /var/lib/gerrit2 is gone and /var/lib/gerrit is present, the exact opposite of what that error message seems to say.
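To repeat the manual `cd` check above across hosts, a minimal sketch along these lines could report which of the file set's paths are actually present. The `CANDIDATE_PATHS` list and the `split_by_existence` helper are hypothetical names used for illustration; the authoritative list is the backup::set definition in operations/puppet.

```python
from pathlib import Path

# Hypothetical list mirroring the transitional file set described above;
# the real definition lives in operations/puppet and may differ.
CANDIDATE_PATHS = ["/var/lib/gerrit", "/var/lib/gerrit2"]


def split_by_existence(paths):
    """Return (present, missing) lists for the given filesystem paths."""
    present, missing = [], []
    for p in paths:
        # Path.exists() follows symlinks, like a plain stat() would.
        (present if Path(p).exists() else missing).append(p)
    return present, missing


if __name__ == "__main__":
    present, missing = split_by_existence(CANDIDATE_PATHS)
    print("present:", present)
    print("missing:", missing)
```

Run on the Gerrit host itself, this only shows what exists on disk; it does not tell which path the Bacula file set is really configured with.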
Change #1243183 had a related patch set uploaded (by Dzahn; author: Dzahn):
[operations/puppet@production] backup: adjust gerrit file set after renaming of gerrit2
Then that message may be misleading and not the cause of the issues, but the error is real:
Terminated Jobs:
 JobId  Level    Files      Bytes   Status   Finished        Name
====================================================================
684387  Incr          0         0   Error    24-Feb-26 14:03 gerrit2002.wikimedia.org-Hourly-Tue-ReposEqiad-gerrit-repo-data
This is the real error:
24-Feb 15:00 backup1014.eqiad.wmnet JobId 684390: Start Backup JobId 684390, Job=gerrit2002.wikimedia.org-Hourly-Tue-ReposEqiad-gerrit-repo-data.2026-02-24_15.00.00_52
24-Feb 15:00 backup1014.eqiad.wmnet JobId 684390: Using Device "FileStorageReposEqiad" to write.
24-Feb 15:03 backup1014.eqiad.wmnet JobId 684390: Fatal error: bsockcore.c:208 Unable to connect to Client: gerrit2002.wikimedia.org-fd on gerrit2002.wikimedia.org:9102. ERR=Interrupted system call
24-Feb 15:03 backup1014.eqiad.wmnet JobId 684390: Fatal error: No Job status returned from FD.
24-Feb 15:03 backup1014.eqiad.wmnet JobId 684390: Error: Bacula backup1014.eqiad.wmnet 9.6.7 (10Dec20):
  Build OS:               x86_64-pc-linux-gnu debian bookworm/sid
  JobId:                  684390
  Job:                    gerrit2002.wikimedia.org-Hourly-Tue-ReposEqiad-gerrit-repo-data.2026-02-24_15.00.00_52
  Backup Level:           Incremental, since=2026-02-24 13:00:03
  Client:                 "gerrit2002.wikimedia.org-fd" 9.6.7 (10Dec20) x86_64-pc-linux-gnu,debian,bullseye/sid
  FileSet:                "gerrit-repo-data" 2024-11-06 18:00:00
  Pool:                   "ReposEqiad" (From Job resource)
  Catalog:                "production" (From Client resource)
  Storage:                "backup1012-FileStorageReposEqiad" (From Pool resource)
  Scheduled time:         24-Feb-2026 15:00:00
  Start time:             24-Feb-2026 15:00:02
  End time:               24-Feb-2026 15:03:12
  Elapsed time:           3 mins 10 secs
  Priority:               10
  FD Files Written:       0
  SD Files Written:       0
  FD Bytes Written:       0 (0 B)
  SD Bytes Written:       0 (0 B)
  Rate:                   0.0 KB/s
  Software Compression:   None
  Comm Line Compression:  None
  Snapshot/VSS:           no
  Encryption:             no
  Accurate:               no
  Volume name(s):
  Volume Session Id:      1371
  Volume Session Time:    1770030267
  Last Volume Bytes:      300,830,071,106 (300.8 GB)
  Non-fatal FD errors:    1
  SD Errors:              0
  FD termination status:  Error
  SD termination status:  Waiting on FD
  Termination:            *** Backup Error ***
In that case, I would think it only happened because the backup attempted to connect while the host was being reimaged.
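One way to double-check that theory is to confirm that the file daemon port named in the error (gerrit2002.wikimedia.org:9102) accepts TCP connections again. The following is a minimal sketch only; `port_is_open` is a hypothetical helper, and it assumes it is run from a host that is allowed to reach that port.

```python
import socket


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # Host and port are taken from the Bacula error message above.
    print(port_is_open("gerrit2002.wikimedia.org", 9102))
```

A successful connect only shows that something is listening on the port. It does not prove that the Bacula director can authenticate to the file daemon, so the next scheduled job remains the real test.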
The current status of bacula-fd looks normal:
[gerrit2002:/var/lib/gerrit] $ systemctl status bacula-fd
● bacula-fd.service - Bacula File Daemon service
Loaded: loaded (/lib/systemd/system/bacula-fd.service; enabled; preset: enabled)
Active: active (running) since Tue 2026-02-24 15:05:47 UTC; 3h 18min ago
Change #1243257 had a related patch set uploaded (by Hashar; author: Hashar):
[operations/puppet@production] gerrit: update gerrit2002 after reimaging
Change #1243257 merged by Bking:
[operations/puppet@production] gerrit: update gerrit2002 after reimaging
The Gerrit replication to gerrit2002 (spare host) got broken following the reimaging. From the replication dashboard:
The latency has a gap and there was a high rate of retries:
I went to look at /var/log/gerrit/replication_log and, sure enough, jgit was complaining about accepting the host key. The key had not been updated in Puppet after the reimage :-]
Done by https://gerrit.wikimedia.org/r/c/operations/puppet/+/1243257 and reviewed/merged by Brian King.
Yes, they are ok now.
It got overloaded afterwards, so it was beyond the reimage time:
But things are ok now.
Change #1243183 merged by Dzahn:
[operations/puppet@production] backup: adjust gerrit file set after renaming of gerrit2

