
Reimage gerrit2002
Closed, Resolved, Public

Description

Following up on T387833 (Gerrit switchover process), we need to reimage gerrit2002 with Debian 12 (bookworm).

Event Timeline

ABran-WMF moved this task from Incoming to Backlog on the collaboration-services board.

Change #1240689 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: resume replication on gerrit-spare

https://gerrit.wikimedia.org/r/1240689

ABran-WMF changed the task status from Open to Stalled (Feb 19 2026, 1:53 PM).

Moving to Stalled until https://gerrit.wikimedia.org/r/c/1240689 is merged.

Change #1240689 merged by Arnaudb:

[operations/puppet@production] gerrit: resume replication on gerrit-spare

https://gerrit.wikimedia.org/r/1240689

ABran-WMF changed the task status from Stalled to Open (Feb 20 2026, 10:38 AM).
ABran-WMF added a parent task: Restricted Task (Feb 20 2026, 4:31 PM).

Change #1242268 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/dns@master] gerrit: swap gerrit-replica and gerrit-spare

https://gerrit.wikimedia.org/r/1242268

Change #1242269 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: swap gerrit-spare and gerrit-replica

https://gerrit.wikimedia.org/r/1242269

Change #1242272 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: disable service on gerrit2002 to reimage

https://gerrit.wikimedia.org/r/1242272

Change #1242275 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: prepare replication resume for gerrit2002

https://gerrit.wikimedia.org/r/1242275

Change #1242279 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: resume replication on gerrit-spare

https://gerrit.wikimedia.org/r/1242279

Change #1242268 merged by Arnaudb:

[operations/dns@master] gerrit: swap gerrit-replica and gerrit-spare

https://gerrit.wikimedia.org/r/1242268

Change #1242269 merged by Arnaudb:

[operations/puppet@production] gerrit: swap gerrit-spare and gerrit-replica

https://gerrit.wikimedia.org/r/1242269

Change #1242272 merged by Arnaudb:

[operations/puppet@production] gerrit: disable service on gerrit2002 to reimage

https://gerrit.wikimedia.org/r/1242272

ABran-WMF changed the task status from Open to In Progress (Feb 24 2026, 1:15 PM).
ABran-WMF moved this task from Backlog to Work in Progress on the collaboration-services board.

Change #1243119 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/dns@master] gerrit: fix discovery record

https://gerrit.wikimedia.org/r/1243119

Change #1243119 merged by Arnaudb:

[operations/dns@master] gerrit: fix discovery record

https://gerrit.wikimedia.org/r/1243119

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1003 for host gerrit2002.wikimedia.org with OS bookworm

Change #1243131 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: resume replication on gerrit-spare

https://gerrit.wikimedia.org/r/1243131

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1003 for host gerrit2002.wikimedia.org with OS bookworm completed:

  • gerrit2002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202602241354_arnaudb_766456_gerrit2002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1242275 merged by Arnaudb:

[operations/puppet@production] gerrit: migrate gerrit2 system user to gerrit

https://gerrit.wikimedia.org/r/1242275

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1003 for host gerrit2002.wikimedia.org with OS bookworm

I've updated the relation chain here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1242275
We were putting gerrit2002 back in the role(gerrit) Puppet group too soon for the migration to happen properly.
There is now one more step in that chain to handle this.

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1003 for host gerrit2002.wikimedia.org with OS bookworm completed:

  • gerrit2002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202602241443_arnaudb_816182_gerrit2002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1242279 merged by Arnaudb:

[operations/puppet@production] gerrit: install gerrit and sync-instances

https://gerrit.wikimedia.org/r/1242279

sync-instances cookbook running

Change #1243131 merged by Arnaudb:

[operations/puppet@production] gerrit: resume replication on gerrit-spare

https://gerrit.wikimedia.org/r/1243131

Replication has been warmed up. @Dzahn will handle the last step:

sudo cookbook sre.gerrit.restart-gerrit -t T417247 --host gerrit2003

Gerrit replication has been configured, so the primary instance needs to be restarted to pick up the new config change.

gerrit2002 is now gerrit-spare (https://grafana.wikimedia.org/goto/Zoho3OOvg?orgId=1); replication is catching up.

Backups from gerrit2002 are failing with:

Could not stat "/var/lib/gerrit": ERR=No such file or directory

The same happens for gerrit2003.wikimedia.org-Hourly-Tue-ReposEqiad-gerrit-repo-data. Is this expected?

It would make sense if the message were "/var/lib/gerrit2: No such file or directory", since we got rid of the "2" in "gerrit2":

[gerrit2002:/var/lib] $ cd /var/lib/gerrit2
-bash: cd: /var/lib/gerrit2: No such file or directory
[gerrit2002:/var/lib] $ cd /var/lib/gerrit
[gerrit2002:/var/lib/gerrit] $

Our backup::set definition contains both the "gerrit" and "gerrit2" paths because both existed during a transitional period.

Same on gerrit2003: /var/lib/gerrit2 is gone, /var/lib/gerrit is not. That is the exact opposite of what the error message seems to say.
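A quick way to confirm which of the paths in the file set still exist on a host is a small check like the one below. This is only a sketch: `split_by_existence` is a hypothetical helper, and the demo uses a temporary directory standing in for /var/lib rather than touching the real paths.

```python
import os
import tempfile


def split_by_existence(paths):
    """Return (present, missing) lists for the given filesystem paths."""
    present = [p for p in paths if os.path.exists(p)]
    missing = [p for p in paths if not os.path.exists(p)]
    return present, missing


if __name__ == "__main__":
    # Temporary directory standing in for /var/lib (assumption for the demo).
    with tempfile.TemporaryDirectory() as root:
        new_dir = os.path.join(root, "gerrit")
        old_dir = os.path.join(root, "gerrit2")
        os.mkdir(new_dir)  # only the renamed directory exists
        present, missing = split_by_existence([new_dir, old_dir])
        print("present:", present)
        print("missing:", missing)
```

On the reimaged hosts the expectation from the shell session above would be that the "gerrit" path is present and the "gerrit2" path is missing.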

Change #1243183 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] backup: adjust gerrit file set after renaming of gerrit2

https://gerrit.wikimedia.org/r/1243183

Then that message may be misleading and not the cause of the issues, but the error is real:

Terminated Jobs:
 JobId  Level      Files    Bytes   Status   Finished        Name 
====================================================================
684387  Incr           0         0   Error    24-Feb-26 14:03 gerrit2002.wikimedia.org-Hourly-Tue-ReposEqiad-gerrit-repo-data

This is the real error:

24-Feb 15:00 backup1014.eqiad.wmnet JobId 684390: Start Backup JobId 684390, Job=gerrit2002.wikimedia.org-Hourly-Tue-ReposEqiad-gerrit-repo-data.2026-02-24_15.00.00_52
24-Feb 15:00 backup1014.eqiad.wmnet JobId 684390: Using Device "FileStorageReposEqiad" to write.
24-Feb 15:03 backup1014.eqiad.wmnet JobId 684390: Fatal error: bsockcore.c:208 Unable to connect to Client: gerrit2002.wikimedia.org-fd on gerrit2002.wikimedia.org:9102. ERR=Interrupted system call
24-Feb 15:03 backup1014.eqiad.wmnet JobId 684390: Fatal error: No Job status returned from FD.
24-Feb 15:03 backup1014.eqiad.wmnet JobId 684390: Error: Bacula backup1014.eqiad.wmnet 9.6.7 (10Dec20):
  Build OS:               x86_64-pc-linux-gnu debian bookworm/sid
  JobId:                  684390
  Job:                    gerrit2002.wikimedia.org-Hourly-Tue-ReposEqiad-gerrit-repo-data.2026-02-24_15.00.00_52
  Backup Level:           Incremental, since=2026-02-24 13:00:03
  Client:                 "gerrit2002.wikimedia.org-fd" 9.6.7 (10Dec20) x86_64-pc-linux-gnu,debian,bullseye/sid
  FileSet:                "gerrit-repo-data" 2024-11-06 18:00:00
  Pool:                   "ReposEqiad" (From Job resource)
  Catalog:                "production" (From Client resource)
  Storage:                "backup1012-FileStorageReposEqiad" (From Pool resource)
  Scheduled time:         24-Feb-2026 15:00:00
  Start time:             24-Feb-2026 15:00:02
  End time:               24-Feb-2026 15:03:12
  Elapsed time:           3 mins 10 secs
  Priority:               10
  FD Files Written:       0
  SD Files Written:       0
  FD Bytes Written:       0 (0 B)
  SD Bytes Written:       0 (0 B)
  Rate:                   0.0 KB/s
  Software Compression:   None
  Comm Line Compression:  None
  Snapshot/VSS:           no
  Encryption:             no
  Accurate:               no
  Volume name(s):         
  Volume Session Id:      1371
  Volume Session Time:    1770030267
  Last Volume Bytes:      300,830,071,106 (300.8 GB)
  Non-fatal FD errors:    1
  SD Errors:              0
  FD termination status:  Error
  SD termination status:  Waiting on FD
  Termination:            *** Backup Error ***
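For triaging summaries like the one above, the "Key: value" lines can be turned into a dictionary and checked for the termination status. The following is a rough sketch, not part of Bacula or of our tooling: `parse_job_summary` is a hypothetical helper, and the sample is a shortened excerpt of the job report pasted above.

```python
import re


def parse_job_summary(text):
    """Parse the indented 'Key:   value' lines of a Bacula job summary."""
    fields = {}
    for line in text.splitlines():
        # Summary lines are indented; the key is everything before the first colon.
        match = re.match(r"^\s+([^:]+):\s*(.*)$", line)
        if match:
            fields[match.group(1).strip()] = match.group(2).strip()
    return fields


# Shortened excerpt of the job summary from this task.
SAMPLE = """\
  JobId:                  684390
  FD Files Written:       0
  FD termination status:  Error
  SD termination status:  Waiting on FD
  Termination:            *** Backup Error ***
"""

summary = parse_job_summary(SAMPLE)
failed = "Error" in summary.get("Termination", "")
print(summary["JobId"], "failed" if failed else "ok")
```

An "FD termination status" of Error combined with zero files written points at the file daemon side (here, the connection to gerrit2002 on port 9102) rather than at the storage daemon.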

In that case, I would think it only happened because the backup job attempted to connect while the host was being reimaged.

current status of bacula-fd looks normal:

[gerrit2002:/var/lib/gerrit] $ systemctl status bacula-fd
● bacula-fd.service - Bacula File Daemon service
     Loaded: loaded (/lib/systemd/system/bacula-fd.service; enabled; preset: enabled)
     Active: active (running) since Tue 2026-02-24 15:05:47 UTC; 3h 18min ago
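To check the timing, the end time of failed job 684390 can be compared with the start time of bacula-fd shown above. This is a minimal sketch that assumes the Bacula timestamps are in UTC like the systemd one; `parse_bacula_time` and `parse_systemd_time` are hypothetical helpers written for this comparison.

```python
from datetime import datetime

MONTHS = {
    name: number
    for number, name in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}


def parse_bacula_time(value):
    """Parse a Bacula summary timestamp such as '24-Feb-2026 15:03:12'."""
    date_part, time_part = value.split()
    day, month, year = date_part.split("-")
    hour, minute, second = (int(x) for x in time_part.split(":"))
    return datetime(int(year), MONTHS[month], int(day), hour, minute, second)


def parse_systemd_time(value):
    """Parse the date and time from an 'Active: ... since' line."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")


# Values copied from the job summary and the systemctl output in this task.
job_end = parse_bacula_time("24-Feb-2026 15:03:12")
fd_start = parse_systemd_time("2026-02-24 15:05:47")

gap = fd_start - job_end
print(f"bacula-fd started {gap} after the failed job ended")
```

Under that assumption the daemon came up after this particular job had already given up, which is consistent with the job hitting the host mid-reimage. It says nothing about failures of later jobs.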

Change #1243257 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] gerrit: update gerrit2002 after reimaging

https://gerrit.wikimedia.org/r/1243257

Change #1243257 merged by Bking:

[operations/puppet@production] gerrit: update gerrit2002 after reimaging

https://gerrit.wikimedia.org/r/1243257

The Gerrit replication to gerrit2002 (spare host) broke after the reimaging. The replication dashboard shows a gap in the latency and a high rate of retries:

[Image: gerrit2002_spare_replication_stopped.png]

I looked at /var/log/gerrit/replication_log and, sure enough, jgit was complaining about accepting the host key. The key had not been updated in Puppet after the reimage :-]

Fixed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/1243257, reviewed and merged by Brian King.
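Since a reimage generates new SSH host keys, one thing worth checking afterwards is whether the known_hosts data used for replication lists the host at all and with which key types. Below is a rough sketch for plain (unhashed) OpenSSH known_hosts text. `key_types_for_host` is a hypothetical helper, and the host names and keys in the sample are made-up placeholders, not the real entries managed by Puppet.

```python
def key_types_for_host(known_hosts_text, hostname):
    """Return the key types recorded for hostname in unhashed known_hosts text."""
    types = []
    for line in known_hosts_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        if parts[0].startswith("@"):
            parts = parts[1:]  # drop markers such as @cert-authority
        if len(parts) < 3:
            continue  # not a complete "hosts keytype key" entry
        hosts = parts[0].split(",")
        if hostname in hosts:
            types.append(parts[1])
    return types


# Placeholder entries for illustration only.
SAMPLE = """\
# example known_hosts content
spare.example.org,spare-alias.example.org ssh-ed25519 AAAAplaceholder
other.example.org ssh-rsa AAAAplaceholder
"""

print(key_types_for_host(SAMPLE, "spare.example.org"))
print(key_types_for_host(SAMPLE, "missing.example.org"))
```

This only tells whether an entry exists. Comparing the stored key against the key the reimaged host actually presents is still needed to catch a stale entry like the one that broke replication here.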

@jcrespo can you confirm backups are now OK?

Yes, they are ok now.

> In that case, I would think it only happened because the backup job attempted to connect while the host was being reimaged.

It got overloaded afterwards, so it was beyond the reimage time:

[Image: image.png]

But things are ok now.

Change #1243183 merged by Dzahn:

[operations/puppet@production] backup: adjust gerrit file set after renaming of gerrit2

https://gerrit.wikimedia.org/r/1243183

Thanks for the confirmation, @jcrespo. I think this can be considered resolved then.