
Increase sessionstore storage capacity
Open, Medium | Public

Assigned To
None
Authored By
Eevans
Apr 10 2025, 12:22 AM
Referenced Files
F59918080: Cassandra JBOD(1).png
May 12 2025, 6:38 PM
F59706410: image.png
May 5 2025, 11:25 PM
F59700885: image.png
May 5 2025, 3:32 PM
F59700900: image.png
May 5 2025, 3:32 PM
F59028475: image.png
Apr 10 2025, 12:22 AM

Description

Prior to the outage on 2025-03-31, we were under the impression that sessionstore was wildly over-provisioned. What that incident demonstrated, though, is that an aberrant workload has the potential to create rapid, unsustainable growth. Worse, every indication is that the workload in question was accidental; a bad actor with an understanding of the circumstances could probably do much worse. We should increase storage capacity to extend our runway and buy more time in such situations.

The current disk configuration uses two 480GB SSDs in a software RAID1. LVM is used to create volumes for swap, /, and /srv. The last of these is used exclusively by Cassandra, and is ~370GB in size. What I propose is to use a smaller RAID1 for swap and /, and leave the remaining space on each drive for a JBOD configuration in Cassandra (with Cassandra system tables stored on the RAID). This would roughly double the space available to Cassandra. Obviously, this will require a reimage of each host.

image.png (644×916 px, 19 KB)

With this configuration in place, we can later add SSDs to the JBOD if we determine more space is needed.
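As a rough illustration of how capacity would scale with drive count, here is a minimal sketch. It assumes nominal drive sizes (ignoring filesystem and LVM overhead) and uses the ~60G per-drive reservation mentioned in the 2025-05-12 edit of this description; the function name and the 3- and 4-drive cases are hypothetical and only there for illustration.

```python
def jbod_capacity_gb(drive_gb: float, n_drives: int, reserved_gb: float) -> float:
    """Nominal space left for Cassandra data across a JBOD.

    Each drive gives up reserved_gb (swap, /, instance-data); the
    remainder of every drive is an independent data directory.
    """
    if n_drives < 1:
        raise ValueError("need at least one drive")
    if reserved_gb > drive_gb:
        raise ValueError("reservation is larger than the drive")
    return n_drives * (drive_gb - reserved_gb)


# ~370GB is the size of the current /srv volume on the RAID1.
CURRENT_SRV_GB = 370

# Assumed figures: 480GB drives, ~60GB reserved on each.
for n in (2, 3, 4):
    cap = jbod_capacity_gb(480, n, 60)
    ratio = cap / CURRENT_SRV_GB
    print(f"{n} drives: {cap:.0f} GB nominal ({ratio:.1f}x current /srv)")
```

Under these assumptions, two drives come out at a bit more than double the current /srv volume, and each added SSD contributes its full size minus the reservation.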

Finally, there is some impetus to increase storage density in all our clusters (where possible), and the use of JBOD configurations in Cassandra is something being considered more widely (see: T380416: modernize cassandra deployments). The type of configuration discussed here seems as though it could be standardized and utilized for other clusters as well (read: we should keep that in mind).


Edit (2025-05-12):

The updated proposal looks something like this (see also: r1142635):

Not to scale

This sets aside ~60G on every drive for swap, /, and /srv/cassandra/instance-data (including, critically, those drives not needed for the RAID1). That's about 13% of the 480G SSDs used here (which seems like quite a lot), or ~3% of the 1.9T SSDs (which we probably ought to standardize on).
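The percentages above can be checked with a few lines of arithmetic. This is only a sketch: it treats 480G and 1.9T as nominal sizes of 480 and 1900 GB, and the function name is made up for the example.

```python
def reserved_fraction(reserved_gb: float, drive_gb: float) -> float:
    """Fraction of a drive given up to the per-drive reservation."""
    if drive_gb <= 0:
        raise ValueError("drive size must be positive")
    return reserved_gb / drive_gb


RESERVED_GB = 60  # ~60G set aside on every drive

# Nominal drive sizes in GB (assumed: 1.9T taken as 1900 GB).
for label, size_gb in (("480G", 480), ("1.9T", 1900)):
    pct = reserved_fraction(RESERVED_GB, size_gb) * 100
    print(f"{label} SSD: {pct:.1f}% reserved")
# prints:
# 480G SSD: 12.5% reserved
# 1.9T SSD: 3.2% reserved
```

The fixed reservation is the reason larger drives look more attractive here: the same 60G is a much smaller share of a 1.9T device.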

Maybe worth noting: the RAID1 could be extended over more than 2 drives, which could provide a bit more redundancy and read throughput (even if neither is really needed).


See also: T390630: Alert when disk space utilization on sessionstore nodes is trending high

Details

Related Changes in Gerrit:
Subject | Repo | Branch | Lines +/-
 | operations/alerts | master | +10 -10
 | operations/puppet | production | +1 -5
 | operations/puppet | production | +14 -0
 | operations/puppet | production | +1 -4
 | operations/puppet | production | +14 -0
 | operations/puppet | production | +2 -2
 | operations/puppet | production | +14 -0
 | operations/puppet | production | +4 -1
 | operations/puppet | production | +1 -4
 | operations/puppet | production | +14 -0
 | operations/puppet | production | +4 -1
 | operations/puppet | production | +10 -10
 | operations/puppet | production | +10 -10
 | operations/puppet | production | +2 -2
 | operations/puppet | production | +5 -5
 | operations/puppet | production | +5 -5
 | operations/puppet | production | +4 -0
 | operations/puppet | production | +10 -0
 | operations/puppet | production | +1 -1
 | operations/puppet | production | +5 -4
 | operations/puppet | production | +19 -25
 | operations/puppet | production | +12 -0
 | operations/puppet | production | +18 -9
 | operations/puppet | production | +1 -6
 | operations/puppet | production | +2 -2
 | operations/puppet | production | +19 -8
 | operations/puppet | production | +11 -0
 | operations/puppet | production | +2 -2
 | operations/puppet | production | +14 -1
 | operations/puppet | production | +6 -6
 | operations/puppet | production | +3 -3
 | operations/puppet | production | +8 -0
 | operations/puppet | production | +8 -8
 | operations/puppet | production | +8 -8
 | operations/puppet | production | +5 -1
 | operations/puppet | production | +106 -0

Event Timeline

There are a very large number of changes, so older changes are hidden.

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2005.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore2005.codfw.wmnet with OS bullseye executed with errors:

  • sessionstore2005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sessionstore2005.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2005.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore2005.codfw.wmnet with OS bullseye executed with errors:

  • sessionstore2005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sessionstore2005.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2005.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore2005.codfw.wmnet with OS bullseye executed with errors:

  • sessionstore2005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sessionstore2005.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2005.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore2005.codfw.wmnet with OS bullseye executed with errors:

  • sessionstore2005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sessionstore2005.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2005.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore2005.codfw.wmnet with OS bullseye executed with errors:

  • sessionstore2005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sessionstore2005.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2005.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore2005.codfw.wmnet with OS bullseye completed:

  • sessionstore2005 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202506231703_eevans_2848874_sessionstore2005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2025-06-23T18:12:08Z] <urandom> bootstrapping Cassandra/sessionstore2005 — T391544

Change #1163007 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/alerts@master] adjust sessionstore disk utilization for JBOD

https://gerrit.wikimedia.org/r/1163007

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host cassandra-dev2001.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host cassandra-dev2001.codfw.wmnet with OS bullseye completed:

  • cassandra-dev2001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202506231945_eevans_2867741_cassandra-dev2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host cassandra-dev2001.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host cassandra-dev2001.codfw.wmnet with OS bullseye completed:

  • cassandra-dev2001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202506232308_eevans_2887253_cassandra-dev2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1163466 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] cassandra-dev2001: testing new data file directory names

https://gerrit.wikimedia.org/r/1163466

Change #1163466 merged by Eevans:

[operations/puppet@production] cassandra-dev2001: testing new data file directory names

https://gerrit.wikimedia.org/r/1163466

Change #1163468 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] cassandra-dev2001: actually update all directories

https://gerrit.wikimedia.org/r/1163468

Change #1163468 merged by Eevans:

[operations/puppet@production] cassandra-dev2001: actually update all directories

https://gerrit.wikimedia.org/r/1163468

Change #1163852 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] cassandra-dev200[23]: setup for (no reuse) reimaging

https://gerrit.wikimedia.org/r/1163852

Change #1163853 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] cassandra-dev2002: updated data_file_directories list

https://gerrit.wikimedia.org/r/1163853

Change #1163854 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] cassandra-dev2003: updated data_file_directories list

https://gerrit.wikimedia.org/r/1163854

Change #1163852 merged by Eevans:

[operations/puppet@production] cassandra-dev200[23]: setup for (no reuse) reimaging

https://gerrit.wikimedia.org/r/1163852

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host cassandra-dev2002.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host cassandra-dev2002.codfw.wmnet with OS bullseye completed:

  • cassandra-dev2002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202506251926_eevans_3191025_cassandra-dev2002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1163853 merged by Eevans:

[operations/puppet@production] cassandra-dev2002: updated data_file_directories list

https://gerrit.wikimedia.org/r/1163853

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host cassandra-dev2003.codfw.wmnet with OS bullseye

Change #1163854 merged by Eevans:

[operations/puppet@production] cassandra-dev2003: updated data_file_directories list

https://gerrit.wikimedia.org/r/1163854

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host cassandra-dev2003.codfw.wmnet with OS bullseye completed:

  • cassandra-dev2003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202506252036_eevans_3198851_cassandra-dev2003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2025-06-26T13:45:32Z] <urandom> decommissioning Cassandra/sessionstore2004-a — T391544

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2004.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore2004.codfw.wmnet with OS bullseye executed with errors:

  • sessionstore2004 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sessionstore2004.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2004.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore2004.codfw.wmnet with OS bullseye executed with errors:

  • sessionstore2004 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sessionstore2004.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2004.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore2004.codfw.wmnet with OS bullseye executed with errors:

  • sessionstore2004 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sessionstore2004.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2004.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore2004.codfw.wmnet with OS bullseye completed:

  • sessionstore2004 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202506261505_eevans_3316477_sessionstore2004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2005.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore2005.codfw.wmnet with OS bullseye completed:

  • sessionstore2005 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202506261702_eevans_3329901_sessionstore2005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1164305 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] sessionstore2006: reimage to JBOD configuration

https://gerrit.wikimedia.org/r/1164305

Change #1164306 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] sessionstore2006: setup JBOD-based data_file_directories

https://gerrit.wikimedia.org/r/1164306

Change #1164307 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] sessionstore2006: preseed d-i for partition reuse

https://gerrit.wikimedia.org/r/1164307

Change #1164305 merged by Eevans:

[operations/puppet@production] sessionstore2006: reimage to JBOD configuration

https://gerrit.wikimedia.org/r/1164305

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2006.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore2006.codfw.wmnet with OS bullseye executed with errors:

  • sessionstore2006 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sessionstore2006.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2006.codfw.wmnet with OS bullseye

Change #1164306 merged by Eevans:

[operations/puppet@production] sessionstore2006: setup JBOD-based data_file_directories

https://gerrit.wikimedia.org/r/1164306

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore2006.codfw.wmnet with OS bullseye executed with errors:

  • sessionstore2006 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sessionstore2006.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2006.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore2006.codfw.wmnet with OS bullseye executed with errors:

  • sessionstore2006 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sessionstore2006.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2006.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore2006.codfw.wmnet with OS bullseye executed with errors:

  • sessionstore2006 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sessionstore2006.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2006.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore2006.codfw.wmnet with OS bullseye completed:

  • sessionstore2006 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202506262252_eevans_3368024_sessionstore2006.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1164307 merged by Eevans:

[operations/puppet@production] sessionstore2006: preseed d-i for partition reuse

https://gerrit.wikimedia.org/r/1164307

Change #1165013 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] sessionstore1004: reimage for JBOD configuration

https://gerrit.wikimedia.org/r/1165013

Change #1165014 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] sessionstore1004: assign JBOD data_file_directories

https://gerrit.wikimedia.org/r/1165014

Change #1165015 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] sessionstore1005: reimage for JBOD configuration

https://gerrit.wikimedia.org/r/1165015

Change #1165016 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] sessionstore1005: assign JBOD data_file_directories

https://gerrit.wikimedia.org/r/1165016

Change #1165017 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] sessionstore1006: reimage for JBOD configuration

https://gerrit.wikimedia.org/r/1165017

Change #1165018 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] sessionstore1006: assign JBOD data_file_directories

https://gerrit.wikimedia.org/r/1165018

Change #1165019 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] sessionstore: preseed eqiad servers for partition reuse

https://gerrit.wikimedia.org/r/1165019

Change #1165013 merged by Eevans:

[operations/puppet@production] sessionstore1004: reimage for JBOD configuration

https://gerrit.wikimedia.org/r/1165013

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore1004.eqiad.wmnet with OS bullseye

Change #1165014 merged by Eevans:

[operations/puppet@production] sessionstore1004: assign JBOD data_file_directories

https://gerrit.wikimedia.org/r/1165014

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore1004.eqiad.wmnet with OS bullseye completed:

  • sessionstore1004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202506301444_eevans_3872841_sessionstore1004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1165015 merged by Eevans:

[operations/puppet@production] sessionstore1005: reimage for JBOD configuration

https://gerrit.wikimedia.org/r/1165015

Mentioned in SAL (#wikimedia-operations) [2025-06-30T16:14:39Z] <urandom> decommissioning Cassandra/sessionstore1005-a — T391544

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore1005.eqiad.wmnet with OS bullseye executed with errors:

  • sessionstore1005 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sessionstore1005.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore1005.eqiad.wmnet with OS bullseye executed with errors:

  • sessionstore1005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sessionstore1005.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore1005.eqiad.wmnet with OS bullseye executed with errors:

  • sessionstore1005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sessionstore1005.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore1005.eqiad.wmnet with OS bullseye executed with errors:

  • sessionstore1005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sessionstore1005.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore1005.eqiad.wmnet with OS bullseye

Change #1165016 merged by Eevans:

[operations/puppet@production] sessionstore1005: assign JBOD data_file_directories

https://gerrit.wikimedia.org/r/1165016

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore1005.eqiad.wmnet with OS bullseye completed:

  • sessionstore1005 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507012024_eevans_4056145_sessionstore1005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1165017 merged by Eevans:

[operations/puppet@production] sessionstore1006: reimage for JBOD configuration

https://gerrit.wikimedia.org/r/1165017

Mentioned in SAL (#wikimedia-operations) [2025-07-07T14:10:47Z] <urandom> decommissioning Cassandra/sessionstore-a — T391544

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore1006.eqiad.wmnet with OS bullseye executed with errors:

  • sessionstore1006 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sessionstore1006.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore1006.eqiad.wmnet with OS bullseye

Change #1165018 merged by Eevans:

[operations/puppet@production] sessionstore1006: assign JBOD data_file_directories

https://gerrit.wikimedia.org/r/1165018

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore1006.eqiad.wmnet with OS bullseye executed with errors:

  • sessionstore1006 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507071446_eevans_682463_sessionstore1006.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sessionstore1006.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore1006.eqiad.wmnet with OS bullseye executed with errors:

  • sessionstore1006 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sessionstore1006.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore1006.eqiad.wmnet with OS bullseye completed:

  • sessionstore1006 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507071709_eevans_697397_sessionstore1006.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2025-07-07T18:10:37Z] <urandom> bootstrapping Cassandra/sessionstore1006-a — T391544

Change #1165019 merged by Eevans:

[operations/puppet@production] sessionstore: preseed eqiad servers for partition reuse

https://gerrit.wikimedia.org/r/1165019

Change #1163007 merged by Eevans:

[operations/alerts@master] adjust sessionstore disk utilization for JBOD

https://gerrit.wikimedia.org/r/1163007