Bootstrap new Cassandra nodes (codfw)
Closed, Resolved (Public)

Description

Add aqs2001-2012 to the AQS Cassandra cluster.

  • aqs2001
  • aqs2002
  • aqs2003
  • aqs2004
  • aqs2005
  • aqs2006
  • aqs2007
  • aqs2008
  • aqs2009
  • aqs2010
  • aqs2011
  • aqs2012

See: T305568

Event Timeline

Eevans triaged this task as Medium priority. May 6 2022, 4:00 PM

Change 802604 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] WIP: Configure AQS Cassandra hosts

https://gerrit.wikimedia.org/r/802604

Change 802631 had a related patch set uploaded (by Eevans; author: Eevans):

[labs/private@master] Dummy keys and certificates for cassandra (aqs)

https://gerrit.wikimedia.org/r/802631

Change 802631 merged by MVernon:

[labs/private@master] Dummy keys and certificates for cassandra (aqs)

https://gerrit.wikimedia.org/r/802631

Change 802604 merged by MVernon:

[operations/puppet@production] Configure AQS Cassandra hosts (codfw)

https://gerrit.wikimedia.org/r/802604

These were installed with bullseye (the default), and we have thus far only run Cassandra on <= buster. We are missing the cassandradev component for bullseye, and jvm-tools and cassandra-tools-wmf are missing from bullseye main. In addition, cassandra-tools-wmf has a package dependency on python-yaml, which is no longer shipped in bullseye. I don't anticipate that we'd have any other issues running Cassandra on Debian 11.

An (unplanned) transition to bullseye feels a little like scope creep, but tackling this now means migrating 18 fewer nodes in the future (the 12 here, and the 6 additional going up in eqiad), which seems worth doing.

@hnowlan do you know if there would be any issues running RESTBase (AQS ~= RESTBase) on Bullseye?

From an exchange on Slack:

Eric Evans 9:50 AM @hnowlan do you know if we are running RESTBase on Bullseye anywhere? Do you know if there would be any problems running it on Bullseye? NodeJS version compatibility maybe?
Hugh Nowlan 9:56 AM Node is definitely the major blocker I can see - openjdk packages have already been ported to bullseye afaict
Eric Evans 9:56 AM oh, so that would be a blocker?
Hugh Nowlan 9:57 AM as far as I know buster’s version would be a fairly significant leap forward. Petr was working on this just before he left afaik but I dunno if he made any real progress

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host aqs2001.codfw.wmnet with OS buster

Trying a reimage of aqs2001 with buster.

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host aqs2001.codfw.wmnet with OS buster completed:

  • aqs2001 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206131530_mvernon_1607174_aqs2001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host aqs2002.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host aqs2002.codfw.wmnet with OS buster completed:

  • aqs2002 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206131628_mvernon_1615444_aqs2002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host aqs2003.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host aqs2003.codfw.wmnet with OS buster completed:

  • aqs2003 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206140802_mvernon_1724947_aqs2003.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host aqs2004.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host aqs2005.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host aqs2006.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host aqs2004.codfw.wmnet with OS buster completed:

  • aqs2004 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206141006_mvernon_1742433_aqs2004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host aqs2007.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host aqs2008.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host aqs2009.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host aqs2010.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host aqs2011.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host aqs2012.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host aqs2005.codfw.wmnet with OS buster completed:

  • aqs2005 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206141237_mvernon_1763692_aqs2005.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

It'll take the cookbooks a while to catch up (they back off at increasing intervals while waiting for Puppet to be OK), but after some deployment-related hassle, Puppet has now run to completion on all of these nodes, so they should be good to go for you, @Eevans.
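
For readers unfamiliar with the pattern, the "back off in increasing intervals" behaviour mentioned above can be sketched roughly as follows. This is only an illustration and not the cookbook's actual code: the function name `wait_until_ok`, the `check` callable and all delay parameters are hypothetical.

```python
import time
from typing import Callable


def wait_until_ok(
    check: Callable[[], bool],
    initial_delay: float = 1.0,
    factor: float = 2.0,
    max_delay: float = 60.0,
    max_attempts: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check() until it returns True, waiting longer after each failure.

    All names and defaults here are hypothetical; they only illustrate the
    general retry-with-increasing-interval pattern.
    """
    delay = initial_delay
    for attempt in range(max_attempts):
        if check():
            return True
        # Do not sleep after the final failed attempt.
        if attempt < max_attempts - 1:
            sleep(delay)
        # Grow the interval, but cap it so waits stay bounded.
        delay = min(delay * factor, max_delay)
    return False
```

In a real setting `check` would be whatever test decides that Puppet is OK on the host. The `sleep` argument is injectable so the function can be exercised without actually waiting.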

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host aqs2006.codfw.wmnet with OS buster completed:

  • aqs2006 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206141239_mvernon_1763826_aqs2006.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host aqs2007.codfw.wmnet with OS buster completed:

  • aqs2007 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206141240_mvernon_1763918_aqs2007.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host aqs2008.codfw.wmnet with OS buster completed:

  • aqs2008 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206141241_mvernon_1763987_aqs2008.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host aqs2011.codfw.wmnet with OS buster completed:

  • aqs2011 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206141246_mvernon_1764554_aqs2011.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host aqs2009.codfw.wmnet with OS buster completed:

  • aqs2009 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206141242_mvernon_1764107_aqs2009.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host aqs2010.codfw.wmnet with OS buster completed:

  • aqs2010 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206141245_mvernon_1764349_aqs2010.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host aqs2012.codfw.wmnet with OS buster completed:

  • aqs2012 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206141247_mvernon_1764669_aqs2012.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
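
Since the per-host summaries above are long and nearly identical, a small script can help spot the ones that differ. The following is a minimal sketch, assuming the summaries are available as plain text in the same bullet layout shown in this task; the function name `hosts_with_failed_downtime` is hypothetical and the script is not part of the reimage cookbook.

```python
import re

# After stripping the bullet, host header lines look like "aqs2007 (WARN)".
HOST_LINE = re.compile(r"^(\S+) \((\w+)\)$")


def hosts_with_failed_downtime(report: str) -> list[str]:
    """Return hosts whose summary contains an 'Unable to downtime' step."""
    failed: list[str] = []
    current_host = None
    for raw in report.splitlines():
        # Drop indentation and the leading bullet character, if any.
        text = raw.strip().lstrip("•").strip()
        match = HOST_LINE.match(text)
        if match:
            current_host = match.group(1)
        elif text.startswith("Unable to downtime") and current_host:
            if current_host not in failed:
                failed.append(current_host)
    return failed
```

In this task, the two summaries that contain an "Unable to downtime" step are those for aqs2007 and aqs2012.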
Eevans claimed this task.
Eevans updated the task description.