Hardware refresh of aqs101[0-2,4-5] w/ aqs102[3-7]
Open, Medium, Public

Description

Five new hosts (aqs102[3-7]) have been provisioned as replacements for aqs101[0-2,4-5], which are now EOL. The new hosts need to be configured and bootstrapped, and the old hosts decommissioned.

New host  Replaces  Cassandra rack ID  Placement
aqs1023   aqs1010   rack1              D1
aqs1024   aqs1011   rack2              E7
aqs1025   aqs1012   rack3              C4
aqs1026   aqs1014   rack2              E8
aqs1027   aqs1015   rack3              C7
  • Reimage (vlan migration)
    • aqs1023
    • aqs1024
    • aqs1025
    • aqs1026
    • aqs1027
  • Disk configuration
    • aqs1023
    • aqs1024
    • aqs1025
    • aqs1026
    • aqs1027
  • Bootstrap
    • aqs1023
    • aqs1024
    • aqs1025
    • aqs1026
    • aqs1027
  • Decommission
    • aqs1010
    • aqs1011 ...in-progress
    • aqs1012
    • aqs1014
    • aqs1015
  • Update nodes lists
    • scap(?)
    • k8s
    • seeds (puppet)
  • Cleanups
    • rack1
    • rack2
    • rack3
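
The bracketed host ranges used in the title and description (aqs102[3-7], aqs101[0-2,4-5]) can be expanded into individual hostnames with a few lines of Python. This is only an illustrative sketch: `expand_hosts` is a hypothetical helper written for this note, not part of any cookbook or tool mentioned in the task, and it handles only a single numeric bracket expression. Pairing the two expanded lists in order happens to reproduce the New host / Replaces columns of the table above.

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand e.g. 'aqs101[0-2,4-5]' into individual hostnames.

    Hypothetical helper: supports one bracket expression made of
    comma-separated numbers or inclusive numeric ranges.
    """
    match = re.fullmatch(r"(\w+?)\[([\d,\-]+)\](.*)", pattern)
    if match is None:
        return [pattern]  # no bracket expression, treat as a single host
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        start, _, end = part.partition("-")
        end = end or start  # a bare number is a range of one
        for n in range(int(start), int(end) + 1):
            # keep the zero padding width of the range start
            hosts.append(f"{prefix}{n:0{len(start)}d}{suffix}")
    return hosts


new_hosts = expand_hosts("aqs102[3-7]")
old_hosts = expand_hosts("aqs101[0-2,4-5]")

# Pair new and old hosts in order, as in the table in the description.
replacements = dict(zip(new_hosts, old_hosts))

for new, old in replacements.items():
    print(f"{new} replaces {old}")
```

Note that aqs1013 is absent from the old range, which is why aqs1026 pairs with aqs1014 rather than aqs1013.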

See also:

{T404774}
T407032: Q2:rack/setup/install aqs102[3-7]
https://gerrit.wikimedia.org/r/c/operations/puppet/+/1195276

Event Timeline

Eevans triaged this task as Medium priority. Feb 10 2026, 8:08 PM
Eevans updated the task description.

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host aqs1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host aqs1023.eqiad.wmnet with OS bullseye executed with errors:

  • aqs1023 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console aqs1023.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host aqs1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host aqs1023.eqiad.wmnet with OS bullseye executed with errors:

  • aqs1023 (FAIL)
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console aqs1023.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host aqs1023.eqiad.wmnet with OS bullseye executed with errors:

  • aqs1023 (FAIL)
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console aqs1023.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Failing due to T411054: Nokia SR-Linux DHCP Relay Bug

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host aqs1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host aqs1023.eqiad.wmnet with OS bullseye completed:

  • aqs1023 (PASS)
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202603301729_eevans_3751442_aqs1023.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status failed -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
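
The reimage log filenames in these summaries appear to encode when and for which host the cookbook ran, which is handy because the timeline entries themselves carry no dates. The sketch below splits such a filename into its parts. The field layout (timestamp, user, a numeric run identifier, host) is inferred only from the filenames shown in this task; in particular, treating the third field as an opaque run identifier and the first as a `YYYYMMDDHHMM` timestamp of unspecified time zone are assumptions, and `parse_reimage_log` is a hypothetical helper.

```python
from datetime import datetime
from pathlib import PurePosixPath


def parse_reimage_log(path: str) -> dict:
    """Split a reimage log path into its (assumed) components."""
    stem = PurePosixPath(path).stem  # drop the directory and the ".out" suffix
    stamp, user, run_id, host = stem.split("_", 3)
    return {
        # Assumed format: YYYYMMDDHHMM, time zone not stated in the log.
        "timestamp": datetime.strptime(stamp, "%Y%m%d%H%M"),
        "user": user,
        # Assumed to be an opaque numeric identifier for the run.
        "run_id": int(run_id),
        "host": host,
    }


info = parse_reimage_log(
    "/var/log/spicerack/sre/hosts/reimage/202603301729_eevans_3751442_aqs1023.out"
)
print(info["host"], info["timestamp"].isoformat())
```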

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host aqs1024.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host aqs1024.eqiad.wmnet with OS bullseye completed:

  • aqs1024 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202603301910_eevans_3867022_aqs1024.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host aqs1025.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host aqs1025.eqiad.wmnet with OS bullseye completed:

  • aqs1025 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202603302018_eevans_3919722_aqs1025.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host aqs1026.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host aqs1026.eqiad.wmnet with OS bullseye completed:

  • aqs1026 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202603302100_eevans_3978987_aqs1026.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host aqs1027.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host aqs1027.eqiad.wmnet with OS bullseye completed:

  • aqs1027 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202603302304_eevans_4014735_aqs1027.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1264800 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] aqs1023: assign aqs role

https://gerrit.wikimedia.org/r/1264800

Change #1264801 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] aqs1024: assign aqs role

https://gerrit.wikimedia.org/r/1264801

Change #1264802 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] aqs1025: assign aqs role

https://gerrit.wikimedia.org/r/1264802

Change #1264803 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] aqs1026: assign aqs role

https://gerrit.wikimedia.org/r/1264803

Change #1264804 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] aqs1027: assign aqs role

https://gerrit.wikimedia.org/r/1264804

Change #1264800 merged by Eevans:

[operations/puppet@production] aqs1023: assign aqs role

https://gerrit.wikimedia.org/r/1264800

Change #1268642 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] aqs1023: add secondary IPs

https://gerrit.wikimedia.org/r/1268642

Change #1268642 merged by Eevans:

[operations/puppet@production] aqs1023: add secondary IPs

https://gerrit.wikimedia.org/r/1268642

Icinga downtime and Alertmanager silence (ID=b1a37350-0009-4843-b8b1-394b77774b92) set by eevans@cumin1003 for 30 days, 0:00:00 on 1 host(s) and their services with reason: Bootstrapping — T412830

aqs1023.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2026-04-07T20:33:41Z] <eevans@cumin1003> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on aqs1023.eqiad.wmnet with reason: Bootstrapping — T412830
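
Since the bootstrap downtimes are set for a fixed 30 days, it can be useful to know roughly when each one lapses. The duration string in these entries ("30 days, 0:00:00") has the shape that Python's `str(timedelta)` produces, so it can be parsed back and added to the SAL timestamp. This is a sketch under two assumptions: that the SAL timestamp (logged when the cookbook finished) is close enough to the actual start of the downtime, and that the duration always follows that shape. `parse_duration` and `downtime_expiry` are hypothetical helpers, not part of the downtime cookbook.

```python
import re
from datetime import datetime, timedelta, timezone


def parse_duration(text: str) -> timedelta:
    """Parse 'D days, H:MM:SS' or 'H:MM:SS' into a timedelta."""
    m = re.fullmatch(r"(?:(\d+) days?, )?(\d+):(\d{2}):(\d{2})", text)
    if m is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    days, hours, minutes, seconds = (int(g or 0) for g in m.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def downtime_expiry(sal_timestamp: str, duration: str) -> datetime:
    """Approximate expiry: SAL timestamp (UTC) plus the downtime duration."""
    start = datetime.strptime(sal_timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return start.replace(tzinfo=timezone.utc) + parse_duration(duration)


expiry = downtime_expiry("2026-04-07T20:33:41Z", "30 days, 0:00:00")
print(expiry.isoformat())
```

For the aqs1023 entry above this gives an approximate expiry of 2026-05-07T20:33:41 UTC.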

Change #1268660 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] aqs1023: configure data file directories

https://gerrit.wikimedia.org/r/1268660

Change #1268660 merged by Eevans:

[operations/puppet@production] aqs1023: configure data file directories

https://gerrit.wikimedia.org/r/1268660

Change #1264801 merged by Eevans:

[operations/puppet@production] aqs1024: assign aqs role & configure

https://gerrit.wikimedia.org/r/1264801

Icinga downtime and Alertmanager silence (ID=b690c4dd-b459-40b6-813c-a96488b14a70) set by eevans@cumin1003 for 30 days, 0:00:00 on 1 host(s) and their services with reason: Bootstrapping — T412830

aqs1024.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2026-04-08T19:01:32Z] <eevans@cumin1003> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on aqs1024.eqiad.wmnet with reason: Bootstrapping — T412830

Eevans updated the task description.

Change #1270496 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] aqs2-common: Remove decommed aqs1012

https://gerrit.wikimedia.org/r/1270496

Change #1270496 merged by jenkins-bot:

[operations/deployment-charts@master] aqs2-common: Remove decommed aqs1012

https://gerrit.wikimedia.org/r/1270496

Change #1264802 merged by Eevans:

[operations/puppet@production] aqs1025: assign aqs role & configure

https://gerrit.wikimedia.org/r/1264802

Icinga downtime and Alertmanager silence (ID=04c3d1ab-e7a9-4f5e-a7d6-5cabf6e41715) set by eevans@cumin1003 for 30 days, 0:00:00 on 1 host(s) and their services with reason: Bootstrapping — T412830

aqs1025.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2026-04-13T21:08:54Z] <eevans@cumin1003> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on aqs1025.eqiad.wmnet with reason: Bootstrapping — T412830

Eevans updated the task description.

Change #1264803 merged by Eevans:

[operations/puppet@production] aqs1026: assign aqs role & configure

https://gerrit.wikimedia.org/r/1264803

Icinga downtime and Alertmanager silence (ID=503b5b97-7a74-4418-83cb-3ce3c1da025f) set by eevans@cumin1003 for 30 days, 0:00:00 on 1 host(s) and their services with reason: Bootstrapping — T412830

aqs1026.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2026-04-14T19:16:35Z] <eevans@cumin1003> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on aqs1026.eqiad.wmnet with reason: Bootstrapping — T412830

Change #1264804 merged by Eevans:

[operations/puppet@production] aqs1027: assign aqs role & configure

https://gerrit.wikimedia.org/r/1264804

Icinga downtime and Alertmanager silence (ID=4f5cffba-d44f-471a-9cc9-a53bea9d80cf) set by eevans@cumin1003 for 30 days, 0:00:00 on 1 host(s) and their services with reason: Bootstrapping — T412830

aqs1027.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2026-04-15T18:26:59Z] <eevans@cumin1003> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on aqs1027.eqiad.wmnet with reason: Bootstrapping — T412830

Change #1271985 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] installserver: configure new aqs hosts for partition reuse

https://gerrit.wikimedia.org/r/1271985

Change #1271985 merged by Eevans:

[operations/puppet@production] installserver: configure new aqs hosts for partition reuse

https://gerrit.wikimedia.org/r/1271985

Mentioned in SAL (#wikimedia-operations) [2026-04-16T13:41:32Z] <urandom> decommissioning Cassandra [a,b] on aqs1010 — T412830

Icinga downtime and Alertmanager silence (ID=539f254f-49d2-40f7-8633-d86080a0d5be) set by eevans@cumin1003 for 30 days, 0:00:00 on 1 host(s) and their services with reason: Bootstrapping — T412830

aqs1010.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2026-04-16T16:11:22Z] <eevans@cumin1003> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on aqs1010.eqiad.wmnet with reason: Bootstrapping — T412830

Icinga downtime and Alertmanager silence (ID=c515929f-9d76-4116-9fa1-b57d347d3300) set by eevans@cumin1003 for 30 days, 0:00:00 on 1 host(s) and their services with reason: Bootstrapping — T412830

aqs1011.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2026-04-17T14:06:33Z] <eevans@cumin1003> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on aqs1011.eqiad.wmnet with reason: Bootstrapping — T412830

Mentioned in SAL (#wikimedia-operations) [2026-04-17T14:09:25Z] <urandom> decommissioning Cassandra, aqs1011 [a,b] — T412830