T365763 Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234]

Subject	Repo	Branch	Lines +/-
cp5020: update hieradata for dual NVMe disks configuration	operations/puppet	production	+5 -0
cp5019: update hieradata for dual NVMe disks configuration	operations/puppet	production	+5 -0
cp5018: update hieradata for dual NVMe disks configuration	operations/puppet	production	+5 -0
cp5017: update hieradata for dual NVMe disks configuration	operations/puppet	production	+5 -0
depool eqsin for text cluster drive upgrade	operations/dns	master	+3 -0
cp5024: update hieradata for dual NVMe disks configuration	operations/puppet	production	+5 -0
cp5023: update hieradata for dual NVMe disks configuration	operations/puppet	production	+5 -0
cp5022: update hieradata for dual NVMe disks configuration	operations/puppet	production	+5 -0
cp5021: update hieradata for dual NVMe disks configuration	operations/puppet	production	+5 -0
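
The host pattern in the task title can be checked against the eight hosts listed in the patches above. The following is a minimal sketch, not part of the task itself: it assumes the title simply dropped the closing parenthesis (as written the pattern would not compile), and the candidate names cp5010 through cp5029 are generated only to exercise the pattern.

```python
import re

# Assumption: the title pattern is missing its closing parenthesis;
# it is added here so that the expression compiles.
HOST_PATTERN = re.compile(r"cp50(1[789]|2[01234])")


def matching_hosts(candidates):
    """Return the candidates that fully match the pattern, in input order."""
    return [name for name in candidates if HOST_PATTERN.fullmatch(name)]


# Hypothetical candidate names, generated only to test the pattern.
candidates = [f"cp50{n:02d}" for n in range(10, 30)]
hosts = matching_hosts(candidates)
print(hosts)
```

Under that assumption the pattern selects cp5017 through cp5024, which agrees with the eight hieradata patches listed above.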

RobH created this task. · May 23 2024, 9:16 PM

Restricted Application added a subscriber: Aklapper. · May 23 2024, 9:16 PM

RobH renamed this task from Q#:rack/setup/install X to Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234]. · May 23 2024, 9:17 PM

RobH mentioned this in Unknown Object (Task). · May 23 2024, 9:24 PM

RobH added a parent task: Unknown Object (Task).

RobH added a subtask: Unknown Object (Task).

RobH moved this task from Backlog to Racking Tasks on the ops-eqsin board. · May 23 2024, 9:27 PM

Maintenance_bot added a project: SRE. · May 23 2024, 9:29 PM

RobH updated the task description. · May 23 2024, 9:38 PM

RobH added a subscriber: ssingh.

RobH updated the task description. · May 23 2024, 9:41 PM

RobH added a subscriber: Fabfur.

RobH updated the task description. · Tue, Jun 11, 4:04 PM

RobH updated the task description. · Fri, Jun 14, 10:25 PM

ssingh updated the task description. · Fri, Jun 21, 3:50 PM

ssingh updated the task description.

BCornwall updated the task description. · Fri, Jun 21, 4:27 PM

Change #1049168 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] cp5017: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1049168

Change #1049169 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] cp5018: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1049169

Change #1049170 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] cp5019: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1049170

Change #1049171 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] cp5020: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1049171

Change #1049172 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] cp5021: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1049172

Change #1049173 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] cp5022: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1049173

Change #1049174 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] cp5023: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1049174

Change #1049175 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] cp5024: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1049175

Change #1049232 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/dns@master] depool ulsfo for text cluster drive upgrade

https://gerrit.wikimedia.org/r/1049232

RobH updated the task description. · Mon, Jun 24, 10:02 PM

RobH updated the task description.

Change #1049232 merged by BCornwall:

[operations/dns@master] depool eqsin for text cluster drive upgrade

https://gerrit.wikimedia.org/r/1049232

Mentioned in SAL (#wikimedia-operations) [2024-06-24T23:02:05Z] <brett> Running authdns-update on dns1004 to depool eqsin - T365763

BCornwall updated the task description. · Mon, Jun 24, 11:33 PM

BCornwall updated the task description.

Mentioned in SAL (#wikimedia-operations) [2024-06-25T00:01:08Z] <brett@cumin2002> START - Cookbook sre.hosts.downtime for 4:00:00 on 8 hosts with reason: T365763

Mentioned in SAL (#wikimedia-operations) [2024-06-25T00:01:33Z] <brett@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 8 hosts with reason: T365763

BCornwall updated the task description. · Tue, Jun 25, 1:11 AM

BCornwall updated the task description. · Tue, Jun 25, 1:35 AM

Mentioned in SAL (#wikimedia-operations) [2024-06-25T01:40:11Z] <brett> Removing downtime for cp[5017-5024] as nvme drives are installed and hosts back online - T365763

Mentioned in SAL (#wikimedia-operations) [2024-06-25T01:48:27Z] <brett> Running authdns-update on dns1004 to pool eqsin - T365763
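
The SAL timestamps above are enough to estimate how long the site was depooled and how long the hosts were downtimed. This is a rough sketch: the dictionary keys are illustrative labels, and the timestamps record when each log message was posted, which may differ slightly from when the action took effect.

```python
from datetime import datetime, timezone


def parse_sal(stamp: str) -> datetime:
    """Parse a SAL timestamp such as 2024-06-24T23:02:05Z as UTC."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


# Timestamps copied from the SAL entries in this task; keys are hypothetical labels.
events = {
    "depool": parse_sal("2024-06-24T23:02:05Z"),
    "downtime_start": parse_sal("2024-06-25T00:01:08Z"),
    "downtime_removed": parse_sal("2024-06-25T01:40:11Z"),
    "repool": parse_sal("2024-06-25T01:48:27Z"),
}

depooled_for = events["repool"] - events["depool"]
downtimed_for = events["downtime_removed"] - events["downtime_start"]

print("depooled for:", depooled_for)    # 2:46:22
print("downtimed for:", downtimed_for)  # 1:39:03
```

The downtime was scheduled for 4:00:00 but removed early once the drives were installed and the hosts were back online.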

BCornwall updated the task description. · Tue, Jun 25, 1:54 AM

BCornwall updated the task description.

RobH removed RobH as the assignee of this task. · Tue, Jun 25, 4:24 PM

RobH closed subtask Unknown Object (Task) as Resolved.

RobH updated the task description.

RobH unsubscribed.

BCornwall claimed this task. · Tue, Jun 25, 4:59 PM

BCornwall moved this task from Backlog to Traffic team actively servicing on the Traffic board.

Change #1049168 merged by BCornwall:

[operations/puppet@production] cp5017: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1049168

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5017.eqsin.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5017.eqsin.wmnet with OS bullseye executed with errors:

cp5017 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cp5017.eqsin.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5017.eqsin.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5017.eqsin.wmnet with OS bullseye completed:

cp5017 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406251928_brett_704326_cp5017.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
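
The first-Puppet-run log path in the steps above appears to encode a timestamp, the user, a numeric identifier, and the host. The sketch below splits the file name on that assumption; the field meanings (in particular `run_id`) and the `ReimageLog` type are guesses made for illustration, not a documented naming scheme.

```python
from datetime import datetime
from pathlib import PurePosixPath
from typing import NamedTuple


class ReimageLog(NamedTuple):
    started: datetime  # naive timestamp; the log does not state a timezone
    user: str
    run_id: int        # assumption: the meaning of this numeric field is a guess
    host: str


def parse_log_path(path: str) -> ReimageLog:
    """Split a file name of the assumed form <YYYYMMDDHHMM>_<user>_<id>_<host>.out."""
    stem = PurePosixPath(path).stem
    stamp, user, run_id, host = stem.split("_")
    return ReimageLog(datetime.strptime(stamp, "%Y%m%d%H%M"), user, int(run_id), host)


log = parse_log_path(
    "/var/log/spicerack/sre/hosts/reimage/202406251928_brett_704326_cp5017.out"
)
print(log.host, log.started.isoformat(), log.user, log.run_id)
```

A user name containing an underscore would break the four-way split, so treat this as a convenience for reading this task's logs rather than a general parser.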

Change #1049169 merged by BCornwall:

[operations/puppet@production] cp5018: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1049169

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5018.eqsin.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5018.eqsin.wmnet with OS bullseye completed:

cp5018 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406261706_brett_1276932_cp5018.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

BCornwall updated the task description. · Wed, Jun 26, 5:43 PM

Change #1049170 merged by BCornwall:

[operations/puppet@production] cp5019: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1049170

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5019.eqsin.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5019.eqsin.wmnet with OS bullseye executed with errors:

cp5019 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cp5019.eqsin.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5019.eqsin.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5019.eqsin.wmnet with OS bullseye completed:

cp5019 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406261933_brett_1340367_cp5019.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Change #1049171 merged by BCornwall:

[operations/puppet@production] cp5020: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1049171

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5020.eqsin.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5020.eqsin.wmnet with OS bullseye executed with errors:

cp5020 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cp5020.eqsin.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5020.eqsin.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5020.eqsin.wmnet with OS bullseye completed:

cp5020 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406262150_brett_1403267_cp5020.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

BCornwall updated the task description. · Wed, Jun 26, 10:26 PM

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5021.eqsin.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5021.eqsin.wmnet with OS bullseye completed:

cp5021 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406262323_brett_1443394_cp5021.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
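
The reimage results recorded so far can be tallied per host to see which hosts needed a second attempt. This is a small sketch that scans for the `cpNNNN (PASS)` and `cpNNNN (FAIL)` result lines; the `log_lines` list is a hand-copied excerpt of this task's entries, and the function name is illustrative.

```python
import re
from collections import defaultdict

# Matches the per-run result lines posted by the reimage cookbook in this task.
RESULT_LINE = re.compile(r"^(cp\d{4}) \((PASS|FAIL)\)$")


def attempts_by_host(lines):
    """Map each host to its ordered list of reimage outcomes."""
    attempts = defaultdict(list)
    for line in lines:
        match = RESULT_LINE.match(line.strip())
        if match:
            attempts[match.group(1)].append(match.group(2))
    return dict(attempts)


# Result lines copied from the timeline above; other lines are ignored.
log_lines = [
    "cp5017 (FAIL)", "- Downtimed on Icinga/Alertmanager", "cp5017 (PASS)",
    "cp5018 (PASS)",
    "cp5019 (FAIL)", "cp5019 (PASS)",
    "cp5020 (FAIL)", "cp5020 (PASS)",
    "cp5021 (PASS)",
]

summary = attempts_by_host(log_lines)
retried = sorted(host for host, results in summary.items() if "FAIL" in results)
print(summary)
print("needed a retry:", retried)
```

In this excerpt cp5017, cp5019, and cp5020 each failed once before passing, while cp5018 and cp5021 passed on the first attempt.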

Status: Open · Priority: Medium · Visibility: Public

Related Objects
Status	Assigned	Task
		Unknown Object (Task)
Open	BCornwall	T365763 Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234]
		Unknown Object (Task)
