T289888 Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet

Subject	Repo	Branch	Lines +/-
cloudmetrics: remove refs for cloudmetrics1001/1002 and prepare for decom	operations/puppet	production	+4 -35
cloudmetrics: make cloudmetrics1003 the primary, 1004 the secondary	operations/puppet	production	+11 -11
cloudmetrics: replace cloudmetrics1002 with 1003 as the backup host	operations/puppet	production	+1 -1
Prepare cloudmetrics100[3,4] to replace cloudmetrics100[1,2]	operations/puppet	production	+33 -3
Adding site.pp entry and dhcpd entry for cloudmetric100[34]	operations/puppet	production	+15 -0

RobH renamed this task from (Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet to Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet. Aug 27 2021, 9:31 PM

RobH created this task.

RobH moved this task from Backlog to Racking / Decom on the cloud-services-team (Hardware) board.

RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.

RobH added a parent task: Unknown Object (Task).

RobH mentioned this in Unknown Object (Task).

RobH assigned this task to Jclark-ctr. Aug 27 2021, 9:33 PM

RobH unsubscribed.

Maintenance_bot added a project: SRE. Aug 27 2021, 9:45 PM

Jclark-ctr updated the task description. Oct 2 2021, 1:41 AM

@nskaggs and @aborrero Just checking on this: do we want a straight refresh, with everything exactly as it is today? The cloud-support vlan was going away, last I checked. If we do a refresh exactly as the originals are deployed, it would be in racks C and A rather than in the cloud-dedicated areas. That would probably be the cloud-hosts vlan (which is not where we've got cloudmetrics1001/2 today).

cloudmetrics1003 A6 U29 Port29 Cableid#1952
cloudmetrics1004 C5 U29 Port34 Cableid#3315

Jclark-ctr updated the task description. Oct 6 2021, 7:22 PM

@wiki_willy can you help clarify racking requirements for these hosts? @cmjohnson1 These have been racked in racks that support the cloud-support1 vlan; I believe there is some uncertainty about whether that vlan is going away.

Hi @nskaggs & @aborrero - let us know what you decide, so we can make sure these servers are properly placed. Thanks!

In T289888#7403968, @Bstorm wrote:

@nskaggs and @aborrero Just checking on this: do we want a straight refresh, with everything exactly as it is today? The cloud-support vlan was going away, last I checked. If we do a refresh exactly as the originals are deployed, it would be in racks C and A rather than in the cloud-dedicated areas. That would probably be the cloud-hosts vlan (which is not where we've got cloudmetrics1001/2 today).

In our team meeting today, we figured that a straight refresh of cloudmetrics1001/2 as the systems were provisioned previously might be best for now. Taking over 10G space for 1G hosts doesn't seem sensible.

Cool, thanks for confirming @Bstorm ...we'll definitely miss working with ya, and wish you all the best!

In T289888#7407068, @Bstorm wrote:

In our team meeting today, we figured that a straight refresh of cloudmetrics1001/2 as the systems were provisioned previously might be best for now. Taking over 10G space for 1G hosts doesn't seem sensible.

Thanks for getting these racked up @Jclark-ctr, all yours @Cmjohnson

Cmjohnson updated the task description. Oct 7 2021, 1:21 PM

Change 728633 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] Adding site.pp entry and dhcpd entry for cloudmetric100[34]

https://gerrit.wikimedia.org/r/728633

gerritbot added a project: Patch-For-Review. Oct 8 2021, 7:51 PM

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

cloudmetrics1003.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202110081959_cmjohnson_27096_cloudmetrics1003_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

cloudmetrics1004.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202110082000_cmjohnson_27191_cloudmetrics1004_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudmetrics1004.eqiad.wmnet']

Of which those FAILED:

['cloudmetrics1004.eqiad.wmnet']

Completed auto-reimage of hosts:

['cloudmetrics1003.eqiad.wmnet']

Of which those FAILED:

['cloudmetrics1003.eqiad.wmnet']

Change 728633 merged by Cmjohnson:

[operations/puppet@production] Adding site.pp entry and dhcpd entry for cloudmetric100[34]

https://gerrit.wikimedia.org/r/728633

Maintenance_bot removed a project: Patch-For-Review. Oct 8 2021, 8:10 PM

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

cloudmetrics1003.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202110082011_cmjohnson_30438_cloudmetrics1003_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

cloudmetrics1004.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202110082012_cmjohnson_30855_cloudmetrics1004_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudmetrics1003.eqiad.wmnet']

Of which those FAILED:

['cloudmetrics1003.eqiad.wmnet']

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

cloudmetrics1003.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202110082016_cmjohnson_3482_cloudmetrics1003_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudmetrics1004.eqiad.wmnet']

Of which those FAILED:

['cloudmetrics1004.eqiad.wmnet']

Completed auto-reimage of hosts:

['cloudmetrics1003.eqiad.wmnet']

Of which those FAILED:

['cloudmetrics1003.eqiad.wmnet']

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudmetrics1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudmetrics1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudmetrics1003.eqiad.wmnet with OS bullseye executed with errors:

cloudmetrics1003 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudmetrics1004.eqiad.wmnet with OS bullseye completed:

cloudmetrics1004 (PASS)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110191710_cmjohnson_22802_cloudmetrics1004.out
- Checked BIOS boot parameters are back to normal
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> staged
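Comparing the two reports above shows where the failed cloudmetrics1003 run stopped. The following is a minimal sketch, assuming only the report layout visible in this task (a `host (STATUS)` header followed by `- step` lines). The helper names `parse_report` and `first_missing_step` are hypothetical and are not part of the reimage cookbook or Spicerack, and the sample reports are abbreviated excerpts of the ones posted here.

```python
import re

# Header line of a report block, e.g. "cloudmetrics1003 (FAIL)".
HEADER = re.compile(r"^(?P<host>\S+) \((?P<status>[A-Z]+)\)$")


def parse_report(text):
    """Return (host, status, steps) parsed from one report block."""
    host = status = None
    steps = []
    for line in text.strip().splitlines():
        line = line.strip()
        match = HEADER.match(line)
        if match:
            host, status = match.group("host"), match.group("status")
        elif line.startswith("- "):
            steps.append(line[2:])
    return host, status, steps


def first_missing_step(reference_steps, failed_steps):
    """Return the first reference step the failed run never reported."""
    seen = set(failed_steps)
    for step in reference_steps:
        if step not in seen:
            return step
    return None


# Abbreviated excerpts of the reports in this task (not the full step lists).
PASS_REPORT = """
cloudmetrics1004 (PASS)
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga
- Rebooted
"""

FAIL_REPORT = """
cloudmetrics1003 (FAIL)
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- The reimage failed, see the cookbook logs for the details
"""

if __name__ == "__main__":
    _, _, reference = parse_report(PASS_REPORT)
    host, status, steps = parse_report(FAIL_REPORT)
    print(host, status, "stopped before:", first_missing_step(reference, steps))
```

On these excerpts the failed run is reported as stopping before "Downtimed the new host on Icinga". That only narrows down where to look; the cause itself would still have to come from the cookbook logs the report points to.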

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudmetrics1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudmetrics1003.eqiad.wmnet with OS bullseye completed:

cloudmetrics1003 (WARN)
- Downtimed on Icinga
- Unable to disable Puppet, the host may have been unreachable
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110191812_cmjohnson_17505_cloudmetrics1003.out
- Checked BIOS boot parameters are back to normal
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> staged

On-site work for these hosts is finished, and they are ready to be turned over.

Cmjohnson mentioned this in T284471: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet. Oct 25 2021, 3:08 PM

Change 745948 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Prepare cloudmetrics100[3,4] to replace cloudmetrics100[1,2]

https://gerrit.wikimedia.org/r/745948

Change 745949 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloudmetrics: replace cloudmetrics1002 with 1003 as the backup host

https://gerrit.wikimedia.org/r/745949

Change 745950 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloudmetrics: make cloudmetrics1003 the primary, 1004 the secondary

https://gerrit.wikimedia.org/r/745950

Change 745951 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloudmetrics: remove refs for cloudmetrics1001/1002 and prepare for decom

https://gerrit.wikimedia.org/r/745951

Change 745948 merged by Andrew Bogott:

[operations/puppet@production] Prepare cloudmetrics100[3,4] to replace cloudmetrics100[1,2]

https://gerrit.wikimedia.org/r/745948

Update: it seems we aren't ready to run Grafana on bullseye yet, so I'm rolling these back to buster.

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudmetrics1003.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudmetrics1004.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudmetrics1003.eqiad.wmnet with OS buster completed:

cloudmetrics1003 (PASS)
- Downtimed on Icinga
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh buster OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112132104_andrew_32090_cloudmetrics1003.out
- Checked BIOS boot parameters are back to normal
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudmetrics1004.eqiad.wmnet with OS buster completed:

cloudmetrics1004 (PASS)
- Downtimed on Icinga
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh buster OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112132120_andrew_6812_cloudmetrics1004.out
- Checked BIOS boot parameters are back to normal
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Change 745949 merged by Andrew Bogott:

[operations/puppet@production] cloudmetrics: replace cloudmetrics1002 with 1003 as the backup host

https://gerrit.wikimedia.org/r/745949

Change 745950 merged by Andrew Bogott:

[operations/puppet@production] cloudmetrics: make cloudmetrics1003 the primary, 1004 the secondary

https://gerrit.wikimedia.org/r/745950

Change 745951 merged by Andrew Bogott:

[operations/puppet@production] cloudmetrics: remove refs for cloudmetrics1001/1002 and prepare for decom

https://gerrit.wikimedia.org/r/745951

Maintenance_bot removed a project: Patch-For-Review. Dec 15 2021, 2:10 AM

Change 747494 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/mediawiki-config@master] LabsServices: refresh cloudmetrics server

https://gerrit.wikimedia.org/r/747494

gerritbot added a project: Patch-For-Review. Dec 15 2021, 11:24 AM

RobH mentioned this in T297814: cloudmetrics1003 seizes up under load. Dec 20 2021, 6:35 PM

aborrero added a subtask: T299744: cloudmetrics1004 potential hardware problem. Jan 21 2022, 9:13 AM

aborrero added a subtask: T297814: cloudmetrics1003 seizes up under load.

wiki_willy closed subtask T299744: cloudmetrics1004 potential hardware problem as Resolved. Feb 2 2022, 10:47 PM

wiki_willy closed subtask T297814: cloudmetrics1003 seizes up under load as Resolved.

Andrew reopened subtask T297814: cloudmetrics1003 seizes up under load as Open. Feb 2 2022, 11:53 PM

Andrew closed subtask T297814: cloudmetrics1003 seizes up under load as Resolved. Feb 3 2022, 4:44 PM

Closed, Resolved (Public)

Related Objects

Status	Assigned	Task
		Unknown Object (Task)
Resolved	Cmjohnson	T289888 Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet
Resolved	Cmjohnson	T299744 cloudmetrics1004 potential hardware problem
Resolved	Cmjohnson	T297814 cloudmetrics1003 seizes up under load
