Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet
Closed, Resolved, Public

Description

This task will track the racking, setup, and OS installation of cloudmetrics100[34].eqiad.wmnet

Hostname / Racking / Installation Details

Hostnames: cloudmetrics100[34].eqiad.wmnet
Racking Proposal: The original racking proposal provided by WMCS was: "Use WMCS dedicated racks. These can be placed in any WMCS rack and co-exist with any WMCS service. However, please place each in a separate row." However, I (RobH) am not aware of any 1G dedicated WMCS racks outside of row B, so I'll defer to the judgement of the on-sites on which 1G racks to place these in. Please note that the network requirements provided by WMCS list the cloud-support1 vlan, which, from what I see in netbox, only exists in rows A and C. If in doubt, stall racking these and request clarification from WMCS.
Networking/Subnet/VLAN/IP: Single 1G network connection to the cloud-support1 vlan.
Partitioning/Raid: standard raid10-4dev
OS Distro: Bullseye
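
The bracket shorthand in the hostnames above (cloudmetrics100[34]) stands for two hosts. As a minimal sketch, assuming the brackets hold a set of single-character alternatives (optionally comma-separated, as in the later patch titles), it could be expanded like this. The function name expand_hosts is hypothetical and this is not the syntax of any particular tool.

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one bracketed character set, e.g. 'name100[34]' -> two names."""
    match = re.search(r"\[([^\]]+)\]", pattern)
    if match is None:
        # No brackets: the pattern is already a single hostname.
        return [pattern]
    # Assumption: each character inside the brackets is one alternative;
    # commas are treated as separators and skipped.
    choices = [c for c in match.group(1) if c != ","]
    prefix, suffix = pattern[:match.start()], pattern[match.end():]
    return [prefix + c + suffix for c in choices]


print(expand_hosts("cloudmetrics100[34].eqiad.wmnet"))
# ['cloudmetrics1003.eqiad.wmnet', 'cloudmetrics1004.eqiad.wmnet']
```

Only the first bracket group is expanded, which is enough for the names used in this task.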

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

cloudmetrics1003:

  • - receive in system on procurement task T286589 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

cloudmetrics1004:

  • - receive in system on procurement task T286589 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.

Event Timeline

RobH renamed this task from (Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet to Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet. (Aug 27 2021, 9:31 PM)
RobH created this task.
RobH moved this task from Backlog to Racking / Decom on the cloud-services-team (Hardware) board.
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH added a parent task: Unknown Object (Task).
RobH mentioned this in Unknown Object (Task).

@nskaggs and @aborrero Just checking on this: do we want to do a straight refresh, exactly as the existing hosts are deployed? The cloud-support vlan was going away last I checked. If we do a refresh exactly as the originals are deployed, these would be in racks C and A rather than in the cloud-dedicated areas. The alternative would probably be the cloud-hosts vlan (which is not where we've got cloudmetrics1001/2 today).

cloudmetrics1003 A6 U29 Port29 Cableid#1952
cloudmetrics1004 C5 U29 Port34 Cableid#3315

Jclark-ctr added subscribers: wiki_willy, Jclark-ctr.

@wiki_willy can you help clarify the racking requirements for these hosts? @cmjohnson1 These have been racked in racks that support the cloud-support1 vlan. I believe there is some uncertainty about whether the vlan is going away.

Hi @nskaggs & @aborrero - let us know what you decide, so we can make sure these servers are properly placed. Thanks!

In our team meeting today, we figured that a straight refresh of cloudmetrics1001/2 as the systems were provisioned previously might be best for now. Taking over 10G space for 1G hosts doesn't seem sensible.

Cool, thanks for confirming @Bstorm ...we'll definitely miss working with ya, and wish you all the best!

Thanks for getting these racked up @Jclark-ctr, all yours @Cmjohnson

Change 728633 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] Adding site.pp entry and dhcpd entry for cloudmetric100[34]

https://gerrit.wikimedia.org/r/728633

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

cloudmetrics1003.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202110081959_cmjohnson_27096_cloudmetrics1003_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

cloudmetrics1004.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202110082000_cmjohnson_27191_cloudmetrics1004_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudmetrics1004.eqiad.wmnet']

Of which those FAILED:

['cloudmetrics1004.eqiad.wmnet']

Completed auto-reimage of hosts:

['cloudmetrics1003.eqiad.wmnet']

Of which those FAILED:

['cloudmetrics1003.eqiad.wmnet']

Change 728633 merged by Cmjohnson:

[operations/puppet@production] Adding site.pp entry and dhcpd entry for cloudmetric100[34]

https://gerrit.wikimedia.org/r/728633

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

cloudmetrics1003.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202110082011_cmjohnson_30438_cloudmetrics1003_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

cloudmetrics1004.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202110082012_cmjohnson_30855_cloudmetrics1004_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudmetrics1003.eqiad.wmnet']

Of which those FAILED:

['cloudmetrics1003.eqiad.wmnet']

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

cloudmetrics1003.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202110082016_cmjohnson_3482_cloudmetrics1003_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudmetrics1004.eqiad.wmnet']

Of which those FAILED:

['cloudmetrics1004.eqiad.wmnet']

Completed auto-reimage of hosts:

['cloudmetrics1003.eqiad.wmnet']

Of which those FAILED:

['cloudmetrics1003.eqiad.wmnet']

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudmetrics1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudmetrics1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudmetrics1003.eqiad.wmnet with OS bullseye executed with errors:

  • cloudmetrics1003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudmetrics1004.eqiad.wmnet with OS bullseye completed:

  • cloudmetrics1004 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110191710_cmjohnson_22802_cloudmetrics1004.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
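
Each cookbook summary above starts with a per-host line such as "cloudmetrics1003 (FAIL)" or "cloudmetrics1004 (PASS)". As a rough sketch for anyone triaging a pasted summary, the following pulls those status lines out of the text. The helper name host_statuses and the regular expression are assumptions based only on the lines shown in this task, not on the cookbook's documented output format.

```python
import re

# Matches a bullet line of the form "• <host> (<STATUS>)" with nothing else on it.
STATUS_LINE = re.compile(r"^\s*•\s*(?P<host>\S+)\s+\((?P<status>[A-Z]+)\)\s*$")


def host_statuses(summary: str) -> dict[str, str]:
    """Map each host to the status shown on its summary line."""
    result = {}
    for line in summary.splitlines():
        m = STATUS_LINE.match(line)
        if m:
            result[m["host"]] = m["status"]
    return result


sample = """\
  • cloudmetrics1003 (FAIL)
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details
  • cloudmetrics1004 (PASS)
    • Updated Netbox status planned -> staged
"""
print(host_statuses(sample))
# {'cloudmetrics1003': 'FAIL', 'cloudmetrics1004': 'PASS'}
```

Step lines such as "Host up (Debian installer)" are ignored because they contain more than one word before the parenthesis.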

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudmetrics1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudmetrics1003.eqiad.wmnet with OS bullseye completed:

  • cloudmetrics1003 (WARN)
    • Downtimed on Icinga
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110191812_cmjohnson_17505_cloudmetrics1003.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
Cmjohnson updated the task description.

These are finished with on-site work and ready to be turned over.

Change 745948 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Prepare cloudmetrics100[3,4] to replace cloudmetrics100[1,2]

https://gerrit.wikimedia.org/r/745948

Change 745949 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloudmetrics: replace cloudmetrics1002 with 1003 as the backup host

https://gerrit.wikimedia.org/r/745949

Change 745950 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloudmetrics: make cloudmetrics1003 the primary, 1004 the secondary

https://gerrit.wikimedia.org/r/745950

Change 745951 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloudmetrics: remove refs for cloudmetrics1001/1002 and prepare for decom

https://gerrit.wikimedia.org/r/745951

Change 745948 merged by Andrew Bogott:

[operations/puppet@production] Prepare cloudmetrics100[3,4] to replace cloudmetrics100[1,2]

https://gerrit.wikimedia.org/r/745948

Update: it seems we aren't ready to run Grafana on Bullseye yet, so I'm rolling these back to Buster.

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudmetrics1003.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudmetrics1004.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudmetrics1003.eqiad.wmnet with OS buster completed:

  • cloudmetrics1003 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112132104_andrew_32090_cloudmetrics1003.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudmetrics1004.eqiad.wmnet with OS buster completed:

  • cloudmetrics1004 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112132120_andrew_6812_cloudmetrics1004.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
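
The log file names quoted in the reimage entries above appear to share one layout: a 12-digit timestamp, the user, a number, and the host. A minimal sketch that splits such a name into its parts follows. The field meanings are inferred from the names in this task, not from tool documentation, and parse_log_name and run_id are hypothetical names (the number may be a process ID, but that is an assumption).

```python
import re
from datetime import datetime
from pathlib import PurePosixPath

# Assumed layout: <YYYYMMDDHHMM>_<user>_<number>_<host>
LOG_NAME = re.compile(
    r"^(?P<stamp>\d{12})_(?P<user>[^_]+)_(?P<run_id>\d+)_(?P<host>.+)$"
)


def parse_log_name(path: str) -> dict:
    """Split a reimage log path into start time, user, number, and host."""
    stem = PurePosixPath(path).stem  # drops the directory and the .out/.log suffix
    m = LOG_NAME.match(stem)
    if m is None:
        raise ValueError(f"unrecognised log name: {path}")
    return {
        "started": datetime.strptime(m["stamp"], "%Y%m%d%H%M"),
        "user": m["user"],
        "run_id": int(m["run_id"]),
        "host": m["host"],
    }


info = parse_log_name(
    "/var/log/spicerack/sre/hosts/reimage/"
    "202112132120_andrew_6812_cloudmetrics1004.out"
)
print(info["started"], info["user"], info["host"])
# 2021-12-13 21:20:00 andrew cloudmetrics1004
```

The same pattern also accepts the older wmf-auto-reimage names, where the host part is written with underscores (for example cloudmetrics1003_eqiad_wmnet).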

Change 745949 merged by Andrew Bogott:

[operations/puppet@production] cloudmetrics: replace cloudmetrics1002 with 1003 as the backup host

https://gerrit.wikimedia.org/r/745949

Change 745950 merged by Andrew Bogott:

[operations/puppet@production] cloudmetrics: make cloudmetrics1003 the primary, 1004 the secondary

https://gerrit.wikimedia.org/r/745950

Change 745951 merged by Andrew Bogott:

[operations/puppet@production] cloudmetrics: remove refs for cloudmetrics1001/1002 and prepare for decom

https://gerrit.wikimedia.org/r/745951

Change 747494 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/mediawiki-config@master] LabsServices: refresh cloudmetrics server

https://gerrit.wikimedia.org/r/747494