
Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet
Closed, Resolved, Public

Description

This task will track the racking, setup, and OS installation of cloudvirt10[48-50].eqiad.wmnet

Hostname / Racking / Installation Details

Hostnames: cloudvirt10[48-50].eqiad.wmnet
Racking Proposal: Place in WMCS racks. They can be placed anywhere and do not need to be spread across rows. Can be racked next to other cloudvirt*.
Networking/Subnet/VLAN/IP: 2 10G connections per server. Requires cloud-hosts1-eqiad VLAN. (same as existing cloudvirts)
Partitioning/Raid: 2dev standard raid1
OS Distro: Bullseye
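
The bracket notation in the hostname line is shorthand for three hosts. A minimal sketch of how that range expands, assuming a single numeric range per pattern; the helper name expand_hosts is hypothetical and not part of any tooling mentioned in this task:

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one numeric bracket range, e.g. 'cloudvirt10[48-50].eqiad.wmnet'."""
    match = re.search(r"\[(\d+)-(\d+)\]", pattern)
    if match is None:
        return [pattern]  # no range present: treat as a single hostname
    start, end = match.group(1), match.group(2)
    width = len(start)  # keep zero padding if the range uses it
    prefix, suffix = pattern[:match.start()], pattern[match.end():]
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]


hosts = expand_hosts("cloudvirt10[48-50].eqiad.wmnet")
for host in hosts:
    print(host)
```

This prints cloudvirt1048, cloudvirt1049 and cloudvirt1050, each with the .eqiad.wmnet suffix, which matches the three per-host checklists below.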

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

cloudvirt1048
  • - receive in system on procurement task T297727 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
cloudvirt1049
  • - receive in system on procurement task T297727 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
cloudvirt1050
  • - receive in system on procurement task T297727 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.

Event Timeline

RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH removed a subscriber: RobH.

Shouldn't be an issue with installing these in E4 / F4. However the configuration of the switches there won't be completed until next week, so if it's more urgent they may need to go to C8/D5.

Jclark-ctr added a subscriber: Jclark-ctr.

Name           Rack  U   Port   Cable ID
cloudvirt1048  e4    26  30/31  20220085/20220086
cloudvirt1049  e4    27  32/33  20220079/20220078
cloudvirt1050  f4    26  30/31  20220082/20220080
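
The racking table above can be turned into structured records for a quick sanity check before entering it in Netbox. This is only a sketch: the RackEntry type and parse_rack_table helper are hypothetical, and it assumes each row has exactly five whitespace-separated fields with slash-separated port and cable pairs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RackEntry:
    name: str
    rack: str
    unit: int
    ports: tuple
    cable_ids: tuple


# Rows copied from the racking comment (header row omitted).
TABLE = """\
cloudvirt1048 e4 26 30/31 20220085/20220086
cloudvirt1049 e4 27 32/33 20220079/20220078
cloudvirt1050 f4 26 30/31 20220082/20220080
"""


def parse_rack_table(text: str) -> list[RackEntry]:
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, rack, unit, ports, cables = line.split()
        entries.append(
            RackEntry(
                name=name,
                rack=rack.upper(),
                unit=int(unit),
                ports=tuple(ports.split("/")),
                cable_ids=tuple(cables.split("/")),
            )
        )
    return entries


entries = parse_rack_table(TABLE)

# Each server has two 10G connections, so expect two ports and two cables.
for entry in entries:
    assert len(entry.ports) == len(entry.cable_ids) == 2

# A cable ID should never appear twice across hosts.
all_cables = [cable for entry in entries for cable in entry.cable_ids]
assert len(all_cables) == len(set(all_cables))

for entry in entries:
    print(f"{entry.name}: rack {entry.rack} U{entry.unit}, ports {'/'.join(entry.ports)}")
```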

These are racked, but the switches are not in Netbox yet, so I am blocked.

@Cmjohnson The switches are in Netbox:

https://netbox.wikimedia.org/dcim/devices/3931/

https://netbox.wikimedia.org/dcim/devices/3935

There are no Vlans configured for them yet though, so probably best to wait until we get into the configuration of them before adding. I'll update the task when we're at that stage.

cmooney added a subscriber: Cmjohnson.

This requires the updated WMCS network design to be agreed / validated (T304989) after which we can quickly complete the actual device configuration (T304936). Once that is ready we can proceed with the server provisioning as normal.

With the design in T304989 now finalized, I cautiously believe this task is also ready to proceed. @cmooney can you confirm?

@nskaggs I believe that to be the case, yes. I've not been able to successfully reimage any of the cloudcephosd hosts that are in a similar state, though.

@Cmjohnson can you confirm the current status of these servers? Are they powered on and ready for next steps? That should be do-able now.

One thing to note as it's not been mentioned in the task description is that the '--enable-virtualization' flag should be used when running the sre.hosts.provision cookbook against them (as they are OpenStack hypervisors).
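
As a rough illustration of the note above, the snippet below only builds and prints one command line per host; it does not run anything. The cookbook name and the --enable-virtualization flag come from the comment, while the cookbook entry point name and passing the short hostname as a positional argument are assumptions that should be checked against the cookbook's own help output.

```python
import shlex

HOSTS = ["cloudvirt1048", "cloudvirt1049", "cloudvirt1050"]


def provision_command(host: str) -> list[str]:
    # Assumption: 'cookbook' is the entry point and the host is a positional
    # argument. Only the cookbook name and the flag are taken from the task.
    return ["cookbook", "sre.hosts.provision", "--enable-virtualization", host]


for host in HOSTS:
    # Print for review instead of executing.
    print(shlex.join(provision_command(host)))
```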

@cmooney the switches do not show up in netbox as an option for the provisioning script. I tagged Arzhel in a different ticket about it. Once completed I can start setting these up. Noted on the virtualization, that will need to be a manual change.

Change 809656 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] Adding new cloudvirt hosts to site.pp and netboot.cfg

https://gerrit.wikimedia.org/r/809656

Change 809656 abandoned by Cmjohnson:

[operations/puppet@production] Adding new cloudvirt hosts to site.pp and netboot.cfg

Reason:

https://gerrit.wikimedia.org/r/809656

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudvirt1048.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudvirt1049.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudvirt1050.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudvirt1051.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudvirt1052.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudvirt1053.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudvirt1048.eqiad.wmnet with OS bullseye completed:

  • cloudvirt1048 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206291936_cmjohnson_819891_cloudvirt1048.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
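
The first-Puppet-run log path in the report above encodes the start time, the user and the host. A small sketch for pulling those fields out, assuming the file name pattern is timestamp_user_number_host.out; the meaning of the numeric field is not stated in the task, so it is kept here under the hypothetical name run_id, and the timestamp's timezone is likewise unknown.

```python
import re
from datetime import datetime
from pathlib import PurePosixPath

LOG_NAME = re.compile(
    r"(?P<stamp>\d{12})_(?P<user>[^_]+)_(?P<run_id>\d+)_(?P<host>[^_.]+)\.out"
)


def parse_log_path(path: str) -> dict:
    """Split a reimage log file name into its underscore-separated fields."""
    match = LOG_NAME.fullmatch(PurePosixPath(path).name)
    if match is None:
        raise ValueError(f"unexpected log file name: {path}")
    fields = match.groupdict()
    # YYYYMMDDHHMM, returned as a naive datetime (timezone not given in the log).
    fields["stamp"] = datetime.strptime(fields["stamp"], "%Y%m%d%H%M")
    return fields


info = parse_log_path(
    "/var/log/spicerack/sre/hosts/reimage/"
    "202206291936_cmjohnson_819891_cloudvirt1048.out"
)
print(info["host"], info["user"], info["stamp"].isoformat())
```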

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudvirt1049.eqiad.wmnet with OS bullseye completed:

  • cloudvirt1049 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206291946_cmjohnson_821040_cloudvirt1049.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudvirt1050.eqiad.wmnet with OS bullseye completed:

  • cloudvirt1050 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206291946_cmjohnson_821063_cloudvirt1050.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudvirt1051.eqiad.wmnet with OS bullseye completed:

  • cloudvirt1051 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206291946_cmjohnson_821143_cloudvirt1051.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudvirt1052.eqiad.wmnet with OS bullseye completed:

  • cloudvirt1052 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206291946_cmjohnson_821192_cloudvirt1052.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudvirt1053.eqiad.wmnet with OS bullseye completed:

  • cloudvirt1053 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206291947_cmjohnson_821227_cloudvirt1053.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
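
With six reports in a row it is easy to miss the single WARN (cloudvirt1052, where the downtime step returned 99). A minimal sketch for summarizing such reports, assuming each report starts with a "host (STATUS)" bullet and that a failed step begins with "Unable to", which is inferred from the one warning seen here; the summarize helper is hypothetical.

```python
import re

# Matches report header bullets such as "  • cloudvirt1052 (WARN)".
HEADER = re.compile(r"^\s*•\s+(?P<host>\S+)\s+\((?P<status>[A-Z]+)\)\s*$")

# Shortened excerpt of two of the reports above.
SAMPLE = """\
  • cloudvirt1051 (PASS)
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
  • cloudvirt1052 (WARN)
    • Found Nagios_host resource for this host in PuppetDB
    • Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
"""


def summarize(report: str) -> dict:
    results = {}
    current = None
    for line in report.splitlines():
        header = HEADER.match(line)
        if header:
            current = header["host"]
            results[current] = {"status": header["status"], "problems": []}
        elif current is not None and "Unable to" in line:
            # Heuristic: treat "Unable to ..." steps as the cause of a WARN.
            results[current]["problems"].append(line.strip(" •"))
    return results


results = summarize(SAMPLE)
for host, result in results.items():
    print(host, result["status"], result["problems"])
```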

Pinging @Andrew so he knows the base image has been completed. Resolving the task.

Hi @Cmjohnson I think there was a mix-up for cloudvirt1050 in Netbox for the cable details.

Looking at the host's connections in Netbox: https://netbox.wikimedia.org/dcim/devices/4128/interfaces/

The 'mgmt' interface shows as being connected to cloudsw1-f4-eqiad xe-0/0/31. I think this is an error. The mgmt (iDRAC) interface on it should, and I believe is, connected to the management switch in that cab. We don't document those mgmt connections in Netbox.

Host interface 'eno2np1' is instead connected to cloudsw1-f4-eqiad xe-0/0/31 from what I can tell.

If you look at the Netbox interfaces for cloudvirt1049 it should look like that, 2 interfaces connected to the switch, no IP on the second one, and the mgmt interface with an IP but no connection. I think it should be enough to change the A-end termination on the cable to port eno2np1.
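
The expected interface layout described above can be written down as a simple check. This is a sketch over plain dictionaries, not a Netbox API call: the record shape, the eno1np0 name, the xe-0/0/30 port and the 192.0.2.x addresses are placeholders chosen for illustration, while eno2np1, mgmt and cloudsw1-f4-eqiad xe-0/0/31 come from the comment.

```python
def check_interfaces(interfaces: list) -> list:
    """Return a list of deviations from the expected cloudvirt layout."""
    problems = []
    by_name = {iface["name"]: iface for iface in interfaces}

    mgmt = by_name.get("mgmt")
    if mgmt is None:
        problems.append("no mgmt interface")
    else:
        if mgmt["connected_to"] is not None:
            problems.append("mgmt should have no documented connection")
        if mgmt["ip"] is None:
            problems.append("mgmt should have an IP")

    connected = [
        iface
        for iface in interfaces
        if iface["name"] != "mgmt" and iface["connected_to"] is not None
    ]
    if len(connected) != 2:
        problems.append(
            f"expected 2 switch-connected interfaces, found {len(connected)}"
        )
    elif connected[1]["ip"] is not None:
        problems.append("second interface should have no IP")
    return problems


# State as reported for cloudvirt1050 (placeholder IPs and first port).
reported = [
    {"name": "eno1np0", "connected_to": "cloudsw1-f4-eqiad xe-0/0/30", "ip": "192.0.2.10"},
    {"name": "eno2np1", "connected_to": None, "ip": None},
    {"name": "mgmt", "connected_to": "cloudsw1-f4-eqiad xe-0/0/31", "ip": "192.0.2.110"},
]

# State after moving the cable's A-end termination to eno2np1.
corrected = [
    {"name": "eno1np0", "connected_to": "cloudsw1-f4-eqiad xe-0/0/30", "ip": "192.0.2.10"},
    {"name": "eno2np1", "connected_to": "cloudsw1-f4-eqiad xe-0/0/31", "ip": None},
    {"name": "mgmt", "connected_to": None, "ip": "192.0.2.110"},
]

print(check_interfaces(reported))
print(check_interfaces(corrected))
```

The first call reports the mgmt connection and the missing second switch link; the second call returns an empty list.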

Can you have a look when you get a minute? Thanks.

@cmooney the 2nd interface requires manual input, and I mistakenly connected it to the mgmt port. Updated.

Change 814911 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Install nova on new cloudvirt hosts

https://gerrit.wikimedia.org/r/814911

Change 814911 merged by Andrew Bogott:

[operations/puppet@production] Install nova on new cloudvirt hosts

https://gerrit.wikimedia.org/r/814911

The second (VM) network does not seem to be working on these hosts -- I can schedule VMs (via the control plane) but those VMs have no network access. This doesn't look like a cabling issue, I suspect it's something further upstream.

These are in service now.