Q3:rack/setup/install ms-be207[0-3]
Closed, Resolved, Public

Description

This task will track the racking, setup, and OS installation of ms-be207[0-3]

Hostname / Racking / Installation Details

Hostnames: ms-be207[0-3].codfw.wmnet
Racking Proposal: 1 node per rack please
Networking Setup: 10G private VLAN, like existing ms-be* nodes
Partitioning/Raid: JBOD please. Unlike previous ms-be* nodes, we now want everything non-RAID (cf. T308677)
OS Distro: Bullseye
Sub-team Technical Contact: @MatthewVernon
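
The bracket shorthand ms-be207[0-3] covers four hosts. As a small sketch (the function name `expand_hosts` is only illustrative, not part of any tooling mentioned in this task), the range can be expanded into fully qualified names so that a per-host step can loop over them:

```shell
# Expand the ms-be207[0-3] shorthand into one FQDN per line.
# expand_hosts is a hypothetical helper name used only for this example.
expand_hosts() {
    for i in 0 1 2 3; do
        printf 'ms-be207%s.codfw.wmnet\n' "$i"
    done
}

# Print the four hostnames tracked by this task.
expand_hosts
```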

Per host setup checklist

ms-be2070: Rack A4 - U9 - Port 8
  • - receive in system on procurement task T325211 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
ms-be2071: Rack B4 - U11 - Port 10
  • - receive in system on procurement task T325211 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
ms-be2072: Rack C4 - U9 - Port 8
  • - receive in system on procurement task T325211 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
ms-be2073: Rack D7 - U9 - Port 7
  • - receive in system on procurement task T325211 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.

Event Timeline

RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task).
RobH unsubscribed.
Jhancock.wm updated the task description.
Jhancock.wm subscribed.

I need the partman recipe for those nodes

I am still waiting on the partman recipe.

Hi, this is on my TODO, but these backends are a low priority for us at the moment (compared to the frontends, which are a really high priority).

Sorry for the delay.

I understand that this is a low priority for you, but it is not for me, since I have to meet my install SLA of 30 days. I can remove the ops-codfw tag, remove myself from the task, and assign it to you to do the install when ready, and consider this done on my end.

Please don't do that; I'll try and get back to you before the end of the week.

Change 894009 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] install_server: use newer partman setup for new ms backends

https://gerrit.wikimedia.org/r/894009

Change 894009 merged by MVernon:

[operations/puppet@production] install_server: use newer partman setup for new ms backends

https://gerrit.wikimedia.org/r/894009

@Papaul there's now a partman recipe for these new nodes (see the above merged CR).

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2070.codfw.wmnet with OS bullseye

@MatthewVernon on ms-be2070 the OS install completed with no issues using the partman recipe, and the server booted into the OS. However, after the puppet run and the second reboot by the cookbook, the server is stuck in a continuous boot loop and no longer boots into the OS.

Before the OS install the boot device was set to the first SSD, and the server booted from it into the OS without any issue; the continuous boot loop only started after the puppet run and the second reboot by the cookbook.

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2070.codfw.wmnet with OS bullseye executed with errors:

  • ms-be2070 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202303031622_pt1979_3127564_ms-be2070.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2070.codfw.wmnet with OS bullseye

On the second run I got:

Booting from Hard drive C:
GRUB

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2070.codfw.wmnet with OS bullseye executed with errors:

  • ms-be2070 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202303031858_pt1979_3201152_ms-be2070.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • The reimage failed, see the cookbook logs for the details

@jbond I dunno if you have any thoughts about this? I've had a look at the iDRAC, and it has one of the SSDs as the boot device to try (which I'd expect to work), and all of the drives are set to non-RAID (and the convert-disks cookbook indeed says nothing to do). And it seemingly boots OK as a vanilla OS install.

But then the first puppet run seemingly does something that hoses the boot process - console has

Booting from Hard drive C:
GRUB

and is entirely unresponsive.

A quick look through the puppet log isn't very enlightening :-/

@MatthewVernon I haven't looked in depth, but my best guess is that you need to set profile::swift::storage::disks_by_path: true for the specific hosts. This makes puppet use profile::swift::storage::configure_disks rather than swift::init_device, which I suspect is what is wiping out the partitions. FYI, you will also need to update modules/swift/files/{eqiad,codfw}-prod_hosts.yaml at some point as well.

d'oh, that seems likely, thank you!

[yes we'll need a new storage schema in hosts.yaml, but that's when actually bringing into service, which isn't really a priority ATM]

Change 895141 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] hiera: use a regex to specify new-style storage hosts

https://gerrit.wikimedia.org/r/895141

Change 895141 merged by MVernon:

[operations/puppet@production] hiera: use a regex to specify new-style storage hosts

https://gerrit.wikimedia.org/r/895141
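
The merged change selects the new-style storage hosts by regex, but the pattern itself is not quoted in this task. As a rough illustration only, the sketch below uses an assumed pattern that matches exactly the four hosts from this task; both the pattern and the `is_new_style_host` name are hypothetical and may differ from what the hiera change actually uses:

```shell
# Return success if the given FQDN matches the ASSUMED new-style host pattern.
# The real regex in the hiera change is not shown in this task.
is_new_style_host() {
    printf '%s\n' "$1" | grep -Eq '^ms-be207[0-3]\.codfw\.wmnet$'
}

# Report which of a few sample names the assumed pattern would select.
for host in ms-be2070.codfw.wmnet ms-be2073.codfw.wmnet ms-be2074.codfw.wmnet; do
    if is_new_style_host "$host"; then
        echo "$host: matches"
    else
        echo "$host: does not match"
    fi
done
```

Anchoring the pattern with ^ and $ and escaping the dots keeps it from accidentally matching longer or similar-looking hostnames.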

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-be2070.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-be2070.codfw.wmnet with OS bullseye completed:

  • ms-be2070 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202303071144_mvernon_2594352_ms-be2070.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active

@Papaul I've fixed the underlying problems and you'll see ms-be2070 reimaged to successful completion now, so hopefully that's you unblocked here.

...the icinga warning was systemd timing out waiting for smartd to start up (takes about 2 minutes).

Change 895309 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] smart: override unit to make systemd wait longer

https://gerrit.wikimedia.org/r/895309

Change 895309 merged by MVernon:

[operations/puppet@production] smart: override unit to make systemd wait longer

https://gerrit.wikimedia.org/r/895309
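
The content of that change is not quoted here, and in production it is managed through Puppet. As a minimal sketch of the general mechanism, a systemd drop-in can raise a unit's start timeout with the `TimeoutStartSec` directive. The helper name, the 300-second value, and the demo directory below are all assumptions for illustration, not values taken from the patch:

```shell
# Write a systemd drop-in that raises the start timeout for a unit.
# Hypothetical helper; the timeout value is an assumption, not from the patch.
write_timeout_override() {
    dropin_dir="$1"
    timeout_sec="$2"
    mkdir -p "$dropin_dir"
    printf '[Service]\nTimeoutStartSec=%s\n' "$timeout_sec" > "$dropin_dir/override.conf"
}

# Demonstrate against a throwaway directory rather than /etc/systemd/system.
demo_dir="$(mktemp -d)/unit.service.d"
write_timeout_override "$demo_dir" 300
cat "$demo_dir/override.conf"
```

On a real host the drop-in would live under /etc/systemd/system/<unit>.d/, and systemd only picks it up after `systemctl daemon-reload`.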

@MatthewVernon thank you, I will try on ms-be2071 and let you know.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2071.codfw.wmnet with OS bullseye

@MatthewVernon looks like ms-be2071 is happy: the second reboot got the server back into the OS, so I'm just waiting for it to finish now.

@MatthewVernon puppet is failing with the error below on ms-be2071

Error: '/usr/sbin/mkfs -t xfs -m crc=1 -m finobt=0 -i size=512 /dev/disk/by-path/pci-0000:18:00.0-scsi-0:0:0:0-part1' returned 1 instead of one of [0]
Error: /Stage[main]/Profile::Swift::Storage::Configure_disks/Exec[mkfs-pci-0000:18:00.0-scsi-0:0:0:0]/returns: change from 'notrun' to ['0'] failed: '/usr/sbin/mkfs -t xfs -m crc=1 -m finobt=0 -i size=512 /dev/disk/by-path/pci-0000:18:00.0-scsi-0:0:0:0-part1' returned 1 instead of one of [0] (corrective)

Yeah, I saw similar on ms-be2070; the problem being the disk isn't entirely blank. I suspect

sudo wipefs -a /dev/disk/by-path/pci-0000:18:00.0-scsi-0:0:0:0-part1
sudo wipefs -a /dev/disk/by-path/pci-0000:18:00.0-scsi-0:0:0:0

will do the trick.

Thanks, that fixed the issue.
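
Since `wipefs -a` is destructive, it can help to print the exact commands for a disk before running them. The sketch below wraps the two commands from the comment above in a helper that only echoes them unless `DRY_RUN=0` is set; the `wipe_disk_signatures` name and the `DRY_RUN` variable are made up for this example:

```shell
# Print (or run) the wipefs commands for one data disk: partition first, then the disk.
# Hypothetical helper; DRY_RUN defaults to 1 so nothing is erased unless requested.
wipe_disk_signatures() {
    disk="$1"
    for target in "${disk}-part1" "${disk}"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "sudo wipefs -a ${target}"
        else
            sudo wipefs -a "${target}"
        fi
    done
}

# Dry run for the disk named in the puppet error above.
wipe_disk_signatures /dev/disk/by-path/pci-0000:18:00.0-scsi-0:0:0:0
```

Running `wipefs` on a device with no options lists the signatures it finds without erasing anything, which is a useful check before setting `DRY_RUN=0`.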

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2071.codfw.wmnet with OS bullseye completed:

  • ms-be2071 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202303071624_pt1979_1848725_ms-be2071.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2072.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2072.codfw.wmnet with OS bullseye completed:

  • ms-be2072 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202303071826_pt1979_1944866_ms-be2072.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2073.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2073.codfw.wmnet with OS bullseye completed:

  • ms-be2073 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202303072029_pt1979_2045062_ms-be2073.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active

@MatthewVernon all yours, thank you for getting the partman recipe.