⚓ T326352 Q3:rack/setup/install ms-be207[0-3]

Subject	Repo	Branch	Lines +/-
smart: override unit to make systemd wait longer	operations/puppet	production	+9 -0
hiera: use a regex to specify new-style storage hosts	operations/puppet	production	+5 -1
install_server: use newer partman setup for new ms backends	operations/puppet	production	+2 -2

RobH created this task.Jan 5 2023, 7:41 PM

Restricted Application added a subscriber: Aklapper.Jan 5 2023, 7:41 PM

RobH moved this task from Backlog to Racking Tasks on the ops-codfw board.Jan 5 2023, 7:41 PM

RobH mentioned this in Unknown Object (Task).

RobH added a parent task: Unknown Object (Task).

RobH unsubscribed.

Maintenance_bot added a project: SRE.Jan 5 2023, 7:45 PM

Jhancock.wm claimed this task.Feb 14 2023, 9:17 PM

Jhancock.wm reassigned this task from Jhancock.wm to Papaul.Feb 16 2023, 9:15 PM

Jhancock.wm updated the task description.

Jhancock.wm subscribed.

Papaul updated the task description.Feb 23 2023, 3:02 AM

MatthewVernon added a project: SRE-swift-storage.Feb 23 2023, 4:42 PM

I need the partman recipe for those nodes

Papaul updated the task description.Feb 28 2023, 6:04 PM

I am still waiting on the partman recipe.

Hi, this is on my TODO, but these backends are a low priority for us at the moment (compared to the frontends, which are a really high priority).

Sorry for the delay.

I understand that this is a low priority for you, but it is not for me, since I have to meet my install SLA of 30 days. I can remove the ops-codfw tag, remove myself from the task, and assign it to you to do the install when ready; I would then consider this done on my end.

Please don't do that; I'll try and get back to you before the end of the week.

Thank you.

Change 894009 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] install_server: use newer partman setup for new ms backends

https://gerrit.wikimedia.org/r/894009

gerritbot added a project: Patch-For-Review.Mar 3 2023, 9:49 AM

Change 894009 merged by MVernon:

[operations/puppet@production] install_server: use newer partman setup for new ms backends

https://gerrit.wikimedia.org/r/894009

@Papaul there's now a partman recipe for these new nodes (see the above merged CR).

Maintenance_bot removed a project: Patch-For-Review.Mar 3 2023, 10:11 AM

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2070.codfw.wmnet with OS bullseye

@MatthewVernon on ms-be2070 the OS install completed with no issues using the partman recipe, and the server booted into the OS. However, after the puppet run and the second reboot by the cookbook, the server is stuck in a continuous boot loop and no longer boots into the OS.

Before the OS install, the boot device was set to the first SSD. The server booted into the OS from it without any issue, but went into the continuous boot loop after the puppet run and the second reboot by the cookbook.

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2070.codfw.wmnet with OS bullseye executed with errors:

ms-be2070 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202303031622_pt1979_3127564_ms-be2070.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2070.codfw.wmnet with OS bullseye

On the second run I got:

Booting from Hard drive C:
GRUB

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2070.codfw.wmnet with OS bullseye executed with errors:

ms-be2070 (FAIL)
- Downtimed on Icinga/Alertmanager
- Unable to disable Puppet, the host may have been unreachable
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202303031858_pt1979_3201152_ms-be2070.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- The reimage failed, see the cookbook logs for the details

@jbond I dunno if you have any thoughts about this? I've had a look at the iDRAC, and it has one of the SSDs as the boot device to try (which I'd expect to work), and all of the drives are set to non-RAID (and the convert-disks cookbook indeed says nothing to do). And it seemingly boots OK as a vanilla OS install.

But then the first puppet run seemingly does something that hoses the boot process - console has

Booting from Hard drive C:
GRUB

and is entirely unresponsive.

A quick look through the puppet log isn't very enlightening :-/

@MatthewVernon I haven't looked in depth, but my best guess is that you need to set profile::swift::storage::disks_by_path: true for the specific hosts. This makes puppet use profile::swift::storage::configure_disks rather than swift::init_device, which I suspect is what is wiping out the partitions. FYI, you will also need to update modules/swift/files/{eqiad,codfw}-prod_hosts.yaml at some point.

d'oh, that seems likely, thank you!

[yes we'll need a new storage schema in hosts.yaml, but that's when actually bringing into service, which isn't really a priority ATM]
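As a rough illustration of the suggestion above, selecting the new hosts by hostname pattern can be sanity-checked from a shell. The regex, the variable name, and the function name below are assumptions made up for this sketch; the pattern actually used in hiera is not quoted in this task.

```shell
#!/bin/sh
# Hypothetical pattern covering ms-be2070..ms-be2073 in codfw only.
# This is NOT copied from the hiera change; it is an assumption.
NEW_STYLE_RE='^ms-be207[0-3]\.codfw\.wmnet$'

# Return 0 if the FQDN matches the pattern, i.e. would get
# profile::swift::storage::disks_by_path: true.
is_new_style_host() {
    printf '%s\n' "$1" | grep -Eq "$NEW_STYLE_RE"
}

for host in ms-be2069.codfw.wmnet ms-be2070.codfw.wmnet \
            ms-be2073.codfw.wmnet ms-be2074.codfw.wmnet; do
    if is_new_style_host "$host"; then
        echo "$host: profile::swift::storage::configure_disks"
    else
        echo "$host: swift::init_device"
    fi
done
```

Anchoring the pattern at both ends keeps it from matching neighbouring hostnames such as ms-be2074 by accident.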

Change 895141 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] hiera: use a regex to specify new-style storage hosts

https://gerrit.wikimedia.org/r/895141

gerritbot added a project: Patch-For-Review.Mar 7 2023, 10:33 AM

Change 895141 merged by MVernon:

[operations/puppet@production] hiera: use a regex to specify new-style storage hosts

https://gerrit.wikimedia.org/r/895141

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-be2070.codfw.wmnet with OS bullseye

Maintenance_bot removed a project: Patch-For-Review.Mar 7 2023, 12:11 PM

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-be2070.codfw.wmnet with OS bullseye completed:

ms-be2070 (WARN)
- Downtimed on Icinga/Alertmanager
- Unable to disable Puppet, the host may have been unreachable
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202303071144_mvernon_2594352_ms-be2070.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> active

@Papaul I've fixed the underlying problems and you'll see ms-be2070 reimaged to successful completion now, so hopefully that's you unblocked here.

...the icinga warning was systemd timing out waiting for smartd to start up (takes about 2 minutes).
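One common way to make systemd wait longer for a slow-starting unit is a drop-in that raises `TimeoutStartSec`. The sketch below writes such a drop-in into a scratch directory so it can be inspected safely. The unit name `smartd.service`, the file name `override.conf`, and the 300-second value are all assumptions for illustration, not the contents of the puppet change for this task.

```shell
#!/bin/sh
set -eu

# Assumptions for illustration only: unit name and timeout value.
UNIT="${UNIT:-smartd.service}"
TIMEOUT="${TIMEOUT:-300}"
# Write to a scratch directory by default; on a real host the drop-in
# directory would be /etc/systemd/system/$UNIT.d/.
DEST="${DEST:-$(mktemp -d)}"

# Create the drop-in file and print its path.
write_dropin() {
    mkdir -p "$DEST/$UNIT.d"
    cat > "$DEST/$UNIT.d/override.conf" <<EOF
[Service]
TimeoutStartSec=$TIMEOUT
EOF
    echo "$DEST/$UNIT.d/override.conf"
}

dropin=$(write_dropin)
cat "$dropin"
```

If a drop-in like this is installed by hand rather than through configuration management, `systemctl daemon-reload` is needed before systemd picks it up.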

Change 895309 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] smart: override unit to make systemd wait longer

https://gerrit.wikimedia.org/r/895309

gerritbot added a project: Patch-For-Review.Mar 7 2023, 3:30 PM

Change 895309 merged by MVernon:

[operations/puppet@production] smart: override unit to make systemd wait longer

https://gerrit.wikimedia.org/r/895309

Maintenance_bot removed a project: Patch-For-Review.Mar 7 2023, 4:11 PM

@MatthewVernon thank you, I will try on ms-be2071 and let you know.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2071.codfw.wmnet with OS bullseye

@MatthewVernon looks like ms-be2071 is happy: the second reboot got the server back into the OS, so I'm just waiting for it to finish now.

@MatthewVernon puppet is failing with the error below on ms-be2071

Error: '/usr/sbin/mkfs -t xfs -m crc=1 -m finobt=0 -i size=512 /dev/disk/by-path/pci-0000:18:00.0-scsi-0:0:0:0-part1' returned 1 instead of one of [0]
Error: /Stage[main]/Profile::Swift::Storage::Configure_disks/Exec[mkfs-pci-0000:18:00.0-scsi-0:0:0:0]/returns: change from 'notrun' to ['0'] failed: '/usr/sbin/mkfs -t xfs -m crc=1 -m finobt=0 -i size=512 /dev/disk/by-path/pci-0000:18:00.0-scsi-0:0:0:0-part1' returned 1 instead of one of [0] (corrective)

Yeah, I saw similar on ms-be2070; the problem being the disk isn't entirely blank. I suspect

sudo wipefs -a /dev/disk/by-path/pci-0000:18:00.0-scsi-0:0:0:0-part1
sudo wipefs -a /dev/disk/by-path/pci-0000:18:00.0-scsi-0:0:0:0

will do the trick.

Thanks, that fixed the issue.
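For the remaining hosts, the two wipefs commands above could be wrapped in a small helper. This is only a sketch: the function name `wipe_disk` and the `DRY_RUN` variable are made up for illustration, and it assumes the affected partition is always `-part1`, as in the error shown. By default it only prints the commands, since `wipefs -a` is destructive.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print (or run) the wipefs commands for one by-path disk.
# The partition is handled first, then the whole disk, matching the order
# of the commands suggested in this task.
wipe_disk() {
    disk="$1"
    for target in "${disk}-part1" "$disk"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run (default): show what would be executed.
            echo "sudo wipefs -a $target"
        else
            sudo wipefs -a "$target"
        fi
    done
}

wipe_disk /dev/disk/by-path/pci-0000:18:00.0-scsi-0:0:0:0
```

Setting `DRY_RUN=0` runs the commands for real; checking the printed device paths first is the safer habit.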

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2071.codfw.wmnet with OS bullseye completed:

ms-be2071 (PASS)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202303071624_pt1979_1848725_ms-be2071.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> active

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2072.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2072.codfw.wmnet with OS bullseye completed:

ms-be2072 (PASS)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202303071826_pt1979_1944866_ms-be2072.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> active

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2073.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2073.codfw.wmnet with OS bullseye completed:

ms-be2073 (PASS)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202303072029_pt1979_2045062_ms-be2073.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> active

Papaul updated the task description.Mar 7 2023, 9:42 PM

@MatthewVernon all yours, thank you for getting the partman recipe.

MatthewVernon mentioned this in T328872: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw.Mar 10 2023, 2:45 PM

MatthewVernon mentioned this in T335278: Bring ms-be207[0-3] into the rings.Apr 24 2023, 10:54 AM

MatthewVernon mentioned this in T342674: Q1:rack/setup/install moss-be200[34].Sep 27 2023, 2:11 PM

Q3:rack/setup/install ms-be207[0-3]
Closed, Resolved (Public)

Description

Hostname / Racking / Installation Details

Per host setup checklist

ms-be2070: Rack A4 - U9 - Port 8

ms-be2071: Rack B4 - U11 - Port 10

ms-be2072: Rack C4 - U9 - Port 8

ms-be2073: Rack D7 - U9 - Port 7

Related Objects
		Status	Subtype	Assigned	Task
					Unknown Object (Task)
		Resolved		Papaul	T326352 Q3:rack/setup/install ms-be207[0-3]
