
Q4: rack/setup/install stat1010
Closed, Resolved · Public · Estimated Story Points: 1

Assigned To
Authored By
RobH
May 2 2022, 9:40 PM

Description

This task will track the racking, setup, and OS installation of stat1010.eqiad.wmnet

Hostname / Racking / Installation Details

Hostnames: stat1010.eqiad.wmnet
Racking Proposal: Replacing stat1005 - no restriction on location
Networking/Subnet/VLAN/IP: Single 10G network connection - analytics vlan please
Partitioning/Raid: RAID 1 pair for O/S - RAID 10 on four disks for /srv - hardware RAID (see the layout sketch below)
OS Distro: Bullseye
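
For reference, the sketch below is one reading of the partitioning line above; it is an interpretation only, and the device roles are assumptions until the host is installed.

# Intended layout (sketch, assuming hardware RAID on the PERC controller):
#   RAID 1  (2 x flex-bay SSDs)    -> small device: /boot + LVM volume for /
#   RAID 10 (4 x 4 TB LFF disks)   -> large device mounted at /srv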

Per host setup checklist

stat1010:
  • - receive in system on procurement task T297736 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller). N.B. NIC firmware downgraded, see T304483 for details. BIOS and RAID firmware updated. No iDRAC update available.
  • - operations/puppet update - this should include updates to netboot.pp and site.pp with role(insetup); cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via the sre.hosts.reimage cookbook (example invocations sketched below).
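
For convenience, example invocations for the cookbook steps above, run from a cumin host. The sre.dns.netbox message argument is illustrative; the reimage command line matches the one used later in this task.

sudo cookbook sre.dns.netbox "Add mgmt and production DNS records for stat1010"   # message argument form is an assumption
sudo cookbook sre.hosts.reimage --os bullseye --new -t T307399 stat1010           # as run below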

Event Timeline

There are a very large number of changes, so older changes are hidden.
RobH mentioned this in Unknown Object (Task).
RobH unsubscribed.
RobH added a parent task: Unknown Object (Task). May 2 2022, 9:43 PM

stat1010 E1 u24 cableid # 20220077 port24

@BTullis please confirm whether the new rows E-F are OK for this host.

These should be ok for rows E/F if that suits the team.

Yes, rows E and F are fine for this, thanks.

@BTullis Can you confirm raid configuration and partman recipe to use please?

@Cmjohnson yes please, let's use hardware RAID for this. As @RobH suggested in the parent task, let's...

> use the flex bays as a raid1 for the OS data, and then the 4*4TB as raid10 for /srv

As for the partman recipe, I think that the dumpsdata100XH750.cfg would be perfect, apart from the fact that it mounts the RAID10 volume to /data instead of /srv.

I was wondering whether it would work to combine standard.cfg and hwraid-seconddev.cfg but I'm a bit confused over whether /dev/sda and /dev/sdb are swapped by the hw RAID. Have you any guidance here? Thanks.

@BTullis I don't have any real guidance for you other than all disks are controlled by the raid controller. Partman recipes are not a specialty of mine. pinging @RobH he may be able to provide more guidance.

@BTullis @RobH was working on this last week. /dev/sda and /dev/sdb are swapped by the controller regardless of how they were input. It appears a partman recipe change may need to be made. We are stuck at the moment. Once that is fixed, the OS should install without an issue.

Change 808545 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] Adding stat1009 and stat1010 to site.pp

https://gerrit.wikimedia.org/r/808545

Change 808545 abandoned by Cmjohnson:

[operations/puppet@production] Adding stat1009 and stat1010 to site.pp

Reason:

https://gerrit.wikimedia.org/r/808545

Change 808870 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add a new partman recipe for the new H750 based stat servers

https://gerrit.wikimedia.org/r/808870

Thanks @Cmjohnson for your replies. I've created a new partman recipe in https://gerrit.wikimedia.org/r/808870 although I haven't yet tested it.
If I'm right, it should be OK for both stat1009 and stat1010, despite the difference in the number of disks beneath /srv.

I'm happy to merge it and run the sre.hosts.reimage cookbook myself, if that suits you. Or I'm happy to leave it to you.
If you'd like me to run it to test it out, should I also be configuring the RAID over the serial console, or is that already done?

Apologies for all of the questions; I just don't want to step in where help isn't wanted, or to mess up your existing way of working. Thanks.

Change 808870 merged by Btullis:

[operations/puppet@production] Add a new partman recipe for the new H750 based stat servers

https://gerrit.wikimedia.org/r/808870

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host stat1010.eqiad.wmnet with OS bullseye

I've attempted to run the cookbook to install this server, but it's failing at the TFTP step, I believe.

image.png (313×717 px, 30 KB)

The preceding parts of the cookbook appeared to work, including writing the temporary DHCP fragment.

btullis@cumin1001:~$ sudo cookbook sre.hosts.reimage --os bullseye --new -t T307399 stat1010
==> ATTENTION: destructive action for host: stat1010
Are you sure to proceed?
Type "go" to proceed or "abort" to interrupt the execution
> go
Management Password:
Running IPMI command: ipmitool -I lanplus -H stat1010.mgmt.eqiad.wmnet -U root -E chassis power status
START - Cookbook sre.hosts.reimage for host stat1010.eqiad.wmnet with OS bullseye
Updated Phabricator task T307399
----- OUTPUT of 'puppet node clea...1010.eqiad.wmnet' -----
stat1010.eqiad.wmnet
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'puppet node clea...1010.eqiad.wmnet'.
----- OUTPUT of 'puppet node deac...1010.eqiad.wmnet' -----
Submitted 'deactivate node' for stat1010.eqiad.wmnet with UUID 0b59536f-bce2-451c-8d5b-9ecee7dfb3e4
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'puppet node deac...1010.eqiad.wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Removed from Puppet and PuppetDB if present
----- OUTPUT of 'puppet ca --disa...1010.eqiad.wmnet' -----
Nothing was deleted
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'puppet ca --disa...1010.eqiad.wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Deleted any existing Puppet certificate
Host stat1010.eqiad.wmnet already missing on Debmonitor
Removed from Debmonitor if present
----- OUTPUT of '/bin/echo 'Cmhvc...00/stat1010.conf' -----
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/echo 'Cmhvc...00/stat1010.conf'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
----- OUTPUT of '/usr/local/sbin/...cludes -r commit' -----
2022-06-27 13:34:54,990 [INFO] Writing file /etc/dhcp/automation/proxies/ttyS0-115200.conf
2022-06-27 13:34:54,991 [INFO] Writing file /etc/dhcp/automation/proxies/ttyS1-115200.conf
2022-06-27 13:34:54,992 [INFO] Writing file /etc/dhcp/automation/proxies/mgmt-eqiad.conf
2022-06-27 13:34:54,992 [INFO] Writing file /etc/dhcp/automation/proxies/mgmt-ulsfo.conf
2022-06-27 13:34:54,992 [INFO] Writing file /etc/dhcp/automation/proxies/mgmt-codfw.conf
2022-06-27 13:34:54,992 [INFO] Writing file /etc/dhcp/automation/proxies/mgmt-esams.conf
2022-06-27 13:34:54,993 [INFO] Writing file /etc/dhcp/automation/proxies/mgmt-eqsin.conf
2022-06-27 13:34:54,993 [INFO] Writing file /etc/dhcp/automation/proxies/mgmt-drmrs.conf
Internet Systems Consortium DHCP Server 4.4.1
Copyright 2004-2018 Internet Systems Consortium.
All rights reserved.
For info, please visit https://www.isc.org/software/dhcp/
Config file: /etc/dhcp/dhcpd.conf
Database file: /var/lib/dhcp/dhcpd.leases
PID file: /var/run/dhcpd.pid
2022-06-27 13:34:55,042 [INFO] dhcp config test passed!
2022-06-27 13:34:57,127 [INFO] reloaded isc-dhcp-server
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/local/sbin/...cludes -r commit'.

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host stat1010.eqiad.wmnet with OS bullseye executed with errors:

  • stat1010 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host stat1010.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host stat1010.eqiad.wmnet with OS bullseye executed with errors:

  • stat1010 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

It looks like this might be an instance of the bug identified in T304483: PXE boot NIC firmware regression
I'm downgrading the NIC firmware to the previous version and then I will run the cookbook again.

image.png (924×1 px, 240 KB)

> @BTullis @RobH was working on this last week. /dev/sda and /dev/sdb are swapped by the controller regardless of how they were input. It appears a partman recipe change may need to be made. We are stuck at the moment. Once that is fixed, the OS should install without an issue.

I was reviewing dumpsdata1007 and the new hw controller in T302937: dumpsdata1007 test installs and came across this task too. My understanding is that, this being a new controller, some differences are to be expected. However, before going in and specialising all of our recipes, deviating from the "sda is the OS drive for hw raid controllers" convention, could we check with Dell whether the ordering change is expected? I think that'll save us some time and head scratching down the line, thanks!

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host stat1010.eqiad.wmnet with OS bullseye

Hi @fgiunchedi - Thanks, I agree that it would be a pain to have to deviate from using /dev/sda for the primary OS drive.
At the moment I have copied the only previous H750 configuration that I knew about (that of dumpsdata1007), so that I had something to use when installing stat1010, but I will use this opportunity to try to swap the drive letters back.

So far I've verified that the RAID controller is set up as I would have imagined, with the 2 SFF flex bay drives configured as Disk Group #0 and the LFF drives as Disk Group #1.
There don't seem to be any other configurable parameters regarding ordering here.

image.png (463×656 px, 56 KB)

I'll try out the recipe that I've created to see if it behaves as per the other H750 host (reversed /dev/sda and /dev/sdb). If it does, then I'll look for any other ways to swap them back and/or get Dell involved.
We're not in a hurry for this particular server to come online, but if we need to identify a workaround then it would be better to do that sooner rather than later, because we have quite a few other H750 servers waiting to be installed.

Any other suggestions from anyone welcome.

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host stat1010.eqiad.wmnet with OS bullseye executed with errors:

  • stat1010 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host stat1010.eqiad.wmnet with OS bullseye

BTullis added a subscriber: Cmjohnson.

I'm just claiming this ticket and putting it on our team's workboard to reflect the fact that I'm working on it right now. Hope that's ok @Cmjohnson.

I'll update the ticket with any discoveries that I make around the disk ordering, partman recipes, and the cookbook.

Well, the partitioning recipe didn't do what we wanted anyway.

  • /dev/sda is the big RAID10 drive (as we suspected)
  • /dev/sdb is the smaller RAID1 drive (as we suspected)
  • Both the root and /srv volumes were created on /dev/sda (which was not expected).
  • grub didn't get correctly installed, so it didn't boot. I had to boot to a rescue environment to capture this:
root@stat1010:/# lsblk
NAME         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda            8:0    0   7.3T  0 disk
|-sda1         8:1    0   953M  0 part /boot
`-sda2         8:2    0   7.3T  0 part
  |-vg0-root 253:0    0  74.5G  0 lvm  /
  `-vg0-srv  253:1    0  93.1G  0 lvm
sdb            8:16   0 446.6G  0 disk
`-sdb1         8:17   0 446.6G  0 part
root@stat1010:/#

pvs shows that /dev/sdb was added to the existing vg0 volume group.

root@stat1010:/# pvs
  PV         VG  Fmt  Attr PSize   PFree
  /dev/sda2  vg0 lvm2 a--   <7.28t   7.11t
  /dev/sdb1  vg0 lvm2 a--  446.62g 446.62g
root@stat1010:/#

That is confirmed by vgs showing that it contains two PVs.

root@stat1010:/# vgs
  VG  #PV #LV #SN Attr   VSize VFree
  vg0   2   2   0 wz--n- 7.71t <7.55t

I'll investigate the install logs, to see what I can glean.
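
For reference, these are the places I'd expect to find the relevant logs; a sketch assuming the standard Debian installer layout rather than anything WMF-specific.

# From a shell inside the debian-installer environment:
less /var/log/syslog     # main installer log, includes the partman-auto decisions
less /var/log/partman    # partitioner-specific log
# After a completed install, the same logs are preserved on the target under
# /var/log/installer/ (e.g. /var/log/installer/syslog and /var/log/installer/partman).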

Thank you for the investigation @BTullis ! In case you haven't come across it yet: the partman/custom/kafka-jumbo.cfg configuration would be compatible with what you are trying to achieve here (modulo disk ordering!)

I'm going to try updating the RAID controller firmware, then the BIOS on stat1010, to see if either of these fixes the drive ordering issue.
These are the current versions of all firmware.

image.png (802×1 px, 390 KB)

The Dell website lists two urgent updates for these two components.
image.png (133×1 px, 22 KB)

Change 809602 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update the partman recipe for use with the new stat servers

https://gerrit.wikimedia.org/r/809602

The RAID controller firmware update did not make any difference, but thankfully @fgiunchedi has identified the cause of the reversed device names.
Essentially, we need to create the RAID devices in the reverse of the order in which we would like the operating system to discover them on boot.

From the Dell docs on this page...

image.png (267×883 px, 63 KB)

I will add some more notes to T297913: Confirm support of PERC 750 raid controller and move forward by applying the partman/custom/kafka-jumbo.cfg recipe to this host, as suggested.
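
If the reverse-order trick works, this is roughly what I would hope the installer sees; a hedged expectation based on the sizes above, not captured output.

lsblk -d -o NAME,SIZE
# NAME   SIZE
# sda  446.6G   <- RAID1 flex-bay pair, for the OS
# sdb    7.3T   <- RAID10 of the 4 x 4 TB disks, for /srv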

Change 809602 merged by Btullis:

[operations/puppet@production] Update the partman recipe for use with the new stat servers

https://gerrit.wikimedia.org/r/809602

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host stat1010.eqiad.wmnet with OS bullseye executed with errors:

  • stat1010 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

I'm updating the BIOS as well from version 2.13.3 to version 2.14.2 since it was marked as urgent by Dell.

image.png (120×698 px, 32 KB)

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host stat1010.eqiad.wmnet with OS bullseye

> The RAID controller firmware update did not make any difference, but thankfully @fgiunchedi has identified the cause of the reversed device names.
> Essentially, we need to create the RAID devices in the reverse of the order in which we would like the operating system to discover them on boot.
>
> I will add some more notes to T297913: Confirm support of PERC 750 raid controller and move forward by applying the partman/custom/kafka-jumbo.cfg recipe to this host, as suggested.

Neat, I found that it made the SSDs a higher ID # no matter what order I created them in for dumpsdata1007, though. Hopefully this works!

I tried the partman/custom/kafka-jumbo.cfg partman recipe on this host, but it didn't seem to be applied.

When I checked the log I saw this, which explains it:
Jun 29 15:11:09 partman-auto: Available disk space (8480010) too small for expert recipe (18900300); skipping

I believe that I can reduce the minimum size of the /srv volume definition to 6TB and try again.
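
For anyone unfamiliar with the recipe format, the hypothetical fragment below shows where that minimum lives: in a partman-auto expert_recipe stanza the three leading numbers are minimum size, priority, and maximum size in megabytes, so lowering the first number to roughly 6000000 (6 TB) should let the recipe fit on this host's RAID10 volume. This is an illustrative sketch, not the actual kafka-jumbo.cfg.

# Hypothetical fragment, not the real kafka-jumbo.cfg; filesystem choice is also an assumption.
d-i partman-auto/expert_recipe string       \
    srv ::                                  \
        6000000 8000 -1 ext4                \
            method{ format } format{ }      \
            use_filesystem{ } filesystem{ ext4 } \
            mountpoint{ /srv }              \
        .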

Change 809640 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Reduce the minimum size of /srv in the kafka-jumbo recipe

https://gerrit.wikimedia.org/r/809640

Change 809640 merged by Btullis:

[operations/puppet@production] Reduce the minimum size of /srv in the kafka-jumbo recipe

https://gerrit.wikimedia.org/r/809640

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host stat1010.eqiad.wmnet with OS bullseye completed:

  • stat1010 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206291412_btullis_760205_stat1010.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
EChetty set the point value for this task to 1. Jun 30 2022, 4:56 PM

I have manually moved all home directories from /home to /srv/home and created a symlink.
This matches the configuration of all of the other stat servers.
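
For the record, roughly what that manual step looks like; a sketch, not a transcript of the exact commands run.

# Copy existing home directories onto the large /srv volume, then leave a
# compatibility symlink at /home (illustrative only).
rsync -a /home/ /srv/home/
mv /home /home.orig      # keep the originals until the copy is verified
ln -s /srv/home /home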

We can't bring this host into service until we have completed this ticket: T310643: Build Bigtop 1.5 Hadoop packages for Bullseye

However, stat1010 is now ready and in the insetup role, so I think we can resolve this ticket.

BTullis updated the task description. (Show Details)