es1022 troubles with PXE
Closed, Resolved, Public

Description

When trying to reimage es1022 I found several things:

  • PXE boot didn't work until I selected it manually. The host attempted PXE, then (I guess, as the screen went blank) it timed out and booted from disk
  • After the reimage, when trying to remove the PXE override, it failed with:
Running IPMI command: ipmitool -I lanplus -H es1022.mgmt.eqiad.wmnet -U root -E chassis bootparam get 5
Exception raised while executing cookbook sre.hosts.reimage:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 234, in run
    raw_ret = runner.run()
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 487, in run
    self.ipmi.check_bootparams()
  File "/usr/lib/python3/dist-packages/spicerack/ipmi.py", line 125, in check_bootparams
    raise IpmiCheckError(f"Expected BIOS boot params in {IPMI_SAFE_BOOT_PARAMS} got: {param}")
spicerack.ipmi.IpmiCheckError: Expected BIOS boot params in ('0000000000', '8000020000') got: 0000020000
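
For readers unfamiliar with this check, here is a minimal sketch of what the failing comparison amounts to. It is not the actual spicerack code: the safe values are copied from the error message above, while the `Boot parameter data:` line format, the `BootParamsError` class and the helper names are assumptions made for illustration.

```python
import re

# Values copied from the error message in the traceback above.
IPMI_SAFE_BOOT_PARAMS = ("0000000000", "8000020000")

# Assumed shape of the relevant line in the output of
# `ipmitool ... chassis bootparam get 5`.
BOOT_PARAM_RE = re.compile(
    r"^\s*Boot parameter data:\s*([0-9a-fA-F]+)\s*$", re.MULTILINE
)


class BootParamsError(Exception):
    """Hypothetical stand-in for spicerack's IpmiCheckError."""


def parse_boot_param_data(output: str) -> str:
    """Extract the hex string from the 'Boot parameter data' line."""
    match = BOOT_PARAM_RE.search(output)
    if match is None:
        raise BootParamsError("No 'Boot parameter data' line found")
    return match.group(1).lower()


def check_bootparams(output: str) -> None:
    """Raise if the boot parameters are not one of the known-safe values."""
    param = parse_boot_param_data(output)
    if param not in IPMI_SAFE_BOOT_PARAMS:
        raise BootParamsError(
            f"Expected BIOS boot params in {IPMI_SAFE_BOOT_PARAMS} got: {param}"
        )


# Illustrative output only, trimmed to the lines the sketch cares about.
SAMPLE_OUTPUT = """Boot parameter version: 1
Boot parameter 5 is valid/unlocked
Boot parameter data: 0000020000
"""

if __name__ == "__main__":
    try:
        check_bootparams(SAMPLE_OUTPUT)
    except BootParamsError as error:
        print(error)
```

With the value reported in this task (`0000020000`), the sketch raises, because that value is not in the list of safe ones.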

If possible I would like to:

  • Get bios and firmware upgraded
  • Verify that PXE is configured to boot from the ethernet which has this MAC: b0:26:28:f5:35:dc

Please coordinate with us on a day/time so we can shut down MySQL.

Thanks

Event Timeline

Marostegui created this task.
Marostegui moved this task from Triage to Blocked on the DBA board.

@Marostegui Can we schedule this for me to power down tomorrow (20 Jan) 1530UTC?

Sounds good. I will leave the host powered down for you, thanks!

Mentioned in SAL (#wikimedia-operations) [2022-01-20T08:18:09Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool es1022 for on-site maintenance T299123', diff saved to https://phabricator.wikimedia.org/P18917 and previous config saved to /var/cache/conftool/dbconfig/20220120-081809-marostegui.json

Change 755630 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] es1022: Disable notifications

https://gerrit.wikimedia.org/r/755630

Change 755630 merged by Marostegui:

[operations/puppet@production] es1022: Disable notifications

https://gerrit.wikimedia.org/r/755630

Mentioned in SAL (#wikimedia-operations) [2022-01-20T13:55:47Z] <marostegui> Power off es1022 for onsite maintenance T299123

@Marostegui BIOS and network Firmware updated, this should fix your issue. I will leave task open until you confirm all is well.

Thanks Chris - I will try a reimage on Monday to see if it PXE boots fine.
I have started mysql now so it can start catching up

Mentioned in SAL (#wikimedia-operations) [2022-01-24T06:02:48Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool es1022 T299123', diff saved to https://phabricator.wikimedia.org/P18980 and previous config saved to /var/cache/conftool/dbconfig/20220124-060248-marostegui.json

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host es1022.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host es1022.eqiad.wmnet with OS bullseye completed:

  • es1022 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201240605_marostegui_6269_es1022.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

@Cmjohnson the host keeps ignoring PXE boot even when it attempts it from the boot menu. I'm not sure what the root cause could be. It only works if PXE is selected manually via iDRAC by pressing F12. Maybe something in the PXE boot card or in the BIOS options?
Let me know which date/time you'd like to try to troubleshoot this so I can make sure the host is off for you.

Marostegui, this actually seems like a script issue; you may want to ping @Volans

@Volans any thoughts? I can try this reimage with you if that'd help with the troubleshooting.

Mentioned in SAL (#wikimedia-operations) [2022-01-25T10:29:13Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool es1022 T299123', diff saved to https://phabricator.wikimedia.org/P19119 and previous config saved to /var/cache/conftool/dbconfig/20220125-102912-marostegui.json

Change 756956 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] es1022: Disable notifications

https://gerrit.wikimedia.org/r/756956

Change 756956 merged by Marostegui:

[operations/puppet@production] es1022: Disable notifications

https://gerrit.wikimedia.org/r/756956

Cookbook cookbooks.sre.hosts.reimage was started by volans@cumin1001 for host es1022.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by volans@cumin1001 for host es1022.eqiad.wmnet with OS bullseye executed with errors:

  • es1022 (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by volans@cumin1001 for host es1022.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by volans@cumin1001 for host es1022.eqiad.wmnet with OS bullseye executed with errors:

  • es1022 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by volans@cumin1001 for host es1022.eqiad.wmnet with OS bullseye

The issue here was that both NIC.Integrated.1-1-1 and NIC.Integrated.1-3-1 had LegacyBootProto set to PXE, while the host has a cable only on the third NIC (see Netbox).
I've used the Redfish API support in Spicerack to change that:

INFO:spicerack.redfish:Updated value for attribute NIC.Integrated.1-1-1 -> LegacyBootProto: PXE => NONE

I then triggered a new reimage, which is currently running. The automation for new hosts (the new sre.hosts.provision cookbook) should ensure that only one NIC has PXE set on newly provisioned hosts.
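
To illustrate the "only one NIC with PXE" rule, here is a minimal sketch that works out which NICs need their LegacyBootProto changed. It does not call Redfish or Spicerack. The function name `pxe_changes` and the dictionary layout are hypothetical, and the current values are assumed to have been read from the host already.

```python
from typing import Dict


def pxe_changes(boot_protos: Dict[str, str], cabled_nic: str) -> Dict[str, str]:
    """Return the LegacyBootProto changes needed so only cabled_nic uses PXE.

    boot_protos maps a NIC name to its current LegacyBootProto value.
    The result maps each NIC that needs a change to its new value.
    """
    if cabled_nic not in boot_protos:
        raise ValueError(f"Unknown NIC: {cabled_nic}")

    changes: Dict[str, str] = {}
    for nic, proto in sorted(boot_protos.items()):
        wanted = "PXE" if nic == cabled_nic else "NONE"
        if proto != wanted:
            changes[nic] = wanted
    return changes


# State described in this task: both NICs set to PXE, cable on the third NIC.
current = {
    "NIC.Integrated.1-1-1": "PXE",
    "NIC.Integrated.1-3-1": "PXE",
}

if __name__ == "__main__":
    for nic, new_value in pxe_changes(current, "NIC.Integrated.1-3-1").items():
        print(f"{nic} -> LegacyBootProto: {current[nic]} => {new_value}")
```

For the state described here, the only change is setting NIC.Integrated.1-1-1 to NONE, which matches the log line above.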

Cookbook cookbooks.sre.hosts.reimage started by volans@cumin1001 for host es1022.eqiad.wmnet with OS bullseye completed:

  • es1022 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201251534_volans_7987_es1022.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Root cause found, problem solved, host reimaged. Resolving.

Thank you @Volans for troubleshooting this issue!

I think es1020 is having the same issue - @Volans do you have the command for the fix somewhere?
That would make sense, given that es1020 and es1022 are from the same batch. I also expect es1021 (the current master) to have the same misconfiguration.

I've fixed es1020 manually and checked the other hosts in the same batch ( https://netbox.wikimedia.org/dcim/devices/?cf_ticket=T235659 ); apart from es1024, all of them have the same misconfiguration.
I tested whether I could just run the provisioning cookbook on them to fix them and found a couple of issues (one bug in Dell's reply to the Redfish API that I need to work around, and a use case not fully supported in the cookbook).
I'm fixing them so we can test the cookbook on the next hosts of the batch to be reimaged.

Change 757410 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.hosts.provision: disable PXE on all other NICs

https://gerrit.wikimedia.org/r/757410

Change 757410 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts.provision: disable PXE on all other NICs

https://gerrit.wikimedia.org/r/757410

Change 757435 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/spicerack@master] redfish: better support of parsing JSON responses

https://gerrit.wikimedia.org/r/757435

Change 757435 merged by jenkins-bot:

[operations/software/spicerack@master] redfish: better support of parsing JSON responses

https://gerrit.wikimedia.org/r/757435