
Q1:rack/setup/install thanos-be2005
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of thanos-be2005

Hostname / Racking / Installation Details

Hostnames: thanos-be2005
Racking Proposal: Cannot share with any other thanos-be
Networking Setup: # of Connections:1 Speed:10G. - VLAN:Private/Public/Other(Specify) : AAAA records:Y/N, Additional IP records (Cassandra)? Yes/No
Partitioning/Raid: HW Raid: N, Partman recipe and/or desired Raid Level: Set all disks to JBOD
OS Distro: Bullseye
Sub-team Technical Contact: @MatthewVernon

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

thanos-be2005
  • Receive in system on procurement task T368445 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook

Event Timeline

RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task).
RobH unsubscribed.

Change #1058092 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] site.pp: new thanos backends are safe to add to thanos::backend

https://gerrit.wikimedia.org/r/1058092

Change #1058092 merged by MVernon:

[operations/puppet@production] site.pp: new thanos backends are safe to add to thanos::backend

https://gerrit.wikimedia.org/r/1058092

Change #1087949 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] preseed - use ms-be_simple-efi.cfg for new SM Config-J nodes

https://gerrit.wikimedia.org/r/1087949

Change #1087949 merged by MVernon:

[operations/puppet@production] preseed - use ms-be_simple-efi.cfg for new SM Config-J nodes

https://gerrit.wikimedia.org/r/1087949

While provisioning I see the following error for the BMC NIC config:

Error: {'error': {'code': 'Base.v1_10_3.GeneralError', 'Message': 'A general error has occurred. See ExtendedInfo for more information.', '@Message.ExtendedInfo': [{'MessageId': 'Base.1.10.PropertyNotWritable', 'Severity': 'Warning', 'Resolution': 'Remove the property from the request body and resubmit the request if the operation failed.', 'Message': 'The property StaticNameServers is a read only property and cannot be assigned a value.', 'MessageArgs': ['StaticNameServers'], 'RelatedProperties': ['']}, {'MessageId': 'Base.1.10.PropertyNotWritable', 'Severity': 'Warning', 'Resolution': 'Remove the property from the request body and resubmit the request if the operation failed.', 'Message': 'The property StatelessAddressAutoConfig is a read only property and cannot be assigned a value.', 'MessageArgs': ['StatelessAddressAutoConfig'], 'RelatedProperties': ['']}, {'MessageId': 'Base.1.10.PropertyUnknown', 'Severity': 'Warning', 'Resolution': 'Remove the unknown property from the request body and resubmit the request if the operation failed.', 'Message': 'The property StatelessAddressAutoConfig is not in the list of valid properties for the resource.', 'MessageArgs': ['StatelessAddressAutoConfig'], 'RelatedProperties': ['StatelessAddressAutoConfig']}, {'MessageId': 'Base.1.10.PropertyUnknown', 'Severity': 'Warning', 'Resolution': 'Remove the unknown property from the request body and resubmit the request if the operation failed.', 'Message': 'The property StaticNameServers is not in the list of valid properties for the resource.', 'MessageArgs': ['StaticNameServers'], 'RelatedProperties': ['StaticNameServers']}]}}
Traceback (most recent call last):

The firmware version is 06.01.34, while on ms-be2088 (same config J spec, etc.) we have 06.04.04.
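For anyone reading the Redfish response above: each ExtendedInfo entry names the rejected property in MessageArgs, and the Resolution text suggests removing it from the request body and resubmitting. The following is only a minimal Python sketch of that reading, assuming the response has already been parsed into a dict shaped like the error above. The function names and the sample payload are hypothetical and are not taken from the provision cookbook.

```python
def rejected_properties(error: dict) -> set:
    """Collect property names the BMC reported as read-only or unknown."""
    rejected = set()
    for info in error.get("error", {}).get("@Message.ExtendedInfo", []):
        message_id = info.get("MessageId", "")
        if message_id.endswith(("PropertyNotWritable", "PropertyUnknown")):
            rejected.update(info.get("MessageArgs", []))
    return rejected


def strip_rejected(payload: dict, rejected: set) -> dict:
    """Return a copy of payload without the rejected keys (recursing into nested dicts)."""
    cleaned = {}
    for key, value in payload.items():
        if key in rejected:
            continue
        cleaned[key] = strip_rejected(value, rejected) if isinstance(value, dict) else value
    return cleaned


# Abbreviated form of the error shown above (two of the four entries).
sample_error = {
    "error": {
        "code": "Base.v1_10_3.GeneralError",
        "@Message.ExtendedInfo": [
            {"MessageId": "Base.1.10.PropertyNotWritable",
             "MessageArgs": ["StaticNameServers"]},
            {"MessageId": "Base.1.10.PropertyUnknown",
             "MessageArgs": ["StatelessAddressAutoConfig"]},
        ],
    }
}

# Hypothetical request body, for illustration only (192.0.2.53 is a documentation address).
sample_payload = {
    "HostName": "thanos-be2005",
    "StaticNameServers": ["192.0.2.53"],
    "StatelessAddressAutoConfig": {"IPv6AutoConfigEnabled": False},
}

rejected = rejected_properties(sample_error)
cleaned_payload = strip_rejected(sample_payload, rejected)
```

This only illustrates how to read the response. Whether dropping those properties is acceptable depends on what the provisioning step actually needs to set on the BMC.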

@Papaul @Jhancock.wm we need to upgrade the firmware on this node; I think we could use this directly instead of the custom one. I tried to connect to the BMC web UI in various ways but failed, since the BMC network config is the step that fails during provisioning. I also tried to do it by hand via DEL/Setup at boot, but for some reason I cannot modify any value (or my client prevents me from doing it remotely, not sure why).

If you are able to do it manually, let me know; after that the provision cookbook should start working again :)

I was able to upload the firmware via the Web UI, but the issue still seems to be present (new version: 01.04.08). I need to investigate further what the problem is, and/or ping Supermicro to give us the same firmware that they deployed to the ms-be nodes.

My bad, I had forgotten that we already got the firmware for config J from Supermicro (somehow I thought it was for the ganeti nodes; too many firmware images floating around :D). I uploaded it to thanos-be2005, followed by a factory reset. The issue is the same one that happened on backup1012: T371416#10216617

The new firmware is 06.04.04 and I see the calvin default password as well, but it still doesn't work (namely, same error as highlighted above).

Ok, I found the issue: last week I asked Jenn to turn off IPv6 for the BMC network to test whether that was the cause, but that was before upgrading the firmware. The network settings are preserved across the BMC reset, so the old test setting caused the last hiccup in running provision.

So, upgrading the firmware to the right one works! I am going to kick off reimage later on to see if UEFI works fine.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye executed with errors:

  • thanos-be2005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console thanos-be2005.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye executed with errors:

  • thanos-be2005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console thanos-be2005.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye executed with errors:

  • thanos-be2005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console thanos-be2005.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bookworm executed with errors:

  • thanos-be2005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console thanos-be2005.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bookworm executed with errors:

  • thanos-be2005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console thanos-be2005.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bookworm executed with errors:

  • thanos-be2005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console thanos-be2005.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bookworm executed with errors:

  • thanos-be2005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • XXX Forced UEFI regular Boot for next reboot
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console thanos-be2005.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bookworm

@elukey, unfortunately I observed the same double d-i installer issue with thanos-be2005. Grub's installer does not throw any errors, but upon reboot the Debian boot option is last in the boot order. I suspect that https://www.supermicro.com/support/faqs/faq.cfm?faq=27004 is still true, namely that you cannot affect the boot order from within Debian, or that httpboot messes something up in the BIOS. I submitted a question on that ticket, but we should go through our regular support channel as well. That said, I'm not sure that FAQ explains all the behaviors we have seen.

One possible workaround is forcing an HDD boot once, rather than continuously, following the Debian installer. I tested this and it seemed to work properly, but I would like to re-image again as well as test on a fresh node.
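In Redfish terms, the difference between a one-time and a persistent boot override is the BootSourceOverrideEnabled value on the ComputerSystem resource ("Once" versus "Continuous"), with BootSourceOverrideTarget selecting the device. The sketch below only builds the PATCH body for the two variants. The function name is hypothetical, this is not the reimage cookbook's code, and whether a given Supermicro firmware accepts every combination is an assumption to verify on the host.

```python
import json

# Values defined for BootSourceOverrideEnabled in the Redfish ComputerSystem schema.
VALID_ENABLED = {"Disabled", "Once", "Continuous"}


def boot_override_body(target: str = "Hdd", enabled: str = "Once") -> dict:
    """Build a Redfish PATCH body that overrides the boot source.

    enabled="Once" applies to the next boot only; "Continuous" persists
    until it is explicitly changed.
    """
    if enabled not in VALID_ENABLED:
        raise ValueError(f"unsupported BootSourceOverrideEnabled value: {enabled}")
    return {
        "Boot": {
            "BootSourceOverrideEnabled": enabled,
            "BootSourceOverrideTarget": target,
            "BootSourceOverrideMode": "UEFI",
        }
    }


once_body = boot_override_body()                      # HDD boot for the next reboot only
forever_body = boot_override_body(enabled="Continuous")  # HDD boot until changed
print(json.dumps(once_body, indent=2))
```

The body would be sent as a PATCH to the system resource (for example under /redfish/v1/Systems/), whose exact member path varies by vendor.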

I plan to do more testing on thanos-be2005.codfw.wmnet tomorrow.

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bookworm executed with errors:

  • thanos-be2005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • XXX HDD Forced UEFI regular Boot for next reboot
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411182319_jhathaway_2087797_thanos-be2005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console thanos-be2005.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye executed with errors:

  • thanos-be2005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • XXX FOREVER HDD Forced UEFI regular Boot for next reboot
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console thanos-be2005.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye executed with errors:

  • thanos-be2005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • XXX FOREVER HDD Forced UEFI regular Boot for next reboot
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console thanos-be2005.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye executed with errors:

  • thanos-be2005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console thanos-be2005.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye executed with errors:

  • thanos-be2005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console thanos-be2005.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye executed with errors:

  • thanos-be2005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console thanos-be2005.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye executed with errors:

  • thanos-be2005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console thanos-be2005.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye completed:

  • thanos-be2005 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411202147_jhathaway_2554797_thanos-be2005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye completed:

  • thanos-be2005 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411202227_jhathaway_2560287_thanos-be2005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
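
The run summaries above all share one shape: a host line with a status in parentheses, followed by indented step bullets. When comparing many attempts, it can help to reduce each summary to its status and the last step reached. A rough Python sketch, assuming the bullet layout shown in this task; the function name and the sample text are illustrative only.

```python
import re

# "• thanos-be2005 (FAIL)": a single token followed by an upper-case status.
HOST_LINE = re.compile(r"^\s*•\s*(?P<host>\S+)\s+\((?P<status>[A-Z]+)\)\s*$")
# Any other bullet is treated as a step.
STEP_LINE = re.compile(r"^\s*•\s*(?P<step>.+?)\s*$")


def summarize(report: str) -> dict:
    """Extract host, status, and the ordered list of steps from one run summary."""
    host = status = None
    steps = []
    for line in report.splitlines():
        match = HOST_LINE.match(line)
        if match and host is None:
            host, status = match["host"], match["status"]
            continue
        match = STEP_LINE.match(line)
        if match and host is not None:
            steps.append(match["step"])
    return {"host": host, "status": status, "steps": steps}


sample_report = """
  • thanos-be2005 (FAIL)
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details.
"""

summary = summarize(sample_report)
# The step just before the final failure notice is the last one that succeeded.
last_completed = summary["steps"][-2] if summary["status"] == "FAIL" else summary["steps"][-1]
```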

@elukey thanos-be2005 is now re-imaging without any user intervention. It wasn't quite as easy as just running the re-image script twice; I still had problems actually booting into Debian, but I lost track of the error states. Perhaps the cause was artifacts of my earlier testing. Hopefully your re-imaging doesn't have any issues.

Change #1093884 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] thanos: add new backends to profile::thanos::swift::backends

https://gerrit.wikimedia.org/r/1093884

Change #1093885 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] thanos: storage schema for larger disks_by_path backends, add 2

https://gerrit.wikimedia.org/r/1093885

Jhancock.wm claimed this task.

Change #1093884 merged by MVernon:

[operations/puppet@production] thanos: add new backends to profile::thanos::swift::backends

https://gerrit.wikimedia.org/r/1093884

Host rebooted by mvernon@cumin2002 with reason: prep for prod

Host rebooted by mvernon@cumin2002 with reason: prep for prod

Change #1093885 merged by MVernon:

[operations/puppet@production] thanos: storage schema for larger disks_by_path backends, add 2

https://gerrit.wikimedia.org/r/1093885