Q2:rack/setup/install cloudelastic101[12]
Closed, Resolved, Public

Description

This task will track the racking, setup, and OS installation of cloudelastic101[12]

Hostname / Racking / Installation Details

Hostnames: cloudelastic101[12].eqiad.wmnet
Racking Proposal: Avoid sharing rows with each other and cloudelastic1007-1010, if possible.
Networking Setup: 1 10G connection, private VLAN
OS Distro: Bullseye
Sub-team Technical Contact: Brian King (IRC: inflatador) or Ryan Kemper (IRC: ryankemper)
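
The bracket shorthand used for the hostnames above (cloudelastic101[12]) stands for one host per character inside the brackets. A minimal sketch of that expansion follows; the function name expand_hosts is hypothetical, and it handles only a single bracket group with literal characters or simple ranges such as [5-6]. It is not the host-selection syntax of any particular tool.

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one [..] group, e.g. 'cloudelastic101[12]' -> two hostnames."""
    match = re.search(r"\[([^\]]+)\]", pattern)
    if match is None:
        # No bracket group: the pattern is already a single hostname.
        return [pattern]

    body = match.group(1)
    chars = []
    i = 0
    while i < len(body):
        # A range such as '5-6' expands to every character from '5' to '6'.
        if i + 2 < len(body) and body[i + 1] == "-":
            start, end = body[i], body[i + 2]
            chars.extend(chr(c) for c in range(ord(start), ord(end) + 1))
            i += 3
        else:
            chars.append(body[i])
            i += 1

    prefix, suffix = pattern[: match.start()], pattern[match.end():]
    return [f"{prefix}{c}{suffix}" for c in chars]


for host in expand_hosts("cloudelastic101[12].eqiad.wmnet"):
    print(host)
```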

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

cloudelastic1011
  • Receive in system on procurement task T376166 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
cloudelastic1012
  • Receive in system on procurement task T376166 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook

Event Timeline

RobH assigned this task to bking.
RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH edited subscribers, added: RKemper; removed: RobH.

Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the new servers. This is due to the majority of DC Ops not having root/merge puppet rights.

Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and add the new servers to preseed.yaml for partition info.

If possible, please reference this task number in your patch set, so it is clear when it is complete. Once complete, just un-assign yourself from this task (leaving no assignee); once the hardware arrives, on-sites will claim this task for racking and setup.

Thank you!

Change #1084253 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] search platform: add config for new search platform hosts

https://gerrit.wikimedia.org/r/1084253

Change #1084253 merged by Bking:

[operations/puppet@production] search platform: add config for new search platform hosts

https://gerrit.wikimedia.org/r/1084253

bking removed bking as the assignee of this task. Oct 29 2024, 10:03 PM
bking added subscribers: RobH, bking.

CR for new hosts merged per @RobH 's instructions above. Unassigning...

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudelastic1011.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudelastic1012.eqiad.wmnet with OS bullseye

Jclark-ctr added subscribers: elukey, Jclark-ctr.

@elukey Hey Luca, these two are failing to provision; they are custom configs.

The error seems to be related to a specific network card:

PATCH https://10.65.4.200/redfish/v1/Systems/1/Bios returned HTTP 400
Response payload: {'error': {'code': 'Base.v1_10_3.GeneralError', 'message': 'A general error has occurred. See ExtendedInfo for more information.', '@Message.ExtendedInfo': [{'MessageId': 'Base.1.10.PropertyValueTypeError', 'Severity': 'Warning', 'Resolution': 'Correct the value for the property in the request body and resubmit the request if the operation failed.', 'Message': "The value 'null' for the property P1_AIOMAOC_ATGC_i2TMLAN1OPROM is of a different type than the property can accept.", 'MessageArgs': ['null', 'P1_AIOMAOC_ATGC_i2TMLAN1OPROM'], 'RelatedProperties': ['P1_AIOMAOC_ATGC_i2TMLAN1OPROM']}]}}

Note: for some reason MessageArgs shows null, but in reality we pass Legacy.

Running the cookbook with uefi makes everything work; we need to figure out what is special about P1_AIOMAOC_ATGC_i2TMLAN1OPROM.
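
For reference, the rejected BIOS attribute can be pulled out of a Redfish error body like the one above by walking @Message.ExtendedInfo. This is a minimal sketch, not code from the provision cookbook: the function name rejected_properties is hypothetical, and the payload is an abbreviated copy of the response quoted in this task.

```python
def rejected_properties(payload: dict) -> list[tuple[str, str]]:
    """Return (MessageId, property) pairs named in a Redfish error payload."""
    details = payload.get("error", {}).get("@Message.ExtendedInfo", [])
    found = []
    for entry in details:
        # RelatedProperties lists the attributes the BMC refused to accept.
        for prop in entry.get("RelatedProperties", []):
            found.append((entry.get("MessageId", ""), prop))
    return found


# Abbreviated copy of the HTTP 400 response payload quoted above.
payload = {
    "error": {
        "code": "Base.v1_10_3.GeneralError",
        "message": "A general error has occurred. See ExtendedInfo for more information.",
        "@Message.ExtendedInfo": [
            {
                "MessageId": "Base.1.10.PropertyValueTypeError",
                "Severity": "Warning",
                "MessageArgs": ["null", "P1_AIOMAOC_ATGC_i2TMLAN1OPROM"],
                "RelatedProperties": ["P1_AIOMAOC_ATGC_i2TMLAN1OPROM"],
            }
        ],
    }
}

for message_id, prop in rejected_properties(payload):
    print(f"{message_id}: {prop}")
```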

@Jclark-ctr if those are not urgent I'd ask you to leave them to me for some tests, I'll ping you when I find something!

I am reviewing the quote for these nodes to figure out what the item is. AFAICS it is a 10G network card, but I also see a 25G-capable one (with two ports, currently in use). Is that the case?

From what I can see in the BIOS config, we have two Intel-based 10G ports and two Supermicro-based 25G ports, and the link is present on the first of the 25G ports. If I enter any of the Intel-based 10G port menus I only see "UEFI Driver" "XXXX" and nothing else related to Legacy etc., so I suspect that these NICs don't work in a different mode.

Change #1101457 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.hosts.provision: add uefi only devices for Supermicro

https://gerrit.wikimedia.org/r/1101457

Change #1101457 merged by Elukey:

[operations/cookbooks@master] sre.hosts.provision: add uefi only devices for Supermicro

https://gerrit.wikimedia.org/r/1101457

@Jclark-ctr @bking I updated the provision cookbook to support this case, but the TL;DR is that we may need to use UEFI to avoid weird configurations. In this case we can keep legacy, but I am not sure if we'll ever be able to PXE boot from one of the 10G interfaces (they are not used atm, we use the 25G ones, and we'll probably keep using those in the future, but..).

@elukey I'm fine with focusing our efforts on UEFI, it seems like the best use of our time. Ping me in IRC if I can do anything to help test.

@elukey the 10g card is copper rj45 and not in use. AOC-ATGC-i2TM. The 10g port is connected using DAC cable to AOC-A25G-b2SM.

@Jclark-ctr @bking given that the 10G card will never be used (RJ45, copper, etc.) we can go ahead without UEFI, since that NIC will not be configured and it will not matter. The provision cookbook should run just fine, warning about the UEFI-only device for awareness. We can skip UEFI for the moment, it doesn't seem really needed, let's see how it goes.
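
The behaviour described above (keep legacy mode, skip the UEFI-only NIC, and warn about it) could look roughly like the sketch below. This is an illustration under stated assumptions, not the change merged in 1101457: the names UEFI_ONLY_ATTRIBUTES and option_rom_settings and the attribute EXAMPLE_25G_LAN1_OPROM are hypothetical, and only P1_AIOMAOC_ATGC_i2TMLAN1OPROM and the value Legacy come from this task.

```python
import logging

logger = logging.getLogger(__name__)

# Assumption: the attribute rejected in the Redfish error is treated as UEFI-only.
UEFI_ONLY_ATTRIBUTES = {"P1_AIOMAOC_ATGC_i2TMLAN1OPROM"}


def option_rom_settings(attributes: list[str], mode: str = "Legacy") -> dict[str, str]:
    """Build the option ROM settings to send, skipping UEFI-only devices in Legacy mode."""
    settings = {}
    for name in attributes:
        if mode == "Legacy" and name in UEFI_ONLY_ATTRIBUTES:
            # The device cannot be set to Legacy, so leave it untouched and warn.
            logger.warning("Skipping UEFI-only device %s", name)
            continue
        settings[name] = mode
    return settings


# EXAMPLE_25G_LAN1_OPROM is a made-up attribute name for the NIC that is in use.
print(option_rom_settings(["P1_AIOMAOC_ATGC_i2TMLAN1OPROM", "EXAMPLE_25G_LAN1_OPROM"]))
```

With this shape the unused 10G card no longer blocks provisioning, while the warning keeps the limitation visible in case someone later tries to PXE boot from it.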

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudelastic1011.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudelastic1012.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudelastic1012.eqiad.wmnet with OS bullseye executed with errors:

  • cloudelastic1012 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudelastic1012.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudelastic1011.eqiad.wmnet with OS bullseye executed with errors:

  • cloudelastic1011 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudelastic1011.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Change #1103367 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] WIP: Add partman recipe for raid 0 with EFI

https://gerrit.wikimedia.org/r/1103367

Change #1103381 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic10[12]: add hosts to new efi-based partman recipe

https://gerrit.wikimedia.org/r/1103381

Change #1103367 merged by Bking:

[operations/puppet@production] Add partman recipe for raid 0 with EFI

https://gerrit.wikimedia.org/r/1103367

Change #1103381 merged by Bking:

[operations/puppet@production] cloudelastic10[12]: add hosts to new efi-based partman recipe

https://gerrit.wikimedia.org/r/1103381

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1011.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1011.eqiad.wmnet with OS bullseye executed with errors:

  • cloudelastic1011 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudelastic1011.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1011.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1011.eqiad.wmnet with OS bullseye executed with errors:

  • cloudelastic1011 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudelastic1011.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1012.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1012.eqiad.wmnet with OS bullseye executed with errors:

  • cloudelastic1012 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudelastic1012.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1012.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1012.eqiad.wmnet with OS bullseye completed:

  • cloudelastic1012 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202412162119_bking_307948_cloudelastic1012.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
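
When comparing the reimage attempts above, the useful detail is the last step each run reached before failing. A minimal sketch of extracting that from a report pasted as text follows; the function name summarize_reimage is hypothetical, and the parsing assumes the bullet layout shown in this task rather than any documented output format.

```python
import re

# Matches the per-host header line, e.g. "  • cloudelastic1011 (FAIL)".
HEADER = re.compile(r"^\s*•\s*(\S+) \((PASS|FAIL)\)\s*$")


def summarize_reimage(report: str):
    """Return (host, status, last completed step) for one reimage report."""
    host = status = None
    steps = []
    for line in report.splitlines():
        match = HEADER.match(line)
        if match:
            host, status = match.groups()
            continue
        text = line.strip().lstrip("•").strip()
        if host and text:
            steps.append(text)
    # On failure the final bullet is the generic error message, not a real step.
    if status == "FAIL" and steps and steps[-1].startswith("The reimage failed"):
        steps.pop()
    return host, status, steps[-1] if steps else None


report = """\
  • cloudelastic1011 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details.
"""
print(summarize_reimage(report))
```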

Looks like these are ready to go into service, should they be reassigned to @bking?

(This drive-by brought to you by me looking through the WMCS procurement budget and trying to make sure everything is on track)

bking changed the task status from Open to In Progress. Jan 8 2025, 2:47 PM
bking claimed this task.

@Andrew, you are correct. I just assigned this to myself and we'll work on getting these hosts into production.

Change #1109483 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: add cloudelastic10[12] into production

https://gerrit.wikimedia.org/r/1109483

We are currently getting a PCC error for cloudelastic1011, as its Netbox status is incorrectly set to 'planned' (should be 'active').

Mentioned in SAL (#wikimedia-operations) [2025-01-09T21:43:02Z] <bking@cumin2002> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Sync cloudelastic1011 status change after Netbox update - bking@cumin2002 - T378368"

Mentioned in SAL (#wikimedia-operations) [2025-01-09T21:44:24Z] <bking@cumin2002> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Sync cloudelastic1011 status change after Netbox update - bking@cumin2002 - T378368"

Mentioned in SAL (#wikimedia-operations) [2025-01-09T23:00:09Z] <inflatador> bking@puppetserver1001:~$ sudo /usr/local/sbin/puppet-facts-upload --proxy http://webproxy.eqiad.wmnet:8080 T378368

Mentioned in SAL (#wikimedia-operations) [2025-01-09T23:00:23Z] <inflatador> bking@pcc-db1002.puppet-diffs.eqiad1.wikimedia.cloud sudo -u jenkins-deploy /usr/local/sbin/pcc_facts_processor T378368

Change #1109483 merged by Bking:

[operations/puppet@production] cloudelastic: add cloudelastic10[12] into production

https://gerrit.wikimedia.org/r/1109483

bking closed this task as Resolved. Jan 10 2025, 10:04 PM

I'm happy to report that cloudelastic1011 and cloudelastic1012 are running in production. We will decom the hosts they replaced (cloudelastic100[5-6]) in T380937. Closing...

Change #1110862 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: remove cloudelastic100[56] from conftool, add 101[12]

https://gerrit.wikimedia.org/r/1110862