Q1:rack/setup/install dse-k8s-worker2003
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of dse-k8s-worker2003.

Hostname / Racking / Installation Details

Hostnames: dse-k8s-worker2003
Racking Proposal: Any row except C or D (other dse-k8s-workers are in those rows)
Networking Setup: 1x 10G on private VLAN
OS Distro: Bookworm (default unless otherwise specified)
Boot Method: UEFI
Sub-team Technical Contact: @bking (inflatador on IRC) or @BTullis (btullis on IRC)

Per host setup checklist

dse-k8s-worker2003:
  • Receive in system on procurement task T399104 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
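
The ordering in the checklist above can be sketched as follows, assuming the automated steps are meant to be followed top to bottom. The cookbook names come from the checklist; the `STEPS` list and the `next_step` helper are hypothetical illustrations and do not invoke any real tooling.

```python
# Automated checklist steps in the order given in the task.
# The data structure itself is a hypothetical illustration.
STEPS = [
    "netbox: provision server network attributes",
    "sre.dns.netbox",
    "sre.hosts.provision",
    "sre.hardware.upgrade-firmware",
    "puppet: update preseed.yaml and site.pp",
    "sre.hosts.reimage",
]


def next_step(done):
    """Return the first step not yet completed, or None when all are done.

    Raises ValueError if a later step was marked done while an earlier one
    was skipped, on the assumption that the checklist is followed in order.
    """
    pending = [s for s in STEPS if s not in done]
    if not pending:
        return None
    first_pending = STEPS.index(pending[0])
    out_of_order = [
        s for s in done if s in STEPS and STEPS.index(s) > first_pending
    ]
    if out_of_order:
        raise ValueError(f"{out_of_order[0]!r} ran before {pending[0]!r}")
    return pending[0]
```

For example, once only the Netbox script has been recorded as done, `next_step` returns `"sre.dns.netbox"`, which matches the "immediately run" note in the checklist.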

Event Timeline

RobH mentioned this in Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-codfw board.
RobH unsubscribed.

@bking or @BTullis,

Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and add the new servers to preseed.yaml for partition info.

If possible, please reference this task number in your patch set, so it is clear when it is complete. Once complete, just un-assign yourself (leaving no assignee) from this task; once the hardware arrives, on-site engineers will claim this task for racking and setup. Please don't re-subscribe me to this task unless there is a direct question for me.

Thank you!

RobH added a parent task: Unknown Object (Task). Jul 16 2025, 10:23 PM

Change #1170279 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] site: assign the insetup::data_platform_ferm role to dse-k8s-worker1014

https://gerrit.wikimedia.org/r/1170279

Change #1170279 merged by Brouberol:

[operations/puppet@production] site: assign the insetup::data_platform_ferm role to dse-k8s-worker1014

https://gerrit.wikimedia.org/r/1170279

Note to self: these must be racked in the DH7 cage.

Change #1187788 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] preseed: fix partman config for dse-k8s-worker2003

https://gerrit.wikimedia.org/r/1187788

Change #1187789 had a related patch set uploaded (by Jclark-ctr; author: jclark):

[operations/puppet@production] Add dse-k8s-worker2003 to preseed.yaml

https://gerrit.wikimedia.org/r/1187789

Change #1187789 abandoned by Jclark-ctr:

[operations/puppet@production] Add dse-k8s-worker2003 to preseed.yaml

Reason:

Duplicate to 1187788

https://gerrit.wikimedia.org/r/1187789

Change #1187788 merged by Elukey:

[operations/puppet@production] preseed: fix partman config for dse-k8s-worker2003

https://gerrit.wikimedia.org/r/1187788

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host dse-k8s-worker2003.codfw.wmnet with OS bookworm

[28/50, retrying in 84.00s] Attempt to run 'cookbooks.sre.hosts.reimage.ReimageRunner._populate_puppetdb.<locals>.poll_puppetdb' raised: Nagios_host resource with title dse-k8s-worker2003 not found yet

This is going to fail. Possibly an issue with Puppet? Need to investigate.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host dse-k8s-worker2003.codfw.wmnet with OS bookworm executed with errors:

  • dse-k8s-worker2003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console dse-k8s-worker2003.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host dse-k8s-worker2003.codfw.wmnet with OS bookworm

@bking can you check the site.pp and preseed.yaml files for accuracy? The reimage cookbook is acting like there's a possible misconfig there. Thank you!

@Jhancock.wm I think we had a similar ticket for the same hardware in EQIAD (T399779). I'll take a look there and see if we missed anything.

Change #1191441 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] Add dse-k8s-worker2003 to site.pp

https://gerrit.wikimedia.org/r/1191441

Change #1191441 merged by Bking:

[operations/puppet@production] Add dse-k8s-worker2003 to site.pp

https://gerrit.wikimedia.org/r/1191441

@Jhancock.wm it looks like the host was missing from site.pp. I've added it, and you should be good to go now. Feel free to ping me on IRC (inflatador) if you run into any other issues.
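
A quick way to spot a host that no node definition covers can be sketched like this, assuming node definitions are written as regular expressions (which Puppet supports). The patterns below are hypothetical stand-ins and are not copied from the real site.pp; the `matching_patterns` helper is likewise an illustration.

```python
import re

# Hypothetical node patterns, standing in for what site.pp might contain.
NODE_PATTERNS = [
    r"^dse-k8s-worker200[12]\.codfw\.wmnet$",
    r"^dse-k8s-worker100[1-9]\.eqiad\.wmnet$",
]


def matching_patterns(fqdn, patterns=NODE_PATTERNS):
    """Return every pattern that matches the fully qualified host name."""
    return [p for p in patterns if re.search(p, fqdn)]
```

With these example patterns, `matching_patterns("dse-k8s-worker2003.codfw.wmnet")` returns an empty list, which is the situation described in the comment above: the host name falls outside every existing node definition.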

bking updated Other Assignee, added: bking.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host dse-k8s-worker2003.codfw.wmnet with OS bookworm executed with errors:

  • dse-k8s-worker2003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console dse-k8s-worker2003.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host dse-k8s-worker2003.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host dse-k8s-worker2003.codfw.wmnet with OS bookworm executed with errors:

  • dse-k8s-worker2003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console dse-k8s-worker2003.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host dse-k8s-worker2003.codfw.wmnet with OS bookworm

@Papaul Hey, we've gotten the preseed and site.pp files configured correctly as far as I can tell, but we're still getting this from the reimage cookbook. Could you take a look for me when you have time? Thanks.

[46/50, retrying in 138.00s] Attempt to run 'cookbooks.sre.hosts.reimage.ReimageRunner._populate_puppetdb.<locals>.poll_puppetdb' raised: Nagios_host resource with title dse-k8s-worker2003 not found yet
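
The two retry lines quoted in this task ("[28/50, retrying in 84.00s]" and "[46/50, retrying in 138.00s]") are consistent with a linear backoff of 3 seconds per attempt, since 28 × 3 = 84 and 46 × 3 = 138. A minimal sketch of that polling pattern follows. The function name, the `check` callback and the use of `LookupError` are assumptions made for illustration; this is not the cookbook's actual code.

```python
import time


def poll_with_linear_backoff(check, tries=50, step_seconds=3.0, sleep=time.sleep):
    """Call check() until it stops raising LookupError or tries run out.

    The wait before the next attempt grows linearly: attempt * step_seconds.
    The last failure is re-raised once all tries are used up.
    """
    for attempt in range(1, tries + 1):
        try:
            return check()
        except LookupError as exc:
            if attempt == tries:
                raise
            delay = attempt * step_seconds
            print(f"[{attempt}/{tries}, retrying in {delay:.2f}s] raised: {exc}")
            sleep(delay)
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real waiting. Under this scheme, a resource that never appears means the poll only gives up after all 50 attempts, which fits the long-running failures reported here.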

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host dse-k8s-worker2003.codfw.wmnet with OS bookworm executed with errors:

  • dse-k8s-worker2003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console dse-k8s-worker2003.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host dse-k8s-worker2003.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host dse-k8s-worker2003.codfw.wmnet with OS bookworm executed with errors:

  • dse-k8s-worker2003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console dse-k8s-worker2003.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

@Jhancock.wm see below for why the server is failing. You have 2 options: change the role in site.pp to the insetup role to finish the install, or have the server owner fix the Puppet error below.

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Operator '[]' is not applicable to an Undef Value. (file: /srv/puppet_code/environments/production/modules/profile/manifests/kubernetes/node.pp, line: 140, column: 15) on node dse-k8s-worker2003.codfw.wmnet
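
The message says that `[]` was applied to an undefined value at line 140 of node.pp. One common way to get there is a lookup that returns nothing for a new host, followed by indexing into the result. The manifest itself is not shown in this task, so the following is only a Python analogy of that failure mode and of a guarded alternative; the `HOST_SETTINGS` data, the key names and both helper functions are hypothetical.

```python
# Hypothetical per-host data; the new host has no entry yet.
HOST_SETTINGS = {
    "existing-host.example": {"zone": "row-a"},
}


def zone_unguarded(fqdn):
    # .get() returns None for an unknown host, and indexing None raises
    # TypeError: the Python counterpart of "[]" on an undef value.
    return HOST_SETTINGS.get(fqdn)["zone"]


def zone_guarded(fqdn, default=None):
    # Check for the missing entry before indexing into it.
    settings = HOST_SETTINGS.get(fqdn)
    if settings is None:
        return default
    return settings["zone"]
```

In this analogy the fix is either to add the missing data for the host or to guard the access. Which of the two applies to node.pp depends on what line 140 actually reads, which this task does not show.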

Can you help me out with what Papaul pointed out when you get in? Thanks!

Grabbing the ticket per IRC conversation with @Jhancock.wm

Change #1194231 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] dse-k8s-worker2003: move back to insetup

https://gerrit.wikimedia.org/r/1194231

Change #1194231 merged by Bking:

[operations/puppet@production] dse-k8s-worker2003: move back to insetup

https://gerrit.wikimedia.org/r/1194231

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host dse-k8s-worker2003.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host dse-k8s-worker2003.codfw.wmnet with OS bookworm completed:

  • dse-k8s-worker2003 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510071557_bking_529588_dse-k8s-worker2003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Failed to run the sre.puppet.sync-netbox-hiera cookbook, run it manually

Mentioned in SAL (#wikimedia-operations) [2025-10-07T22:11:48Z] <bking@cumin2002> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "running per cookbook error suggestion - bking@cumin2002 - T399778"

Mentioned in SAL (#wikimedia-operations) [2025-10-07T22:12:08Z] <bking@cumin2002> END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "running per cookbook error suggestion - bking@cumin2002 - T399778"

Change #1194305 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] dse-k8s-worker2003: return to production role

https://gerrit.wikimedia.org/r/1194305

I verified that this host is ready for production, so no need to worry about the above cookbook failure.

Change #1194305 merged by Bking:

[operations/puppet@production] dse-k8s-worker2003: return to production role

https://gerrit.wikimedia.org/r/1194305

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host dse-k8s-worker2003.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host dse-k8s-worker2003.codfw.wmnet with OS bookworm completed:

  • dse-k8s-worker2003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510092023_bking_2138859_dse-k8s-worker2003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

This host has been reimaged as a dse-k8s worker, so I'm closing this out. Work to make this host ready for production continues in T406985...