Q1:rack/setup/install deploy2003
Closed, Resolved | Public

Description

This task will track the racking, setup, and OS installation of deploy2003

Hostname / Racking / Installation Details

Hostnames: deploy2003
Racking Proposal: Any rack is fine
Networking Setup: # of Connections: 1 - Speed: 10G - VLAN: Private
OS Distro: Bookworm
Boot Method: Legacy BIOS
Sub-team Technical Contact: @Clement_Goubert @jasmine_

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

deploy2003
  • Receive in system on procurement task T399022 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script. Note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
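
The checklist above can be tracked as an ordered list. The following is a minimal Python sketch, not part of any Wikimedia tooling: the step labels paraphrase the checklist, and `CHECKLIST` and `next_step` are hypothetical names. It only reports the first step not yet marked done; it does not enforce strict ordering, since in this task the puppet change was merged before the hardware was racked.

```python
from typing import Iterable, Optional

# Step labels paraphrase the checklist; the tuple order is the checklist order.
CHECKLIST = (
    "receive in system (procurement task and Coupa)",
    "rack system and update Netbox",
    "run Netbox script: provision a server's network attributes",
    "run cookbook sre.dns.netbox",
    "run cookbook sre.hosts.provision",
    "run cookbook sre.hardware.upgrade-firmware",
    "update operations/puppet (preseed.yaml, site.pp)",
    "run cookbook sre.hosts.reimage",
)


def next_step(done: Iterable[str]) -> Optional[str]:
    """Return the first checklist step not in `done`, or None if all are done."""
    finished = set(done)
    for step in CHECKLIST:
        if step not in finished:
            return step
    return None
```

For example, `next_step(CHECKLIST[:2])` returns the Netbox script step, and `next_step(CHECKLIST)` returns `None`.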

Related Objects

Status: Resolved, Assigned: Jhancock.wm

Event Timeline

RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-codfw board.
RobH unsubscribed.

@Clement_Goubert,

Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and add the new servers to preseed.yaml for partition info.

If possible, please reference this task number in your patch set, so it is clear when it is complete. Once complete, just un-assign yourself from this task (leaving no assignee); once the hardware arrives, on-site engineers will claim this task for racking and setup. Please don't re-subscribe me to this task unless there is a direct question for me.

Thank you!

Change #1172660 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] deploy2003: Add to site.pp

https://gerrit.wikimedia.org/r/1172660

Change #1172660 merged by Clément Goubert:

[operations/puppet@production] deploy2003: Add to site.pp

https://gerrit.wikimedia.org/r/1172660

Clement_Goubert added a subscriber: RobH.

Done :)

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host deploy2003.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host deploy2003.codfw.wmnet with OS bookworm executed with errors:

  • deploy2003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console deploy2003.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host deploy2003.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host deploy2003.codfw.wmnet with OS bookworm executed with errors:

  • deploy2003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console deploy2003.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host deploy2003.codfw.wmnet with OS bookworm

@Papaul this one is going to fail again. It looks like there might be a mismatch between the hardware and site.pp or preseed. I'm not sure which, but the host exists in both of those files.

[32/50, retrying in 96.00s] Attempt to run 'cookbooks.sre.hosts.reimage.ReimageRunner._populate_puppetdb.<locals>.poll_puppetdb' raised: Nagios_host resource with title deploy2003 not found yet
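
The retry line above has the shape of a bounded poll with growing waits. Below is a minimal Python sketch of that pattern; it is not the cookbook's code. The linear backoff and the 3-second base delay are assumptions, chosen only because they reproduce the logged numbers (attempt 32 of 50, waiting 96.00 s). `poll`, `NotFoundYet`, and all parameter names are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class NotFoundYet(Exception):
    """Raised by a check when the awaited resource is not visible yet."""


def poll(
    check: Callable[[], T],
    tries: int = 50,                 # assumed from the "/50" in the log line
    delay: float = 3.0,              # assumed base delay: 32 * 3.0 = 96.0 s
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `check` until it returns, waiting delay * attempt between tries."""
    if tries < 1:
        raise ValueError("tries must be at least 1")
    for attempt in range(1, tries + 1):
        try:
            return check()
        except NotFoundYet as exc:
            if attempt == tries:
                raise  # out of attempts: surface the last failure
            wait = delay * attempt  # linear backoff
            print(f"[{attempt}/{tries}, retrying in {wait:.2f}s] raised: {exc}")
            sleep(wait)
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. Under these assumed values, a check that never succeeds would take roughly an hour to exhaust all 50 attempts.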

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host deploy2003.codfw.wmnet with OS bookworm executed with errors:

  • deploy2003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console deploy2003.codfw.wmnet" to get a root shell, but depending on the failure this may not work.
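
To see where a run stopped, the bullet list in a report can be read mechanically. The following is a small Python sketch under stated assumptions: it expects the pasted report text in the bullet format shown above, and `completed_steps` and `FAIL_NOTICE` are hypothetical names, not part of the cookbook.

```python
from typing import List

FAIL_NOTICE = "The reimage failed"


def completed_steps(report: str) -> List[str]:
    """Return the step bullets that precede the failure notice, in order."""
    steps = []
    for raw in report.splitlines():
        line = raw.strip()
        if not line.startswith("•"):
            continue  # ignore anything that is not a bullet
        text = line.lstrip("•").strip()
        if text.endswith(("(FAIL)", "(PASS)")):
            continue  # host summary line, not a step
        if text.startswith(FAIL_NOTICE):
            break  # everything before this point completed
        steps.append(text)
    return steps
```

Applied to each of the three failed reports above, the last entry is "Run Puppet in NOOP mode to populate exported resources in PuppetDB". In the later passing run, that step is followed by "Found Nagios_host resource for this host in PuppetDB", which matches the Nagios_host retry message.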

@Jhancock.wm no entry on the wrong puppet server for this server. Please check site.pp. Thanks

@Clement_Goubert I think there might be a mismatch in site.pp on this one. I am not sure exactly what it is. We had a similar issue in T400195. Could you check it for me when you have a chance? Thanks!

Change #1183124 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] site.pp: Fix insetup::serviceops

https://gerrit.wikimedia.org/r/1183124

Change #1183124 merged by Clément Goubert:

[operations/puppet@production] site.pp: Fix insetup::serviceops

https://gerrit.wikimedia.org/r/1183124

The CR I just merged should fix it.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host deploy2003.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host deploy2003.codfw.wmnet with OS bookworm executed with errors:

  • deploy2003 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console deploy2003.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host deploy2003.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host deploy2003.codfw.wmnet with OS bookworm completed:

  • deploy2003 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202509021336_jhancock_2078355_deploy2003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Jhancock.wm updated the task description.

@Clement_Goubert @jasmine_ this is complete and ready for y'all!