Q2:rack/setup/install mwlog1003
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of mwlog1003

Hostname / Racking / Installation Details

Hostnames: mwlog1003
Racking Proposal: no preference
Networking Setup: # of Connections: 1 - Speed: 10G (only 1G is needed, but 10G is fine) - VLAN: Private
OS Distro: Bookworm
Boot Method: BIOS
Sub-team Technical Contact: herron

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

mwlog1003:
  • Receive in system on procurement task T404793 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script - Note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
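
The checklist above is order-sensitive: the DNS and provision cookbooks must follow the Netbox script, and the reimage comes last. As a minimal sketch only (the step labels are shortened from the checklist, and the `next_step` helper is hypothetical, not part of any Wikimedia tooling), the ordering could be tracked like this:

```python
from typing import List, Optional

# Ordered steps, paraphrased from the per-host checklist above.
CHECKLIST: List[str] = [
    "Receive in system (procurement task and Coupa)",
    "Rack system and update Netbox",
    "Run the Netbox network-attributes provisioning script",
    "sre.dns.netbox",
    "sre.hosts.provision",
    "sre.hardware.upgrade-firmware",
    "Update operations/puppet (preseed.yaml, site.pp)",
    "sre.hosts.reimage",
]


def next_step(done: List[str]) -> Optional[str]:
    """Return the first checklist step not yet completed, or None if all are done."""
    for step in CHECKLIST:
        if step not in done:
            return step
    return None


if __name__ == "__main__":
    # After the Netbox script has run, the DNS cookbook is the next step.
    print(next_step(CHECKLIST[:3]))
```

Keeping the steps in a single ordered list means the "immediately run" cookbooks cannot be skipped without the gap being visible.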

Related Objects

Status: Resolved
Assigned: Jclark-ctr

Event Timeline

RobH mentioned this in Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH added a parent task: Unknown Object (Task).
RobH unsubscribed.

This server is ready to be imaged, pending @herron updating Puppet.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host mwlog1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host mwlog1003.eqiad.wmnet with OS bookworm executed with errors:

  • mwlog1003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console mwlog1003.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host mwlog1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host mwlog1003.eqiad.wmnet with OS bookworm executed with errors:

  • mwlog1003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console mwlog1003.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Server is currently failing to image. I’ve reached out to Herron to review the RAID configuration in Puppet.

Change #1229152 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] mwlog: update partman config

https://gerrit.wikimedia.org/r/1229152

Change #1229152 merged by Herron:

[operations/puppet@production] mwlog: update partman config

https://gerrit.wikimedia.org/r/1229152

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host mwlog1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host mwlog1003.eqiad.wmnet with OS bookworm executed with errors:

  • mwlog1003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console mwlog1003.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host mwlog1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host mwlog1003.eqiad.wmnet with OS bookworm executed with errors:

  • mwlog1003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console mwlog1003.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1003 for host mwlog1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1003 for host mwlog1003.eqiad.wmnet with OS bookworm executed with errors:

  • mwlog1003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console mwlog1003.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1003 for host mwlog1003.eqiad.wmnet with OS bookworm

Change #1229592 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Troubleshooting install on mwlog1003

https://gerrit.wikimedia.org/r/1229592

Change #1229592 merged by Papaul:

[operations/puppet@production] Troubleshooting install on mwlog1003

https://gerrit.wikimedia.org/r/1229592

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1003 for host mwlog1003.eqiad.wmnet with OS bookworm executed with errors:

  • mwlog1003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console mwlog1003.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1003 for host mwlog1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1003 for host mwlog1003.eqiad.wmnet with OS bookworm executed with errors:

  • mwlog1003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console mwlog1003.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1003 for host mwlog1003.eqiad.wmnet with OS bookworm

Change #1229807 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Update partman recipe for mwlog1003

https://gerrit.wikimedia.org/r/1229807

Change #1229807 merged by Papaul:

[operations/puppet@production] Update partman recipe for mwlog1003

https://gerrit.wikimedia.org/r/1229807

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1003 for host mwlog1003.eqiad.wmnet with OS bookworm completed:

  • mwlog1003 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202601220127_pt1979_2942014_mwlog1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
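
The cookbook reports in this timeline share one shape: a host line marked PASS or FAIL, followed by the steps that completed. When comparing several failed attempts, it can help to extract the last step reached. The following is a rough sketch, assuming the report is pasted as plain text with "•" bullets as on this page; `summarize` is a hypothetical helper and not part of Spicerack or the cookbooks.

```python
import re

# Host line, e.g. "  • mwlog1003 (FAIL)"
HOST_LINE = re.compile(r"^\s*•\s*(?P<host>\S+)\s+\((?P<status>PASS|FAIL)\)\s*$")
# Any other bulleted line is treated as a completed step.
STEP_LINE = re.compile(r"^\s*•\s*(?P<step>.+?)\s*$")


def summarize(report: str) -> dict:
    """Return the host, PASS/FAIL status, and last completed step of a pasted report."""
    host = status = None
    steps = []
    for line in report.splitlines():
        m = HOST_LINE.match(line)
        if m:
            host, status = m["host"], m["status"]
            continue
        m = STEP_LINE.match(line)
        if m and host is not None:
            steps.append(m["step"])
    # On failure the final bullet is the generic "The reimage failed" hint, not a step.
    if status == "FAIL" and steps and steps[-1].startswith("The reimage failed"):
        steps.pop()
    return {"host": host, "status": status, "last_step": steps[-1] if steps else None}


# One of the failed attempts from this task, copied verbatim.
SAMPLE = """
  • mwlog1003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console mwlog1003.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
    # {'host': 'mwlog1003', 'status': 'FAIL', 'last_step': 'Host rebooted via IPMI'}
```

Run against the other failed attempts, the same helper would report "Checked BIOS boot parameters are back to normal" as the last step, which makes the one attempt that never reached the Debian installer stand out.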

@herron Hello, the default partman recipe for mwlog is not working with new servers, so to install mwlog1003 I added a host-specific entry for it to the preseed.yaml file:

'mwlog1003':
   - partman/standard.cfg
   - partman/raid10-4dev.cfg

The output of mdadm -D /dev/md0 is the same as for mwlog1002. Please let me know if everything looks good on the server so I can remove the above change from the preseed.yaml file, so that the server defaults to "partman/custom/reuse-lvm-root-4dev.cfg".
Thanks
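
To illustrate why a host-specific entry overrides the default recipe, here is a minimal sketch. It assumes glob-style hostname patterns with the first match winning; the actual key layout and matching logic in operations/puppet may differ. The "mwlog*" pattern and the `recipes_for` helper are hypothetical; only the recipe file names come from this task.

```python
from fnmatch import fnmatchcase
from typing import List, Optional, Tuple

# Hypothetical excerpt of a preseed-style table: (hostname pattern, partman recipes).
# The specific entry is listed before the broader one so that it matches first.
RECIPES: List[Tuple[str, List[str]]] = [
    ("mwlog1003", ["partman/standard.cfg", "partman/raid10-4dev.cfg"]),
    ("mwlog*", ["partman/custom/reuse-lvm-root-4dev.cfg"]),  # assumed default pattern
]


def recipes_for(hostname: str) -> Optional[List[str]]:
    """Return the recipes of the first pattern that matches hostname, or None."""
    for pattern, recipes in RECIPES:
        if fnmatchcase(hostname, pattern):
            return recipes
    return None


if __name__ == "__main__":
    print(recipes_for("mwlog1003"))  # temporary override used for the fresh install
    print(recipes_for("mwlog1002"))  # falls through to the reuse recipe
```

Under these assumptions, removing the "mwlog1003" entry makes the host fall back to the reuse recipe, which is the effect of the revert discussed in this task.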

Thanks so much for sorting through this, @Papaul and @Jclark-ctr! Yes, it looks good to me and is ready to revert to the reuse variant. Thanks again!

Change #1230369 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Revert to use the old partman for mwlog nodes

https://gerrit.wikimedia.org/r/1230369

Change #1230369 merged by Papaul:

[operations/puppet@production] Revert to use the old partman for mwlog nodes

https://gerrit.wikimedia.org/r/1230369

@herron Complete; closing the task. Thank you.