
Q4:rack/setup/install cloudcephosd10[39-41]
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of cloudcephosd10[39-41]

Hostname / Racking / Installation Details

Hostnames: cloudcephosd10[39-41].eqiad.wmnet
Racking Proposal: All three in E4. If we need to remove older osds to make room first, coordinate with andrew or dcaro
Networking Setup: 2 x 10G interfaces per server, connected to cloudswitches. Check other hosts (e.g. cloudcephosd1012) for switch config specifics.
Partitioning/RAID: Software RAID mirror for the two smaller OS drives; the other drives can be left unpartitioned for Ceph to manage.
OS Distro: Bullseye (default unless otherwise specified)
Sub-team Technical Contact: David Caro
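
The bracket notation in the hostname line stands for three hosts. As a minimal sketch (the `expand_hosts` helper is hypothetical and not part of any Wikimedia tooling), the range can be expanded like this:

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one bracketed numeric range, e.g. 'name10[39-41].example'."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        # No range present: the pattern already names a single host.
        return [pattern]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # keep any zero padding used in the range
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]


for host in expand_hosts("cloudcephosd10[39-41].eqiad.wmnet"):
    print(host)
```

This only handles a single numeric range per pattern, which is all this task needs.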

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

cloudcephosd1039
  • Receive in system on procurement task T361366 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script. Note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
cloudcephosd1040
  • Receive in system on procurement task T361366 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script. Note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
cloudcephosd1041
  • Receive in system on procurement task T361366 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script. Note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
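
The per-host checklists above are the same template repeated once per hostname. A small sketch of generating them, assuming shortened step wording and a hypothetical `render_checklist` helper (neither is an official template or tool):

```python
# Shortened versions of the checklist steps listed in this task.
CHECKLIST_STEPS = [
    "Receive in system on procurement task T361366 & in Coupa",
    "Rack system with proposed racking plan & update Netbox",
    "Run the \"Provision a server's network attributes\" Netbox script",
    "Immediately run the sre.dns.netbox cookbook",
    "Immediately run the sre.hosts.provision cookbook",
    "Run the sre.hardware.upgrade-firmware cookbook",
    "Update the operations/puppet repo (preseed.yaml and site.pp)",
    "Run the sre.hosts.reimage cookbook",
]


def render_checklist(hosts: list[str]) -> str:
    """Return one copy of the checklist per host, in the task's bullet style."""
    lines = []
    for host in hosts:
        lines.append(host)
        lines.extend(f"  • {step}" for step in CHECKLIST_STEPS)
    return "\n".join(lines)


print(render_checklist(["cloudcephosd1039", "cloudcephosd1040", "cloudcephosd1041"]))
```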

Related Objects

Status    Subtype  Assigned    Task
Resolved           Jclark-ctr
Resolved           dcaro

Event Timeline

RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH added a parent task: Unknown Object (Task).
RobH mentioned this in Unknown Object (Task).
RobH unsubscribed.

cloudcephosd1039
2nd cable serial#20220008 port 1
cloudcephosd1040
2nd cable serial#20220043 port 5
cloudcephosd1041
2nd cable serial#20220011 port 7

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1039.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1041.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1040.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1040.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1040.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1040 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cloudcephosd1040.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1040.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1040.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1040 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407011432_jclark_2158300_cloudcephosd1040.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
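
The first-run log path reported above appears to encode a start time, the user, a number, and the hostname. The following sketch splits it on that assumption; the field meanings are inferred from the file name alone, the numeric field is not identified anywhere in this task, and the time zone is unknown. `parse_reimage_log` is a hypothetical helper.

```python
from datetime import datetime
from pathlib import PurePosixPath


def parse_reimage_log(path: str) -> dict:
    """Split a reimage log file name into its apparent components."""
    stem = PurePosixPath(path).stem  # drop the directory and '.out'
    stamp, user, number, host = stem.split("_", 3)
    return {
        "started": datetime.strptime(stamp, "%Y%m%d%H%M"),  # assumed format
        "user": user,
        "number": int(number),  # meaning not stated in this task
        "host": host,
    }


info = parse_reimage_log(
    "/var/log/spicerack/sre/hosts/reimage/"
    "202407011432_jclark_2158300_cloudcephosd1040.out"
)
print(info["host"], info["user"], info["started"].isoformat())
```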

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1041.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1039.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1039.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1039 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407011638_jclark_2180488_cloudcephosd1039.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1041.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1041 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cloudcephosd1041.eqiad.wmnet to get a root shell, but depending on the failure this may not work.
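
With several reimage attempts interleaved in the timeline, it can help to reduce the cookbook status lines to the latest result per host. This is a rough sketch that assumes the line wording seen in this task; the regular expression and the `latest_outcomes` helper are illustrative and not part of Spicerack.

```python
import re

# Matches both the "was started by" notices and the final status lines.
LINE_RE = re.compile(
    r"Cookbook (?P<cookbook>\S+) (?:was )?started by (?P<user>\S+) "
    r"for host (?P<host>\S+) with OS (?P<os>\S+)"
    r"(?: (?P<outcome>completed|executed with errors):)?"
)


def latest_outcomes(lines: list[str]) -> dict[str, str]:
    """Map each host to PASS or FAIL based on its most recent status line."""
    outcomes = {}
    for line in lines:
        match = LINE_RE.fullmatch(line.strip())
        if match is None or match["outcome"] is None:
            continue  # skip unrelated lines and plain "started" notices
        outcomes[match["host"]] = (
            "PASS" if match["outcome"] == "completed" else "FAIL"
        )
    return outcomes


base = "Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host "
timeline = [
    base + "cloudcephosd1040.eqiad.wmnet with OS bullseye executed with errors:",
    base + "cloudcephosd1040.eqiad.wmnet with OS bullseye completed:",
    base + "cloudcephosd1039.eqiad.wmnet with OS bullseye completed:",
    base + "cloudcephosd1041.eqiad.wmnet with OS bullseye executed with errors:",
]
for host, outcome in latest_outcomes(timeline).items():
    print(host, outcome)
```

For the status lines posted so far, this leaves cloudcephosd1039 and cloudcephosd1040 as PASS and cloudcephosd1041 as FAIL.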

Jclark-ctr updated the task description.

cloudcephosd1039
2nd cable serial#20220008 port 1
cloudcephosd1040
2nd cable serial#20220043 port 5
cloudcephosd1041
2nd cable serial#20220011 port 7

I have now added these links in Netbox and set them up for the cloud-storage-e4 VLAN, untagged. I also changed the primary ports to trunk the cloud-private-e4 VLAN.

ayounsi subscribed.

https://netbox.wikimedia.org/extras/scripts/results/78992/
cloudcephosd1039 (WMF11571) /dcim/devices/5296/ Primary IPv6 missing DNS name
I guess the skip IPv6 box got checked by mistake. Could someone add the host's FQDN to https://netbox.wikimedia.org/ipam/ip-addresses/17171/ (similar to https://netbox.wikimedia.org/ipam/ip-addresses/17159/) and then run the sre.dns.netbox cookbook?

I got this when trying to set the FQDN (I checked other hosts that have the FQDN set on the IPv6 address, and they don't have the role set; maybe a new requirement?):

[Attached screenshot: image.png]