
Q4: rack/setup/install cloudcephosd10[48-51]
Closed, Resolved · Public

Description

This task will track the racking, setup, and OS installation of cloudcephosd10[48-51]

Hostname / Racking / Installation Details

Hostnames: Sequential names from cloudcephosd1048.eqiad.wmnet through cloudcephosd1051.eqiad.wmnet.
Racking Proposal: E4/F4 even split.
Networking Setup: # of Connections: 2 - Speed: 25G - VLAN: plug into the cloudsw in the rack (2 x 25G ports per server). Each host should have its primary interface on cloud-hosts1-eqiad and its secondary on cloud-storage1-eqiad.
OS Distro: Bullseye
Boot Method: Legacy BIOS
Sub-team Technical Contact: Andrew or Dcaro

Per host setup checklist (an illustrative command sketch follows the four lists below)

cloudcephosd1048
  • Receive in system on procurement task TT389851 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
cloudcephosd1049
  • Receive in system on procurement task TT389851 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
cloudcephosd1050
  • Receive in system on procurement task TT389851 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
cloudcephosd1051
  • Receive in system on procurement task TT389851 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
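The checklists above name the cookbooks but not how they are invoked. A minimal sketch of the flow for one host, assuming the usual cumin-host workflow; the exact flags and arguments (--os, -t, the message passed to sre.dns.netbox) are assumptions based on common usage, and TXXXXXX stands in for this task's ID:

sudo cookbook sre.dns.netbox "Add records for cloudcephosd1048"   # propagate Netbox-generated DNS
sudo cookbook sre.hosts.provision cloudcephosd1048                # provision BMC/BIOS settings
sudo cookbook sre.hardware.upgrade-firmware cloudcephosd1048      # bring firmware up to date
# after the operations/puppet changes (preseed.yaml, site.pp role) are merged:
sudo cookbook sre.hosts.reimage --os bullseye -t TXXXXXX cloudcephosd1048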

Event Timeline


Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1051.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1051 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202506022254_jclark_3134980_cloudcephosd1051.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1048.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1048.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1048 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202506031316_jclark_3598631_cloudcephosd1048.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Jclark-ctr updated the task description.
Andrew reassigned this task from Jclark-ctr to cmooney.
Andrew added a subscriber: Jclark-ctr.

The cookbook is rejecting cloudcephosd1048 for failing network tests, so I assume that this host needs some kind of exotic Netbox work done.

Here's a working OSD node (cloudcephosd1040):

2: eno12399np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
    link/ether d4:04:e6:c1:d8:e0 brd ff:ff:ff:ff:ff:ff
    altname enp42s0f0np0
    inet 10.64.148.18/24 brd 10.64.148.255 scope global eno12399np0
       valid_lft forever preferred_lft forever
    inet6 2620:0:861:11c:10:64:148:18/64 scope global 
       valid_lft 2592000sec preferred_lft 604800sec
    inet6 fe80::d604:e6ff:fec1:d8e0/64 scope link 
       valid_lft forever preferred_lft forever

And here's 1048:

2: enp10s0f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 6c:92:cf:a1:f1:e0 brd ff:ff:ff:ff:ff:ff
    inet 10.64.148.29/24 brd 10.64.148.255 scope global enp10s0f0np0
       valid_lft forever preferred_lft forever
    inet6 2620:0:861:11c:10:64:148:29/64 scope global 
       valid_lft 2591997sec preferred_lft 604797sec
    inet6 fe80::6e92:cfff:fea1:f1e0/64 scope link 
       valid_lft forever preferred_lft forever

Among other things, the connection speed for 1048 looks pretty wrong; we were hoping this would have a 25G connection.
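One quick way to confirm the negotiated speed and MTU from the host itself (standard ethtool/ip usage; the interface name is taken from the output above):

ethtool enp10s0f0np0 | grep -i speed   # expect "Speed: 25000Mb/s" on a 25G DAC
ip link show enp10s0f0np0              # prints the configured MTU (9000 on the working node)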

> Among other things, the connection speed for 1048 looks pretty wrong; we were hoping this would have a 25G connection.

Yeah, it's connected with a 10G DAC. That is what's requested at the top of the task. You'll need to discuss with dc-ops about changing it to a 25G link if that needs to change.

> The cookbook is rejecting cloudcephosd1048 for failing network tests

Which cookbook is this? Certainly the Ceph nodes usually get two ports connected; per your example, 1040 has two configured and status UP:

cmooney@cloudcephosd1040:~$ ip -br addr show scope global | grep UP 
eno12399np0      UP             10.64.148.18/24 2620:0:861:11c:10:64:148:18/64 
eno12409np1      UP             192.168.5.10/24

Whereas 1048 does not:

cmooney@cloudcephosd1048:~$ ip -br addr show scope global | grep UP 
enp10s0f0np0     UP             10.64.148.29/24 2620:0:861:11c:10:64:148:29/64

Usually that would be requested in the task, and dc-ops would comment back to say what switch port the second link was connected to, after which we could add it to Netbox manually. I think since the top of the task said '# of connections: 1' that hasn't happened though. So if these are gonna be regular Ceph hosts we need to get the second port on each connected to the switch, and have dc-ops tell us what ports are used, after which we can configure that second link.

Sorry @Jclark-ctr, I've made a bit of a mess of this.

Ideally each of these hosts would have 2x25G connections, one to each cloudsw. Are there enough ports available to do that?

Looks like I'm getting ahead of things a bit. We definitely do need 2 connections per host, but it's unclear whether we're skipping to 25G or starting with 10G and seeing how it goes. Will consult with @dcaro when he's back from PTO.

One issue with 25G is that we don't have the DACs for that yet; they're expected to arrive sometime before the end of June.

Andrew added a subscriber: cmooney.

@Jclark-ctr, we would like to wait until the 25G DACs come in, and then have each of these hosts reconnected to 25G ports, 2 ports per host. The primary interface will be on cloud-hosts and the second interface on cloud-private.

@Andrew Would it be possible to use a single 25G uplink (cf. T325531: ceph: test and decide 1 network interface setup) to simplify automation and the overall design (all reasons on Wikitech)? Then revisit the day we get close to saturating it.

> @Andrew Would it be possible to use a single 25G uplink (cf. T325531: ceph: test and decide 1 network interface setup) to simplify automation and the overall design (all reasons on Wikitech)? Then revisit the day we get close to saturating it.

That should be possible as long as I can get support with refactoring our puppet setup. Pinging @dcaro to see if he disagrees.

>> @Andrew Would it be possible to use a single 25G uplink (cf. T325531: ceph: test and decide 1 network interface setup) to simplify automation and the overall design (all reasons on Wikitech)? Then revisit the day we get close to saturating it.

> That should be possible as long as I can get support with refactoring our puppet setup. Pinging @dcaro to see if he disagrees.

That conflicts with the HA setup we want to get to, no? (two links, one to each switch)

There is currently only one switch per rack, so I suggest we only use one uplink for now, and revisit it the day we have more.

> There is currently only one switch per rack, so I suggest we only use one uplink for now, and revisit it the day we have more.

That's ok, but I would go carefully: we have not tested single-NIC OSDs, we will need to set up both networks on them (internal storage traffic and external), and we should monitor that they behave ok at the host level too.
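For context, even on a single NIC Ceph still distinguishes the two networks in its config. A minimal, illustrative sketch of the relevant ceph.conf keys (the keys are standard Ceph; the subnets here are examples drawn from this deployment, and the real values are managed by puppet):

[global]
# client/"external" traffic (cloud-hosts vlan, illustrative subnet)
public_network  = 10.64.148.0/24
# internal replication/storage traffic (cloud-storage vlan, rack E4)
cluster_network = 192.168.5.0/24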

@dcaro @Andrew @cmooney @ayounsi I need some assistance opening a block of 4x ports on cloudsw1-f4-eqiad.
The least disruptive option would be to relocate eth10 on cloudvirt1073 to eth13.

Mentioned in SAL (#wikimedia-cloud-feed) [2025-07-09T01:12:35Z] <andrew@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1073.eqiad.wmnet' (T394333)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-07-09T01:26:00Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1073.eqiad.wmnet' (T394333)

> @dcaro @Andrew @cmooney @ayounsi I need some assistance opening a block of 4x ports on cloudsw1-f4-eqiad.
> The least disruptive option would be to relocate eth10 on cloudvirt1073 to eth13.

I've drained cloudvirt1073 so the cables can be moved any time. You'll want to downtime it in alertmanager first.

Change #1167564 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/software/homer/deploy@master] WMF Plugin: do not process disabled ports for block speed setting

https://gerrit.wikimedia.org/r/1167564

> That should be possible as long as I can get support with refactoring our puppet setup. Pinging @dcaro to see if he disagrees.

Ok, yep, we need to look into how to achieve this. On the wire, the network setup will mean the "storage" vlan is trunked on the main physical port and terminated on a separate vlan interface. The vlans used for storage are as follows (racks C8 & D5 share the same subnet/vlan due to the way things evolved historically, another small complication):

Site   Rack  Vlan ID  Subnet          Vlan Name
eqiad  C8    1106     192.168.4.0/24  cloud-storage1-eqiad
eqiad  D5    1106     192.168.4.0/24  cloud-storage1-eqiad
eqiad  E4    1121     192.168.5.0/24  cloud-storage1-e4-eqiad
eqiad  F4    1122     192.168.6.0/24  cloud-storage1-f4-eqiad
codfw  B1    2106     192.168.4.0/24  cloud-storage1-b-codfw
The IPs configured for the storage network are (afaik) set up in puppet here, which also references the interface the storage IP goes on (currently the physical second link).

Probably a little naive, but this may be a way we can proceed:

  • Ensure the storage vlan is trunked to the primary interface of all the cloudcephosds on the switch side (non-disruptive, netops can do it)
  • Create puppet patches to add the new vlan sub-interface for the appropriate vlan id as a child of the main physical interface (a rough sketch follows this list)
    • Similar to how the cloud-private is added on others
    • Merely creating the interface - with no IPs on it - should be possible across the fleet without affecting any traffic paths
  • Starting with these new hosts we can then change the cluster network 'iface' in hiera from the physical second port to the new vlan interface
    • We also need to make sure the aggregate 192.168.0.0/16 route is present (it should be)
  • Once all cloudcephosd hosts have the 'iface' for the cluster network as the sub-int we can remove the second physical links

TBD exactly, but that might be a rough idea of how to approach it. Happy to discuss.
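A minimal sketch of what the vlan sub-interface from step 2 could look like on a rack-E4 host, in Debian ifupdown syntax (the interface name, address, and MTU are illustrative assumptions; in practice puppet would render this):

# hypothetical /etc/network/interfaces.d/ stanza: storage vlan 1121 as a
# child of the primary physical interface
auto enp10s0f0np0.1121
iface enp10s0f0np0.1121 inet static
    address 192.168.5.29/24
    mtu 9000
    vlan-raw-device enp10s0f0np0
# sanity check afterwards: the aggregate route should still be present
#   ip route show 192.168.0.0/16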

For the time being I have trunked the storage vlan to cloudcephosd1049 and cloudcephosd1050 on their 25G ports so they are set up as we intend network-wise (those links are up and connected fwiw).

1050 and 1051 are now connected and ports up too.

cmooney@cloudsw1-f4-eqiad> show interfaces descriptions | match "0/0/8|0/0/9" 
et-0/0/8        up    up   cloudcephosd1050 {#B00263}
et-0/0/9        up    up   cloudcephosd1051 {#B00261}

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1048.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1049.eqiad.wmnet with OS bullseye

@elukey I am having issues with 2 servers; both fail to reimage after switching to the 25G DACs: cloudcephosd1050 and cloudcephosd1051.

    self.dns.resolve_ips(dns_name)  # Will raise if not valid
  File "/usr/lib/python3/dist-packages/wmflib/dns.py", line 124, in resolve_ips
    raise DnsNotFoundError(f"Record A or AAAA not found for {name}")

wmflib.dns.DnsNotFoundError: Record A or AAAA not found for cloudcephosd1050.eqiad.wmnet
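A quick way to confirm the records exist before retrying (standard dig usage; the cookbook's validation step raises exactly this error when neither record resolves):

dig +short A    cloudcephosd1050.eqiad.wmnet
dig +short AAAA cloudcephosd1050.eqiad.wmnet
# if both come back empty, re-run the sre.dns.netbox cookbook so the
# Netbox-generated records reach the authoritative nameservers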

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1048.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1048 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507091238_jclark_110721_cloudcephosd1048.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1049.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1049 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507091241_jclark_111126_cloudcephosd1049.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1051.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1050.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1050.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1050 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1050.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1051.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1051 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1051.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1050.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1051.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1050.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1050 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1050.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1051.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1051 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1051.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Full stack trace:

2025-07-09 12:36:49,306 jclark 138654 [INFO] Completed command 'puppet lookup --render-as s --compile --node cloudcephosd1051.eqiad.wmnet profile::puppet::agent::force_puppet7 2>/dev/null'
2025-07-09 12:36:49,308 jclark 138654 [INFO] Lookup result for force_puppet7: true
2025-07-09 12:37:00,496 jclark 138541 [ERROR] Exception raised while initializing the Cookbook sre.hosts.reimage:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 205, in run
    runner = self.instance.get_runner(args)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 129, in get_runner
    return ReimageRunner(args, self.spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 261, in __init__
    self._validate()
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 373, in _validate
    self.dns.resolve_ips(dns_name)  # Will raise if not valid
  File "/usr/lib/python3/dist-packages/wmflib/dns.py", line 124, in resolve_ips
    raise DnsNotFoundError(f"Record A or AAAA not found for {name}")
wmflib.dns.DnsNotFoundError: Record A or AAAA not found for cloudcephosd1050.eqiad.wmnet
2025-07-09 12:37:01,590 jclark 138654 [ERROR] Exception raised while initializing the Cookbook sre.hosts.reimage:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 205, in run
    runner = self.instance.get_runner(args)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 129, in get_runner
    return ReimageRunner(args, self.spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 261, in __init__
    self._validate()
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 373, in _validate
    self.dns.resolve_ips(dns_name)  # Will raise if not valid
  File "/usr/lib/python3/dist-packages/wmflib/dns.py", line 124, in resolve_ips
    raise DnsNotFoundError(f"Record A or AAAA not found for {name}")
wmflib.dns.DnsNotFoundError: Record A or AAAA not found for cloudcephosd1051.eqiad.wmnet

@Jclark-ctr I'll try to investigate what's happening tomorrow!

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1050.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1051.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1051.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1051 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1051.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1051.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1050.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1050 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507092013_jclark_574577_cloudcephosd1050.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1051.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1051 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507092017_jclark_575653_cloudcephosd1051.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Jclark-ctr updated the task description.

@Jclark-ctr IIUC it was a temporary failure, right?

I created the below task to continue the discussion of how we set up the interfaces for these hosts, and copied my comments from above.

T399180: Cloudcephosd: migrate to single network uplink

> @Jclark-ctr IIUC it was a temporary failure, right?

Yes, that was a temporary failure caused by myself. Thanks for checking.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1048.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1048.eqiad.wmnet with OS bookworm completed:

  • cloudcephosd1048 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507101928_jclark_1601618_cloudcephosd1048.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1049.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1051.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1050.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1051.eqiad.wmnet with OS bookworm completed:

  • cloudcephosd1051 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507111252_jclark_2217355_cloudcephosd1051.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1049.eqiad.wmnet with OS bookworm completed:

  • cloudcephosd1049 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507111256_jclark_2217474_cloudcephosd1049.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1050.eqiad.wmnet with OS bookworm completed:

  • cloudcephosd1050 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507111303_jclark_2217425_cloudcephosd1050.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

@Jclark-ctr as discussed in our call on Tuesday we will be connecting the second SFP port on these hosts to the switches too, as we need to solve the MTU issue before proceeding with T399180: Cloudcephosd: migrate to single network uplink.

Do you have four 25G DACs to do this? If so, please cable them up and let me know the ports; I'll add the config on the switch. It'll have to be the remaining two 25G ports in the blocks we used for the primary links. Thanks.

Jclark-ctr mentioned this in Unknown Object (Task). Jul 17 2025, 3:57 PM

Change #1167564 merged by Cathal Mooney:

[operations/software/homer/deploy@master] WMF Plugin: do not process disabled ports for block speed setting

https://gerrit.wikimedia.org/r/1167564

Change #1174022 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Add puppet role and preseed for cloudcephosd1052

https://gerrit.wikimedia.org/r/1174022

This comment was removed by Andrew.

@Jclark-ctr are we waiting on more DACs before we can move ahead with these?

Change #1174022 merged by Andrew Bogott:

[operations/puppet@production] Add puppet role and preseed for cloudcephosd1052

https://gerrit.wikimedia.org/r/1174022

> @Jclark-ctr are we waiting on more DACs before we can move ahead with these?

We are awaiting the delivery from {T399869}

@cmooney Would you be able to assist with setting up the eth1 links? The servers are already imaged on eth0. I believe this ticket can be closed once the ports are configured and enabled on the switch.

Host              Interface  Switch             Cable ID  Port
cloudcephosd1048  eth1       cloudsw1-e4-eqiad  800342    17
cloudcephosd1049  eth1       cloudsw1-e4-eqiad  800344    18
cloudcephosd1050  eth1       cloudsw1-f4-eqiad  800341    10
cloudcephosd1051  eth1       cloudsw1-f4-eqiad  800343    11
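For reference, enabling those ports at 25G on the switch side would look roughly like the following (Junos QFX syntax; in practice homer renders this config, so the statements below are an illustrative assumption, with descriptions following the convention seen earlier in this task):

set chassis fpc 0 pic 0 port 17 speed 25g
set interfaces et-0/0/17 description "cloudcephosd1048 {#800342}"
set chassis fpc 0 pic 0 port 18 speed 25g
set interfaces et-0/0/18 description "cloudcephosd1049 {#800344}"
# and likewise ports 10/11 on cloudsw1-f4-eqiad for cloudcephosd1050/1051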

@Jclark-ctr that is done now; all four hosts' second ports are connected and running at 25G.

@Andrew you can proceed to set them up fully now I think.