
replace cloudgw100[12] with spare 'second region' dev servers cloudnet100[78]-dev
Closed, Resolved · Public

Description

T382220 suggests that cloudgw1002 might be having hardware issues, and @aborrero proposes that we replace it.

We have several servers already racked from a cancelled dev experiment that we could use to replace this workload without ordering more hardware.

Existing cloudgw servers:

cloudgw1001: C8; 8 cores, 32GB RAM, 2x 10GB ports, 2x 240GB SSD, purchased March 2021

cloudgw1002: D5; 8 cores, 32GB RAM, 2x 10GB ports, 2x 240GB SSD, purchased March 2021

Unused cloud-dev servers:

cloudnet1007-dev: E4; 2x12 cores, 64GB RAM, 2x 10GB ports, 4x 960GB SSD, purchased August 2023

cloudnet1008-dev: F4; 2x12 cores, 64GB RAM, 2x 10GB ports, 4x 960GB SSD, purchased August 2023

cloudcontrol1008-dev: D5; 2x12 cores, 64GB RAM, 2x 10GB ports, 4x 960GB SSD, purchased August 2023

cloudcontrol1009-dev: E4; 2x12 cores, 64GB RAM, 2x 10GB ports, 4x 960GB SSD, purchased August 2023

cloudcontrol1010-dev: F4; 2x12 cores, 64GB RAM, 2x 10GB ports, 4x 960GB SSD, purchased August 2023

I propose that we replace both cloudgw100[12] servers (not at the same time, of course) with renamed cloudnet100[78] boxes.

There are a couple of caveats:

  1. Does it matter that the replacement servers are in different racks? Pinging @aborrero and @cmooney for an answer
  2. Is renaming servers in place in a datacenter so awful that we should never ever do it? Pinging @RobH for an answer


Event Timeline

Does it matter that the replacement servers are in different racks? Pinging @aborrero and @cmooney for an answer

Yeah, we'll need to move them. The two cloudgws should be in C8 and D5 in eqiad. If they were placed in E4/F4, the flow of traffic to and from those cabs that needs to traverse the cloudgw would be massively sub-optimal and would probably lead to congestion.

cloudnet1007-dev: E4; 2x12 cores, 64GB RAM, 2x 10GB ports, 4x 960GB SSD, purchased August 2023

If the CPUs run at the same speed, going from 8 to 12 cores does help. However, the dual-socket system is a more complex beast: we will need to ensure that all the NIC RX/TX queues are assigned to CPU cores on the socket connected to the same PCIe root complex as the NIC. I'm not sure if the kernel will do that by default or not. It's also definitely worth disabling some of those offloads.
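For reference, a rough sketch of how one might verify the NIC's NUMA locality and queue IRQ placement, and disable a couple of offloads (the interface name eno1 is a placeholder; exact sysfs paths, IRQ naming, and ethtool feature names depend on the driver):

# Which NUMA node the NIC's PCIe slot is attached to (placeholder interface eno1)
cat /sys/class/net/eno1/device/numa_node

# CPUs belonging to each NUMA node, to compare against the IRQ affinities below
lscpu | grep -i 'numa node'

# Current IRQ -> CPU affinity for the NIC's RX/TX queue interrupts
grep eno1 /proc/interrupts | awk '{print $1}' | tr -d ':' | \
  while read -r irq; do
    printf 'IRQ %s -> CPUs %s\n' "$irq" "$(cat /proc/irq/$irq/smp_affinity_list)"
  done

# Example of disabling offloads (GRO/LRO) that can hurt a forwarding/NAT box
sudo ethtool -K eno1 gro off lro off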

Created T382412 about relocating the cloudnet-dev servers.

Servers can be renamed, but there are a few places that may cause issues if they are not updated.

  • Update the DNS name for the IP and the mgmt IP DNS entries (IPv4 and IPv6) and run the DNS update cookbook.
  • Update the server hostname in Netbox.
    • When the hostname is updated in Netbox, a sub-task for the #ops-sitename queue should be created to apply new hostname labels to the front and back of the server. This step is often forgotten and can lead to confusion when troubleshooting hosts later.

There are even some directions on the lifecycle page, and a rename cookbook that I have not personally run, so I'm not sure if it is accurate, but I'd assume so since automation tends to keep their sections of the lifecycle page well updated: https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging

The main things that folks outside of DC ops overlook are the mgmt IP DNS and the physical hostname labels we have to update.
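As a concrete check after a rename, something like the following can confirm the new records resolve (hostnames are the new ones from this task; the mgmt record format is assumed to follow the usual <host>.mgmt.<site>.wmnet scheme):

# Forward records for the renamed host (IPv4 and IPv6)
dig +short cloudgw1003.eqiad.wmnet A
dig +short cloudgw1003.eqiad.wmnet AAAA

# The mgmt interface record (record name format is an assumption)
dig +short cloudgw1003.mgmt.eqiad.wmnet A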

Reminder: verify the VLAN trunk on the NIC of the cloudgw servers.
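One way to sanity-check this from the host side (the switch-side trunk config would be verified in Netbox or on the switch; interface and VLAN names below are placeholders):

# List all 802.1Q sub-interfaces and their VLAN IDs (-d shows vlan details)
ip -d link show type vlan

# Confirm a specific expected tagged sub-interface exists and is UP
# (placeholder names; the real interface/VLAN depend on the cloudgw config)
ip -d link show dev eno2.2107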

Change #1114997 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw1003: take over cloudgw1001

https://gerrit.wikimedia.org/r/1114997

Change #1114998 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw1004: take over cloudgw1002

https://gerrit.wikimedia.org/r/1114998

fnegri assigned this task to aborrero.

DNS changes:

Then run:

aborrero@cumin1002:~ $ sudo cookbook sre.dns.netbox -t T382356 'cloudgw updates'

Mentioned in SAL (#wikimedia-cloud) [2025-02-04T12:48:44Z] <arturo> replacing cloudgw1002 with cloudgw1004 - T382356

Change #1114998 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw1004: take over cloudgw1002

https://gerrit.wikimedia.org/r/1114998

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudgw1004.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudgw1002.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudgw1004.eqiad.wmnet with OS bookworm executed with errors:

  • cloudgw1004 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudgw1004.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudgw1004.eqiad.wmnet with OS bullseye

We decided to reimage the new hosts cloudgw1004 and cloudgw1003 on bullseye rather than bookworm, and to first test the puppet profile on bookworm in codfw1dev.

Change #1117189 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloudgw100[34]: specify puppet 7

https://gerrit.wikimedia.org/r/1117189

Change #1117189 abandoned by Andrew Bogott:

[operations/puppet@production] cloudgw100[34]: specify puppet 7

Reason:

It's set in the cloudgw role already.

https://gerrit.wikimedia.org/r/1117189

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudgw1002.eqiad.wmnet with OS bookworm completed:

  • cloudgw1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202502041317_andrew_1202782_cloudgw1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudgw1004.eqiad.wmnet with OS bullseye completed:

  • cloudgw1004 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202502041327_aborrero_1214295_cloudgw1004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Now the two hosts are: cloudgw1004 (active), cloudgw1001 (standby). 1002 is ready for decom, and 1003 is ready to be put into service once we're confident about 1004.

cloudgw1004 took over cloudgw1002 successfully, and is now sustaining high traffic normally.

We will continue with cloudgw1001 in a couple of weeks.

Change #1120548 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloudgw1003: replace cloudgw1001

https://gerrit.wikimedia.org/r/1120548

Change #1120548 abandoned by Andrew Bogott:

[operations/puppet@production] cloudgw1003: replace cloudgw1001

https://gerrit.wikimedia.org/r/1120548

Change #1114997 merged by Andrew Bogott:

[operations/puppet@production] cloudgw1003: take over cloudgw1001

https://gerrit.wikimedia.org/r/1114997

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudgw1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudgw1001.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudgw1003.eqiad.wmnet with OS bullseye executed with errors:

  • cloudgw1003 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • New OS is bookworm but bullseye was requested
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudgw1003.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudgw1001.eqiad.wmnet with OS bookworm completed:

  • cloudgw1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202502191210_andrew_1575642_cloudgw1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudgw1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudgw1003.eqiad.wmnet with OS bullseye completed:

  • cloudgw1003 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202502191250_andrew_1589102_cloudgw1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-cloud) [2025-02-19T13:26:48Z] <arturo> manual failover of cloudgw1004 to cloudgw1003 T382356
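For context, a manual failover like this one can be done by hand on the gateway pair; the sketch below assumes the active/standby roles are managed by keepalived/VRRP (not confirmed in this task) and that stopping the daemon on the active node is enough to shift the virtual IPs:

# On the active node (cloudgw1004): stop keepalived so its VRRP priority drops
# and the standby (cloudgw1003) takes over the virtual IPs
sudo systemctl stop keepalived

# On the new active node: confirm it transitioned to MASTER and holds the VIPs
sudo journalctl -u keepalived -n 20
ip -br addr show

# Once traffic is confirmed to flow through the new active node, start
# keepalived again on the old node so it rejoins as standby
sudo systemctl start keepalived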