
Reimage wikikube-worker1290 in eqiad as a replacement for wikikube-ctrl1001
Closed, ResolvedPublic

Description

The TL;DR is that when we reimaged mw1372 as wikikube-ctrl1001 (after the addition of a 10G card to satisfy the increased network bandwidth needs) we figured out that it was part of the mw1349-mw1413 batch that is meant to be decommissioned with T375842: decommission mw[1349-1413]. We'll be able to utilize it for as long as we want per {T375842#10183392} and followups, but procuring a replacement will take a while. In the meantime we might be able to just reimage a current wikikube-worker node from the batch that was meant to replace that hardware.

The new boxes already have 10G cards, so that wouldn't be an issue on the hardware side.

We'll need to find one of the 65 new boxes that resides in a rack in eqiad that has a spare 10G port, hence tagging netops.

Updates

We picked wikikube-worker1290 at random from the list in T379790#10318533

The 65 machines, with their per rack/row groupings, are listed below:

===== NODE GROUP =====                                                                                                                        
(5) wikikube-worker[1280-1283,1301].eqiad.wmnet                                                                                               
----- OUTPUT of 'lldpctl|grep SysName' -----                                                                                                  
    SysName:      lsw1-e6-eqiad.mgmt.eqiad.wmnet                                                                                              
===== NODE GROUP =====                                                                                                                        
(4) wikikube-worker[1276-1279].eqiad.wmnet                                                                                                    
----- OUTPUT of 'lldpctl|grep SysName' -----                                                                                                  
    SysName:      lsw1-e5-eqiad.mgmt.eqiad.wmnet                                                                                              
===== NODE GROUP =====                                                                                                                        
(5) wikikube-worker[1296-1300].eqiad.wmnet                                                                                                    
----- OUTPUT of 'lldpctl|grep SysName' -----                                                                                                  
    SysName:      lsw1-f7-eqiad.mgmt.eqiad.wmnet                                                                                              
===== NODE GROUP =====                                                                                                                        
(6) wikikube-worker[1292-1295,1303-1304].eqiad.wmnet                                                                                          
----- OUTPUT of 'lldpctl|grep SysName' -----                                                                                                  
    SysName:      lsw1-f6-eqiad.mgmt.eqiad.wmnet                                                                                              
===== NODE GROUP =====                                                                                                                        
(6) wikikube-worker[1270-1275].eqiad.wmnet                                                                                                    
----- OUTPUT of 'lldpctl|grep SysName' -----                                                                                                  
    SysName:      asw2-d-eqiad.mgmt.eqiad.wmnet                                                                                               
===== NODE GROUP =====                                                                                                                        
(4) wikikube-worker[1288-1291].eqiad.wmnet                                                                                                    
----- OUTPUT of 'lldpctl|grep SysName' -----                                                                                                  
    SysName:      lsw1-f5-eqiad.mgmt.eqiad.wmnet                                                                                              
===== NODE GROUP =====                                                                                                                        
(10) wikikube-worker[1250-1259].eqiad.wmnet                                                                                                   
----- OUTPUT of 'lldpctl|grep SysName' -----                                                                                                  
    SysName:      asw2-b-eqiad.mgmt.eqiad.wmnet                                                                                               
===== NODE GROUP =====                                                                                                                        
(5) wikikube-worker[1284-1287,1302].eqiad.wmnet                                                                                               
----- OUTPUT of 'lldpctl|grep SysName' -----                                                                                                  
    SysName:      lsw1-e7-eqiad.mgmt.eqiad.wmnet                                                                                              
===== NODE GROUP =====                                                                                                                        
(10) wikikube-worker[1260-1269].eqiad.wmnet                                                                                                   
----- OUTPUT of 'lldpctl|grep SysName' -----                                                                                                  
    SysName:      asw2-c-eqiad.mgmt.eqiad.wmnet                                                                                               
===== NODE GROUP =====                                                                                                                        
(10) wikikube-worker[1240-1249].eqiad.wmnet                                                                                                   
----- OUTPUT of 'lldpctl|grep SysName' -----                                                                                                  
    SysName:      asw2-a-eqiad.mgmt.eqiad.wmnet
  • wikikube-ctrl1002 is at asw2-b-eqiad.mgmt.eqiad.wmnet
  • wikikube-ctrl1003 is at asw2-c-eqiad.mgmt.eqiad.wmnet

We want to avoid having more than one control-plane node in the same network availability zone, which rules out the following two groups:

===== NODE GROUP =====                                                                                                                        
(10) wikikube-worker[1250-1259].eqiad.wmnet                                                                                                   
----- OUTPUT of 'lldpctl|grep SysName' -----                                                                                                  
    SysName:      asw2-b-eqiad.mgmt.eqiad.wmnet                                                                                                                                                                                 
===== NODE GROUP =====                                                                                                                        
(10) wikikube-worker[1260-1269].eqiad.wmnet                                                                                                   
----- OUTPUT of 'lldpctl|grep SysName' -----                                                                                                  
    SysName:      asw2-c-eqiad.mgmt.eqiad.wmnet

leaving

===== NODE GROUP =====                                                                                                                        
(5) wikikube-worker[1280-1283,1301].eqiad.wmnet                                                                                               
----- OUTPUT of 'lldpctl|grep SysName' -----                                                                                                  
    SysName:      lsw1-e6-eqiad.mgmt.eqiad.wmnet                                                                                              
===== NODE GROUP =====                                                                                                                        
(4) wikikube-worker[1276-1279].eqiad.wmnet                                                                                                    
----- OUTPUT of 'lldpctl|grep SysName' -----                                                                                                  
    SysName:      lsw1-e5-eqiad.mgmt.eqiad.wmnet                                                                                              
===== NODE GROUP =====                                                                                                                        
(5) wikikube-worker[1296-1300].eqiad.wmnet                                                                                                    
----- OUTPUT of 'lldpctl|grep SysName' -----                                                                                                  
    SysName:      lsw1-f7-eqiad.mgmt.eqiad.wmnet                                                                                              
===== NODE GROUP =====                                                                                                                        
(6) wikikube-worker[1292-1295,1303-1304].eqiad.wmnet                                                                                          
----- OUTPUT of 'lldpctl|grep SysName' -----                                                                                                  
    SysName:      lsw1-f6-eqiad.mgmt.eqiad.wmnet                                                                                              
===== NODE GROUP =====                                                                                                                        
(6) wikikube-worker[1270-1275].eqiad.wmnet                                                                                                    
----- OUTPUT of 'lldpctl|grep SysName' -----                                                                                                  
    SysName:      asw2-d-eqiad.mgmt.eqiad.wmnet                                                                                               
===== NODE GROUP =====                                                                                                                        
(4) wikikube-worker[1288-1291].eqiad.wmnet                                                                                                    
----- OUTPUT of 'lldpctl|grep SysName' -----                                                                                                  
    SysName:      lsw1-f5-eqiad.mgmt.eqiad.wmnet                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                        
(5) wikikube-worker[1284-1287,1302].eqiad.wmnet                                                                                               
----- OUTPUT of 'lldpctl|grep SysName' -----                                                                                                  
    SysName:      lsw1-e7-eqiad.mgmt.eqiad.wmnet                                                                                                                                                                                    
===== NODE GROUP =====                                                                                                                        
(10) wikikube-worker[1240-1249].eqiad.wmnet                                                                                                   
----- OUTPUT of 'lldpctl|grep SysName' -----                                                                                                  
    SysName:      asw2-a-eqiad.mgmt.eqiad.wmnet

If we can secure a 10G port for one of the machines above and cable it, we can probably reimage that host as wikikube-ctrl1004 and decom wikikube-ctrl1001.
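The elimination above can be sketched in a few lines of Python. This is only an illustration, not something that was run for this task: the `expand` and `candidates` helpers are hypothetical, and the sketch assumes that "network availability zone" can be approximated by the switch name reported via LLDP.

```python
import re

# Node groups as listed above: cumin-style host expression -> switch seen via LLDP.
NODE_GROUPS = {
    "wikikube-worker[1280-1283,1301]": "lsw1-e6-eqiad",
    "wikikube-worker[1276-1279]": "lsw1-e5-eqiad",
    "wikikube-worker[1296-1300]": "lsw1-f7-eqiad",
    "wikikube-worker[1292-1295,1303-1304]": "lsw1-f6-eqiad",
    "wikikube-worker[1270-1275]": "asw2-d-eqiad",
    "wikikube-worker[1288-1291]": "lsw1-f5-eqiad",
    "wikikube-worker[1250-1259]": "asw2-b-eqiad",
    "wikikube-worker[1284-1287,1302]": "lsw1-e7-eqiad",
    "wikikube-worker[1260-1269]": "asw2-c-eqiad",
    "wikikube-worker[1240-1249]": "asw2-a-eqiad",
}

# Switches that already host a control-plane node
# (wikikube-ctrl1002 and wikikube-ctrl1003).
# ASSUMPTION: one switch name stands in for one availability zone.
CTRL_SWITCHES = {"asw2-b-eqiad", "asw2-c-eqiad"}


def expand(expr):
    """Expand 'prefix[a-b,c]' into individual host names (hypothetical helper)."""
    match = re.fullmatch(r"(.+)\[(.+)\]", expr)
    prefix, ranges = match.group(1), match.group(2)
    hosts = []
    for part in ranges.split(","):
        start, _, end = part.partition("-")
        end = end or start  # a single number such as '1301' is a range of one
        hosts.extend(f"{prefix}{n}" for n in range(int(start), int(end) + 1))
    return hosts


def candidates(groups, excluded_switches):
    """Map every host to its switch, skipping groups on an excluded switch."""
    result = {}
    for expr, switch in groups.items():
        if switch in excluded_switches:
            continue
        for host in expand(expr):
            result[host] = switch
    return result


all_hosts = [host for expr in NODE_GROUPS for host in expand(expr)]
remaining = candidates(NODE_GROUPS, CTRL_SWITCHES)
print(len(all_hosts), len(remaining))
```

The expansion also serves as a sanity check on the listing: the ten groups add up to the 65 machines mentioned above, and dropping the two groups of ten leaves 45 candidates.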

Event Timeline

Restricted Application added a subscriber: Aklapper.

Polling Netbox to find which switch each of those is connected to, it appears none of the hosts in rows A-D are connected to a 10G-capable switch.

There are plenty of options in rows E-F however:

wikikube-worker1276  lsw1-e5-eqiad qfx5120-48y-afi2
wikikube-worker1277  lsw1-e5-eqiad qfx5120-48y-afi2
wikikube-worker1278  lsw1-e5-eqiad qfx5120-48y-afi2
wikikube-worker1279  lsw1-e5-eqiad qfx5120-48y-afi2
wikikube-worker1280  lsw1-e6-eqiad qfx5120-48y-afi2
wikikube-worker1281  lsw1-e6-eqiad qfx5120-48y-afi2
wikikube-worker1282  lsw1-e6-eqiad qfx5120-48y-afi2
wikikube-worker1283  lsw1-e6-eqiad qfx5120-48y-afi2
wikikube-worker1284  lsw1-e7-eqiad qfx5120-48y-afi2
wikikube-worker1285  lsw1-e7-eqiad qfx5120-48y-afi2
wikikube-worker1286  lsw1-e7-eqiad qfx5120-48y-afi2
wikikube-worker1287  lsw1-e7-eqiad qfx5120-48y-afi2
wikikube-worker1288  lsw1-f5-eqiad qfx5120-48y-afi2
wikikube-worker1289  lsw1-f5-eqiad qfx5120-48y-afi2
wikikube-worker1290  lsw1-f5-eqiad qfx5120-48y-afi2
wikikube-worker1291  lsw1-f5-eqiad qfx5120-48y-afi2
wikikube-worker1292  lsw1-f6-eqiad qfx5120-48y-afi2
wikikube-worker1293  lsw1-f6-eqiad qfx5120-48y-afi2
wikikube-worker1294  lsw1-f6-eqiad qfx5120-48y-afi2
wikikube-worker1295  lsw1-f6-eqiad qfx5120-48y-afi2
wikikube-worker1296  lsw1-f7-eqiad qfx5120-48y-afi2
wikikube-worker1297  lsw1-f7-eqiad qfx5120-48y-afi2
wikikube-worker1298  lsw1-f7-eqiad qfx5120-48y-afi2
wikikube-worker1299  lsw1-f7-eqiad qfx5120-48y-afi2
wikikube-worker1300  lsw1-f7-eqiad qfx5120-48y-afi2
wikikube-worker1301  lsw1-e6-eqiad qfx5120-48y-afi2
wikikube-worker1302  lsw1-e7-eqiad qfx5120-48y-afi2
wikikube-worker1303  lsw1-f6-eqiad qfx5120-48y-afi2
wikikube-worker1304  lsw1-f6-eqiad qfx5120-48y-afi2

The switches there are QFX5120s, which have a restriction that adjacent ports need to be at the same speed. So for any of them we will need to move from the existing switch port (in a block of 1G ports) to a new one that can do 10G. But I think all the switches in those rows have the capacity to do this, so it shouldn't be a problem.
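To illustrate the adjacent-port restriction, here is a rough Python sketch of the check one would do before picking a new port. The block size of 4 and the example port speeds are assumptions made for illustration only; they are not taken from this task or from the switch configuration, and the helper names are hypothetical.

```python
# ASSUMPTION: ports share a speed in fixed-size blocks of adjacent ports.
# The size of 4 is illustrative; check the switch documentation.
BLOCK_SIZE = 4


def port_block(port, block_size=BLOCK_SIZE):
    """Return the port numbers that share a speed block with `port`."""
    start = (port // block_size) * block_size
    return list(range(start, start + block_size))


def can_use_at_speed(port, wanted_speed, port_speeds, block_size=BLOCK_SIZE):
    """True if no other in-use port in the same block runs at a different speed.

    `port_speeds` maps in-use port numbers to their speed; free ports are absent.
    """
    return all(
        port_speeds.get(p) in (None, wanted_speed)
        for p in port_block(port, block_size)
        if p != port
    )


# Hypothetical switch state: two 1G ports in one block, one 10G port in another.
speeds = {12: "1G", 13: "1G", 24: "10G"}

print(port_block(25))                      # ports sharing a block with port 25
print(can_use_at_speed(14, "10G", speeds))  # neighbours are 1G
print(can_use_at_speed(25, "10G", speeds))  # neighbour is already 10G
```

Under these assumptions a free port next to 1G ports cannot simply be switched to 10G, which is why the host has to move to a port in a block that is already (or can become) 10G.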

We likely can use the "move server" netbox script to make the changes there, but it was designed for things that were also moving rack, so perhaps if you want to let me know which ones you want to do I can test it and see how it works.

Cool, thanks. In that case, I randomly picked wikikube-worker1290 and let's take it from there.


Ok cool. I guess step one is if you can depool it so it's ready to be reimaged. Is the plan to rename it to wikikube-ctrl1xxx?

There are a bunch of free ports on the switch anyway. Once it's de-pooled we should be able to rename and adjust netbox to terminate it on xe-0/0/25 of lsw1-f5-eqiad, after which we can ping dc-ops to disconnect from the current port and wire the 10G nic up.

cmooney triaged this task as Medium priority. Nov 18 2024, 3:51 PM

depool host wikikube-worker1290.eqiad.wmnet by akosiaris@cumin1002 with reason: To be reimaged as wikikube-ctrl1004

Cookbook cookbooks.sre.k8s.pool-depool-node started by akosiaris@cumin1002 depool for host wikikube-worker1290.eqiad.wmnet completed:

  • wikikube-worker1290.eqiad.wmnet (PASS)
    • Host wikikube-worker1290.eqiad.wmnet depooled from wikikube-eqiad

Change #1092840 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] wikikube: Add wikikube-ctrl1004

https://gerrit.wikimedia.org/r/1092840


Ok cool. I guess step one is if you can depool it so it's ready to be reimaged.

{{Done}}

Is the plan to rename it to wikikube-ctrl1xxx?

Yes. wikikube-ctrl1004. Patch uploaded already.

There are a bunch of free ports on the switch anyway. Once it's de-pooled we should be able to rename and adjust netbox to terminate it on xe-0/0/25 of lsw1-f5-eqiad, after which we can ping dc-ops to disconnect from the current port and wire the 10G nic up.

SGTM

cmooney added subscribers: VRiley-WMF, Jclark-ctr.

Ok. So I've tested the "move server" script for this host and it worked as expected. Some values need to be filled in even though they are not changing, such as the switch and the rack unit. For the "switch interface" I just picked a free number within a block of 10G ports, based on what was already in use.

{F57721947 width=500}

With this done I downtimed the host for 4 days to give us some time to do the rest:

sudo cookbook sre.hosts.downtime --days 4 -r "being moved to new port" wikikube-worker1290.eqiad.wmnet

I then set the 'bgp' flag for the host to False in Netbox and ran Homer against lsw1-f5-eqiad (see P71101 for diff)

The reconfiguration of the switch means the host is now unreachable (hence the downtime). Next steps are:

  1. @Jclark-ctr @VRiley-WMF could one of you, next time you're in the DC, remove the 1G link from wikikube-worker1290 in rack F5, and instead connect the first port on its 10G NIC with a 10G DAC cable to port 25 on the same switch in that rack?
  2. Ops-eqiad we also need to relabel the server to wikikube-ctrl1004
  3. @akosiaris I think at any point you can run the sre.hosts.rename cookbook, and merge any puppet patches etc
  4. Finally I think we can kick off a reimage for the newly-renamed server, to properly install it with the new NIC enabled and in its new role.
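For reference, the sequence so far and the remaining steps can be written down as an ordered plan. The Python sketch below only builds and prints the plan; it does not run anything. The downtime command is the one quoted above. The argument layout for `sre.hosts.rename` and the `--os` flag and short hostname for `sre.hosts.reimage` are assumptions (only `--new` and the OS name bookworm are mentioned in this task's cookbook output), so check each cookbook's `--help` before use.

```python
import shlex

OLD = "wikikube-worker1290"
NEW = "wikikube-ctrl1004"
DOMAIN = "eqiad.wmnet"


def plan(old, new, domain):
    """Return ordered (actor, action) steps. Command strings are illustrative."""
    # Taken from the comment above.
    downtime = ["sudo", "cookbook", "sre.hosts.downtime", "--days", "4",
                "-r", "being moved to new port", f"{old}.{domain}"]
    # ASSUMPTION: old and new names are passed as positional arguments.
    rename = ["sudo", "cookbook", "sre.hosts.rename", old, new]
    # ASSUMPTION: --os selects the OS and the short hostname is expected;
    # --new is the flag the rename cookbook asks for in its output.
    reimage = ["sudo", "cookbook", "sre.hosts.reimage",
               "--os", "bookworm", "--new", new]
    return [
        ("netops", shlex.join(downtime)),
        ("dc-ops", f"recable {old}: remove 1G link, connect 10G NIC to switch port 25"),
        ("dc-ops", f"relabel {old} as {new}"),
        ("service-ops", shlex.join(rename)),
        ("service-ops", shlex.join(reimage)),
    ]


steps = plan(OLD, NEW, DOMAIN)
for number, (actor, action) in enumerate(steps, start=1):
    print(f"{number}. [{actor}] {action}")
```

Keeping the plan as data makes the ordering explicit: the physical recabling has to happen before the reimage, since the host is unreachable on its old port once the switch has been reconfigured.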

A 10G connection has been placed into port 25 and connected to wikikube-worker1290.


Thanks Valerie!

Service-ops you should be good to go with the rename & reimage of this one I think.

Cool thanks, I'll take over this one.

JMeybohm renamed this task from Reimage one of the wikikube-worker1240 to wikikube-worker1304 node in eqiad as a replacement for wikikube-ctrl1001 to Reimage wikikube-worker1290 in eqiad as a replacement for wikikube-ctrl1001. Dec 18 2024, 10:19 AM

depool host wikikube-worker1290.eqiad.wmnet by akosiaris@cumin1002 with reason: To be reimaged as wikikube-ctrl1004

Cookbook cookbooks.sre.k8s.pool-depool-node started by akosiaris@cumin1002 depool for host wikikube-worker1290.eqiad.wmnet completed:

  • wikikube-worker1290.eqiad.wmnet (PASS)
    • Host wikikube-worker1290.eqiad.wmnet depooled from wikikube-eqiad

Cookbook cookbooks.sre.hosts.rename started by akosiaris@cumin1002 from wikikube-worker1290 to wikikube-ctrl1004 completed:

  • wikikube-worker1290 (PASS)
    • ✔️ Downtimed host on Icinga/Alertmanager
    • ✔️ Disabled puppet and its timer
    • ✔️ Disabled debmonitor-client timer
    • ✔️ Netbox updated
    • ✔️ BMC Hostname updated
    • ✔️ DNS updated
    • ✔️ Switch description updated
    • ✔️ Removed from DebMonitor
    • ✔️ Removed from Puppet master and PuppetDB
    • Rename completed 👍 - now please run the re-image cookbook on the new name with --new

Change #1092840 merged by Alexandros Kosiaris:

[operations/puppet@production] wikikube: Add wikikube-ctrl1004

https://gerrit.wikimedia.org/r/1092840

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host wikikube-ctrl1004.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host wikikube-ctrl1004.eqiad.wmnet with OS bookworm executed with errors:

  • wikikube-ctrl1004 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wikikube-ctrl1004.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host wikikube-ctrl1004.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host wikikube-ctrl1004.eqiad.wmnet with OS bookworm executed with errors:

  • wikikube-ctrl1004 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wikikube-ctrl1004.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

I've had to enable PXE boot on the 10G card in the BIOS to get the server to PXE, proceeding.

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host wikikube-ctrl1004.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host wikikube-ctrl1004.eqiad.wmnet with OS bookworm executed with errors:

  • wikikube-ctrl1004 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wikikube-ctrl1004.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host wikikube-ctrl1004.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host wikikube-ctrl1004.eqiad.wmnet with OS bookworm executed with errors:

  • wikikube-ctrl1004 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wikikube-ctrl1004.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host wikikube-ctrl1004.eqiad.wmnet with OS bookworm

Change #1106320 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/dns@master] Add wikikube-ctrl1004 to etcd SRV records

https://gerrit.wikimedia.org/r/1106320

Change #1106320 merged by Alexandros Kosiaris:

[operations/dns@master] Add wikikube-ctrl1004 to etcd SRV records

https://gerrit.wikimedia.org/r/1106320

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host wikikube-ctrl1004.eqiad.wmnet with OS bookworm completed:

  • wikikube-ctrl1004 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202412231202_akosiaris_1810678_wikikube-ctrl1004.out, asking the operator what to do
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202412231342_akosiaris_1810678_wikikube-ctrl1004.out, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202412231345_akosiaris_1810678_wikikube-ctrl1004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1106322 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] Add wikikube-ctrl1004.eqiad.wmnet to cluster nodes

https://gerrit.wikimedia.org/r/1106322

Change #1106322 merged by Alexandros Kosiaris:

[operations/puppet@production] Add wikikube-ctrl1004.eqiad.wmnet to cluster nodes

https://gerrit.wikimedia.org/r/1106322

akosiaris claimed this task.

Box reimaged, BGP set up, calico double-checked. Resolving. I'll update the decom task T375842: decommission mw[1349-1413] tomorrow.