Decom asw-a-codfw switch stack
Closed, Resolved (Public)

Description

Now that all servers have been moved from the old asw-aX-codfw devices in row A to the independent lsw1-aX-codfw switches, we can begin decommissioning the old switches.

I'm not 100% sure of all the steps we need here; this list is based somewhat on T218734: Decommission asw-a-eqiad.

  • Disable asw-a-codfw <-> ssw1-a1-codfw link [netops]
  • Disable asw-a-codfw <-> ssw1-a8-codfw link [netops]
  • Remove asw-a-codfw <-> ssw1-a1-codfw cable and optics [dc-ops]
  • Remove asw-a-codfw <-> ssw1-a8-codfw cable [dc-ops]
  • Set to decommissioning status in netbox and remove from Homer [netops]
  • Remove from monitoring (LibreNMS/Rancid/Icinga) [netops]
  • Connect console cables to old devices so netops can deprovision over serial [dc-ops]
  • Wipe and power down devices [netops]
  • Disconnect mgmt ports and console ports [dc-ops]
  • Delete mgmt IPs in Netbox [netops]
  • Update / check all console server connections are correct in Netbox [dc-ops]
  • Power down and unrack asw-a-codfw members [dc-ops]
  • Update device status in Netbox [dc-ops]

@ayounsi I'll probably need your advice on how to wipe the devices. I'm guessing we want to untangle the VC first, then do a request system zeroize?

Event Timeline

cmooney triaged this task as Medium priority. Feb 22 2024, 4:32 PM
cmooney created this task.

All interfaces on asw-a-codfw are set to 'disabled' apart from the uplinks to the SSWs, and no MACs are learnt on the SSW side, so I'm proceeding to delete those links in Netbox.

cmooney@ssw1-a1-codfw> show ethernet-switching table interface ae0    

MAC database for interface ae0

MAC database for interface ae0.0

{master:0}
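
For anyone wanting to script the same "no MACs learnt" check before removing a link, here is a minimal Python sketch. It only assumes that learnt entries would show up as colon-separated MAC addresses somewhere in the output; the function name `learnt_macs` is made up for illustration, and the ticket does not show what a populated table looks like.

```python
import re

# Matches colon-separated MAC addresses such as 00:11:22:33:44:55.
# Assumption: learnt entries appear in this form in the CLI output.
MAC_RE = re.compile(r"\b(?:[0-9a-f]{2}:){5}[0-9a-f]{2}\b", re.IGNORECASE)


def learnt_macs(cli_output: str) -> list[str]:
    """Return every MAC-address-shaped token found in the CLI output."""
    return MAC_RE.findall(cli_output)


# Output copied from the ssw1-a1-codfw check above.
SAMPLE = """\
MAC database for interface ae0

MAC database for interface ae0.0

{master:0}
"""

if __name__ == "__main__":
    macs = learnt_macs(SAMPLE)
    print("no MACs learnt" if not macs else f"still learning: {macs}")
```

An empty result is only a hint that nothing is using the link; the interface state on the asw side still needs checking as described above.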

Change 1005799 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Remove definition/config for codfw ssw's ESI-LAG to asw-a-codfw

https://gerrit.wikimedia.org/r/1005799

OK, I've now removed the configuration for the ESI-LAG between the codfw spine switches and asw-a-codfw on both sides.

DC-Ops you can go ahead and remove the cables and optics. For reference these are the two cable runs (now disconnected in netbox):

image.png (442×497 px, 30 KB)

image.png (442×497 px, 30 KB)

Change 1005799 merged by jenkins-bot:

[operations/homer/public@master] Remove definition/config for codfw ssw's ESI-LAG to asw-a-codfw

https://gerrit.wikimedia.org/r/1005799

Setting to "decommissioning" will cause automation to remove the mgmt DNS record.

I suggest that we do it this way to avoid back and forth between teams:

  • Netops Steps:
    1. remove it from monitoring
    2. run the zeroize command over mgmt
    3. Update Netbox (set it to decommissioning, remove everything that the wipe removed)
    4. run the dns and hiera cookbook
    5. remove the last bits from Homer
    6. Hand it over to DCops
  • DCops
    1. Remove the cables (inc. power)
    2. Update Netbox (eg. remove the cables)
    3. Update / check all console server connections are correct in Netbox [dc-ops]
    4. Verify no Netbox alerts
    5. Optionally unrack it at your leisure
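
To put a number on the "back and forth", here is a rough Python sketch that counts how often ownership changes between consecutive steps. The owner lists are transcribed from the [netops]/[dc-ops] tags in the task description checklist and from the grouping proposed above; `count_handoffs` is a hypothetical helper, not part of any existing tooling.

```python
def count_handoffs(owners: list[str]) -> int:
    """Count how often consecutive steps change owning team."""
    return sum(1 for prev, cur in zip(owners, owners[1:]) if prev != cur)


# Owner tags in the order they appear in the task description checklist.
ORIGINAL = [
    "netops", "netops", "dc-ops", "dc-ops", "netops", "netops", "dc-ops",
    "netops", "dc-ops", "netops", "dc-ops", "dc-ops", "dc-ops",
]

# Proposed grouping: six netops steps, then five DC-ops steps.
PROPOSED = ["netops"] * 6 + ["dc-ops"] * 5

if __name__ == "__main__":
    print("original checklist hand-offs:", count_handoffs(ORIGINAL))
    print("proposed grouping hand-offs:", count_handoffs(PROPOSED))
```

With the lists as transcribed, the original ordering has 7 hand-offs and the proposed grouping has 1.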

After a quick discussion on IRC, I don't think we can wipe the config for every unit in the VC over SSH to the master, so it's probably easiest to do that via serial console.

@Jhancock.wm would it be possible to connect the serial console in racks A1-A8 to the old switch in each rack? As far as I'm aware they were moved to the lsw's to help us get those installed, but we need to move back now to wipe the old ones.

When we're done we can move them back to the lsw's when removing the old switches from the rack, and double-check Netbox shows that's where they land (plus make any port description changes on the opengear console itself).

In the meantime I will work on removing the device(s) from management/rancid etc. etc.

@cmooney they're on the old asw switches. Let me know when you want to move them back to the new lsw.

FYI it's alerting for one of its PSUs being down, but we don't really care anymore:

asw-a-codfw> show system alarms
1 alarms currently active
Alarm time Class Description
2024-03-16 09:20:23 UTC Major FPC 6 PEM 1 is not powered

I downtimed the stack in Icinga/LibreNMS for 1 month
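
If it helps to triage alarms like this across devices, the output above can be parsed with a few lines of Python. This is a sketch that assumes each alarm row has the shape "date time timezone class description", as in the single row pasted here; `Alarm` and `parse_alarms` are illustrative names only.

```python
import re
from typing import NamedTuple


class Alarm(NamedTuple):
    time: str
    alarm_class: str
    description: str


# One alarm row: "YYYY-MM-DD HH:MM:SS TZ Class Description..."
# Assumption: every row follows the layout of the sample below.
ROW_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \S+)\s+(\S+)\s+(.+)$")


def parse_alarms(cli_output: str) -> list[Alarm]:
    """Extract alarm rows, skipping the count and header lines."""
    alarms = []
    for line in cli_output.splitlines():
        match = ROW_RE.match(line.strip())
        if match:
            alarms.append(Alarm(*match.groups()))
    return alarms


# Output copied from the asw-a-codfw check above.
SAMPLE = """\
1 alarms currently active
Alarm time Class Description
2024-03-16 09:20:23 UTC Major FPC 6 PEM 1 is not powered
"""

if __name__ == "__main__":
    for alarm in parse_alarms(SAMPLE):
        print(f"{alarm.alarm_class}: {alarm.description} (since {alarm.time})")
```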

Change 1012402 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Remove asw-a-codfw from monitoring

https://gerrit.wikimedia.org/r/1012402

Change 1012402 merged by Papaul:

[operations/puppet@production] Remove asw-a-codfw from monitoring

https://gerrit.wikimedia.org/r/1012402

Papaul updated the task description.

Zeroize done on asw-a1.
Steps:

  • Delete the member from the master
  • Disconnect both cables going to asw-a2 and asw-a7
  • While logged in to the console, run the zeroize command

Zeroize done on asw-a3 and asw-a4

Zeroize done on all the old switches in row A

Change 1012705 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/homer/public@master] Remove asw-a from homer

https://gerrit.wikimedia.org/r/1012705

Change 1012705 merged by Papaul:

[operations/homer/public@master] Remove asw-a from homer

https://gerrit.wikimedia.org/r/1012705

Removed all old cables and unracked 4 switches out of 8

Thanks. I didn't realise @Papaul had a plan of action already, I'd paused any work on this as requested pending your return.

Not to worry, I'll tidy up the old bits today.

Papaul claimed this task.
Papaul updated the task description.

complete