
Migrate servers in codfw racks D1 & D2 from asw to lsw
Closed, Resolved · Public

Description

Currently scheduled for Thurs Sept 12th 2024 16:00 UTC

As part of the scheduled refresh of switch equipment in codfw rows C and D, we need to move the network connections for servers in racks D1 and D2 from the old switches to the new ones.

Hosts in these racks are managed by the following teams:

Collaboration Services
Core Platform
Data Persistence
Data Platform
Infrastructure Foundations
Machine Learning
Observability
Search Platform
ServiceOps
Traffic

A full list of the specific hosts can be found below. We will use the sheet to plan the moves and co-ordinate with other SRE teams on actions required to ensure things go smoothly:

https://docs.google.com/spreadsheets/d/16xoZuDeC_-o6s70uEMnvdgn4BlT1f8__WPYprRuduIA#gid=597577091

Server links will be moved one by one from the old switch to the new one, so no two hosts will be offline at the same time.

Based on previous experience, each host is likely to lose connectivity for only ~10 seconds. However, it is inevitable that a small number of the new cables will not work, or that there will be some minor glitch in the move, so in an edge case a host may be offline for 2-3 minutes. On previous occasions this happened to roughly 1 in 20 hosts.

Event Timeline

cmooney triaged this task as Medium priority.
Restricted Application added a subscriber: Aklapper.

The server lists2001, mentioned here under Collaboration Services, is a standby and therefore OK to do at any time.

These racks have the following Swift/Ceph nodes:

  • ms-fe2012 moss-fe2002 thanos-fe2003 (need depool beforehand / pool afterwards)
  • ms-be2061 ms-be2065 ms-be2069 ms-be2079 (need replication / dispersion check afterwards)

I'm on leave 9-13 September; are you OK to handle these, please, @Eevans? moss-fe2002 is an apus node, but the procedure (ssh in and run sudo depool before, sudo pool after) is the same as for the other frontends.
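The frontend depool/pool step above ("ssh in and do sudo depool before, sudo pool after") can be sketched as a small wrapper. This is a minimal illustration, not an existing tool: the hostnames come from this task, but the wrapper and its dry_run flag are assumptions for the example.

```python
# Illustrative sketch of the frontend depool/pool procedure described above.
# dry_run=True only prints the command; set it to False to actually run it.
import subprocess

FRONTENDS = ["ms-fe2012", "moss-fe2002", "thanos-fe2003"]

def set_pooled(host, pooled, dry_run=True):
    """Run (or print) 'ssh HOST sudo pool' / 'ssh HOST sudo depool'."""
    action = "pool" if pooled else "depool"
    cmd = ["ssh", f"{host}.codfw.wmnet", "sudo", action]
    if dry_run:
        print("would run:", " ".join(cmd))
    else:
        subprocess.run(cmd, check=True)
    return cmd

if __name__ == "__main__":
    for host in FRONTENDS:
        set_pooled(host, pooled=False)  # depool before the switch move
    # ... switch move happens here ...
    for host in FRONTENDS:
        set_pooled(host, pooled=True)   # pool again afterwards
```

In dry-run mode the wrapper just shows the commands, which makes the per-host sequence easy to review before the maintenance window.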

Yes, I can take it.

I want to stop the ms backups at codfw for backup2011 before this happens. No big deal if I don't (some backups will just be marked as failed and probably retried later), but that way we avoid extra failures.

ES replication source in the path has been moved (T374592), all remaining hosts are depoolable

Mentioned in SAL (#wikimedia-operations) [2024-09-12T15:26:21Z] <sukhe@puppetmaster1001> conftool action : set/pooled=no; selector: name=dns2006.wikimedia.org [reason: T373102 codfw maintenance]

I've stopped codfw media backups.

@cmooney Would it be possible to get preferential treatment in the maintenance window for ms-backup2002 and backup2011 🥺? Not ahead of any of my workmates who wrote above or reached out to you, but ahead of the other hosts that require no further attention? That way I can restart the backups ASAP before the end of my day and forget about them while your team keeps working through the rest of the hosts, minimizing the disruption time.

Mentioned in SAL (#wikimedia-operations) [2024-09-12T15:29:05Z] <claime> Depooling kubernetes2044.codfw.wmnet kubernetes2045.codfw.wmnet - T373102

Mentioned in SAL (#wikimedia-operations) [2024-09-12T15:42:39Z] <urandom> depooling ms-fe2012 moss-fe2002 & thanos-fe2003 — T373102

Icinga downtime and Alertmanager silence (ID=bb570977-8737-4373-95ac-3765685f6e5e) set by cmooney@cumin1002 for 0:40:00 on 21 host(s) and their services with reason: Move server uplinks codfw racks D1

db[2128,2139,2151,2170-2171,2211-2212].codfw.wmnet,dns2006.wikimedia.org,es[2033-2034,2039].codfw.wmnet,ganeti[2015,2025].codfw.wmnet,kafka-main2009.codfw.wmnet,kubernetes[2044-2045].codfw.wmnet,lists2001.wikimedia.org,ml-staging2002.codfw.wmnet,pc2014.codfw.wmnet,sessionstore2006.codfw.wmnet,wcqs2003.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=5073d83c-c18b-41a0-aa78-a6da63b209f9) set by cmooney@cumin1002 for 0:40:00 on 21 host(s) and their services with reason: Move server uplinks codfw racks D2

backup[2001,2011].codfw.wmnet,cephosd2003.codfw.wmnet,db[2236-2237].codfw.wmnet,elastic[2104-2105].codfw.wmnet,ganeti2041.codfw.wmnet,logging-hd2003.codfw.wmnet,lvs2014.codfw.wmnet,ml-serve2011.codfw.wmnet,moss-fe2002.codfw.wmnet,ms-backup2002.codfw.wmnet,ms-be[2061,2065,2069,2079].codfw.wmnet,ms-fe2012.codfw.wmnet,thanos-fe2003.codfw.wmnet,wdqs[2015,2025].codfw.wmnet
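The downtime host lists above use a compact range syntax (e.g. db[2236-2237].codfw.wmnet expands to db2236 and db2237). As a rough sketch of how such an expression maps to individual hostnames, here is a simplified expander; it handles a single bracket group per name and is an illustration, not the actual tooling's parser.

```python
# Simplified expander for compact host expressions like
# "ms-be[2061,2065,2069,2079].codfw.wmnet".
# Sketch only: assumes at most one [...] group per expression.
import re

def expand(expr):
    m = re.match(r"^([^\[]*)\[([^\]]*)\](.*)$", expr)
    if not m:
        return [expr]  # no bracket group: already a single hostname
    prefix, body, suffix = m.groups()
    hosts = []
    for part in body.split(","):
        if "-" in part:
            # a range like "2170-2171": expand numerically, keeping zero-padding
            lo, hi = part.split("-")
            width = len(lo)
            for n in range(int(lo), int(hi) + 1):
                hosts.append(f"{prefix}{n:0{width}d}{suffix}")
        else:
            hosts.append(f"{prefix}{part}{suffix}")
    return hosts

print(expand("db[2236-2237].codfw.wmnet"))
```

Expanding the D2 list this way gives the 21 hosts the downtime was set on.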

Mentioned in SAL (#wikimedia-operations) [2024-09-12T16:01:26Z] <topranks> move server uplinks in codfw rack D1 from asw-d1-codfw to lsw1-d1-codfw T373102

Mentioned in SAL (#wikimedia-operations) [2024-09-12T16:18:11Z] <sukhe@puppetmaster1001> conftool action : set/pooled=yes; selector: name=dns2006.wikimedia.org [reason: [end] T373102 codfw maintenance]

Everything moved successfully, all ports up on the new switch and everything responding to ping again.

Mentioned in SAL (#wikimedia-operations) [2024-09-12T16:31:05Z] <claime> Repooling kubernetes2044.codfw.wmnet kubernetes2045.codfw.wmnet - T373102

Mentioned in SAL (#wikimedia-operations) [2024-09-12T16:32:16Z] <urandom> pooling ms-fe2012 moss-fe2002 & thanos-fe2003 — T373102

Mentioned in SAL (#wikimedia-operations) [2024-09-12T16:36:22Z] <topranks> disable ports for now unused ports on asw-d1-codfw and asw-d2-codfw T373102

cmooney claimed this task.