
Switch buffer re-partition - Eqiad Row B
Closed, ResolvedPublic

Description

Planned for Tues July 27th at 15:00 UTC (08:00 PDT / 11:00 EDT / 17:00 CEST)

Completed on schedule, no issues to report.

Netops plan to adjust the buffer memory configuration on all switches in Eqiad Row B, to address tail drops observed on some of the devices, which are causing throughput issues.

This is an intrusive change: the Juniper documentation says it will bring all traffic on the row to a complete stop while the switches reconfigure themselves. Experience from the equivalent change to Row D indicates that this interruption is very brief (no ping loss was observed), so the expected interruption is less than 1 second, with no interface state changes or similar. As always there is a small chance that something goes wrong (e.g. we hit a bug) and networking on the row is disrupted for longer. TL;DR: the interruption should be short enough that nobody notices, but the row should be considered "at risk".

Service owners may want to take some pro-active steps in advance to fail over / depool systems, to mitigate this elevated risk.

The complete list of servers in this row can be found here:

https://netbox.wikimedia.org/dcim/devices/?q=&rack_group_id=6&status=active&role=server
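
For anyone scripting checks, roughly the same list can be pulled from the Netbox API. A minimal sketch, assuming a read-only API token and that the rack_group_id=6 filter from the UI URL behaves the same way against the API:

```
# Hypothetical sketch only: list active servers in the Eqiad row B rack group via the Netbox API.
# Assumes a read-only token in $NETBOX_TOKEN and that rack_group_id=6 maps to row B as in the UI URL above.
curl -s -H "Authorization: Token ${NETBOX_TOKEN}" \
  'https://netbox.wikimedia.org/api/dcim/devices/?rack_group_id=6&status=active&role=server&limit=500' \
  | jq -r '.results[].name' | sort
```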

Summarizing the physical servers / types:

| Server Name / Prefix | Count | Relevant Team | Action Required | Action Status |
| --- | --- | --- | --- | --- |
| mw | 39 | Service Operations | N | N/A |
| db | 21 | Data Persistence | N; consider failing over db1132 to db1117 if the maintenance takes longer than expected | N/A |
| an-worker | 14 | Analytics SREs | N | N/A |
| cloudvirt | 12 | WMCS | N | N/A |
| elastic | 9 | Search Platform SREs | N | N/A |
| ms-be | 9 | Data Persistence (Media Storage) | N | N/A |
| ganeti | 6 | Infrastructure Foundations | N | N/A |
| analytics | 5 | Analytics SREs | N | N/A |
| wtp | 5 | Service Operations | N | N/A |
| cp | 4 | Traffic | Depool the individual hosts with the depool command (sketch below) | Complete |
| es | 4 | Data Persistence | N | N/A |
| restbase | 4 | Service Operations | N | N/A |
| cloudcephmon | 3 | WMCS | N | N/A |
| cloudcephosd | 3 | WMCS | N | N/A |
| cloudvirt-wdqs | 3 | WMCS | N | N/A |
| kubernetes | 3 | Service Operations | N | N/A |
| wdqs | 3 | Search Platform SREs | N | N/A |
| clouddb | 2 | WMCS, with support from DBAs | N | N/A |
| cloudelastic | 2 | WMCS / Search Platform SREs | N | N/A |
| cloudnet | 2 | WMCS | N | N/A |
| dbproxy | 2 | Data Persistence | dbproxy1014 requires action, dbproxy1015 doesn't | Done |
| druid | 2 | Analytics | N | N/A |
| mc | 2 | Service Operations | N | N/A |
| ores | 2 | Machine Learning SREs | N | N/A |
| snapshot | 2 | Service Operations & Platform Engineering | N | N/A |
| thumbor | 2 | Service Operations (& Performance) | N | N/A |
| an-conf1001 | 1 | Analytics | N | N/A |
| an-coord1001 | 1 | Analytics SREs | N | N/A |
| an-launcher1002 | 1 | Analytics SREs | N | N/A |
| an-master1002 | 1 | Analytics SREs | N | N/A |
| an-presto1004 | 1 | Analytics SREs | N | N/A |
| aqs1008 | 1 | Analytics SREs | N | N/A |
| atlas-eqiad | 1 | Infrastructure Foundations | N | N/A |
| authdns1001 | 1 | Traffic | Manual depool needed (sketch below) | Complete |
| backup1003 | 1 | Data Persistence | N | N/A |
| cloudcontrol1004 | 1 | WMCS | N | N/A |
| cloudservices1003 | 1 | WMCS | N | N/A |
| conf1005 | 1 | Service Operations | N | N/A |
| dbprov1002 | 1 | Data Persistence | N | N/A |
| dumpsdata1001 | 1 | Service Operations & Platform Engineering | N | N/A |
| gerrit1001 | 1 | Service Operations (Supportive Services) & Release Engineering | N | N/A |
| graphite1004 | 1 | Observability | N | N/A |
| kafka-jumbo1003 | 1 | Analytics SREs | N | N/A |
| kafka-main1002 | 1 | Analytics | | |
| kubestage1002 | 1 | Service Operations | N | N/A |
| labweb1001 | 1 | WMCS | N | N/A |
| logstash1011 | 1 | Observability | N | |
| lvs1014 | 1 | Traffic | Failover to secondary (lvs1016 in row D) by stopping pybal with puppet disabled (sketch below) | Complete |
| maps1002 | 1 | ?? | N | N/A |
| mwmaint1002 | 1 | Service Operations | N | N/A |
| pc1008 | 1 | SRE Data Persistence (DBAs), with support from Platform and Performance | N | N/A |
| prometheus1004 | 1 | Observability | N | N/A |
| puppetmaster1001 | 1 | Infrastructure Foundations | Disable puppet fleet-wide (sketch below) | Completed and reverted |
| rdb1009 | 1 | Service Operations | N | N/A |
| relforge1004 | 1 | Search Platform SREs | N | N/A |
| restbase-dev1005 | 1 | Platform Engineering ?? | N | N/A |
| stat1007 | 1 | Analytics SREs | N | N/A |
| thanos-be1002 | 1 | Observability | N | N/A |
| thanos-fe1002 | 1 | Observability | N | N/A |
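
For reference, a rough sketch of the actions flagged in the table for the cp/authdns hosts (depool), lvs1014 (stop pybal with puppet disabled) and puppetmaster1001 (fleet-wide puppet disable). This is illustrative only and assumes the usual host-local wrappers (depool/pool, disable-puppet/enable-puppet) and cumin aliases; follow the relevant runbooks for the real thing:

```
# Sketch only; assumes standard WMF wrapper scripts and cumin aliases, not an authoritative runbook.

# cp and authdns hosts: run the depool wrapper on each host before the window, pool afterwards.
sudo depool
# ... maintenance window ...
sudo pool

# lvs1014: keep puppet from restarting pybal, then stop it so traffic fails over to lvs1016 (row D).
sudo disable-puppet "eqiad row B switch maintenance - T286061"
sudo systemctl stop pybal
# revert once the row is confirmed healthy
sudo systemctl start pybal
sudo enable-puppet "eqiad row B switch maintenance - T286061"

# puppetmaster1001: disable puppet fleet-wide from a cumin host, then re-enable after the change.
sudo cumin 'A:all' 'disable-puppet "eqiad row B switch maintenance - T286061"'
sudo cumin 'A:all' 'enable-puppet "eqiad row B switch maintenance - T286061"'
```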

VMs in this row are as follows:

| VM Name | Ganeti Host | Team | Action Required | Action Status |
| --- | --- | --- | --- | --- |
| an-airflow1002 | ganeti1018 | Analytics SREs | N | N/A |
| an-test-ui1001 | ganeti1015 | Analytics | N | N/A |
| an-tool1008 | ganeti1015 | Analytics SREs | N | N/A |
| an-tool1009 | ganeti1016 | Analytics SREs | N | N/A |
| debmonitor1002 | ganeti1014 | Infrastructure Foundations | N | N/A |
| dragonfly-supernode1001 | ganeti1017 | Service Operations | N | N/A |
| failoid1002 | ganeti1018 | Infrastructure Foundations | N | N/A |
| kafka-test1006 | ganeti1014 | Analytics ? | N | N/A |
| kafka-test1007 | ganeti1013 | Analytics ? | N | N/A |
| kafka-test1008 | ganeti1016 | Analytics ? | N | N/A |
| kafka-test1009 | ganeti1013 | Analytics ? | N | N/A |
| kafka-test1010 | ganeti1015 | Analytics ? | N | N/A |
| kubernetes1015 | ganeti1015 | Service Operations | N | N/A |
| kubestagemaster1001 | ganeti1014 | Service Operations | N | N/A |
| kubestagetcd1005 | ganeti1013 | Service Operations | N | N/A |
| kubetcd1006 | ganeti1013 | Service Operations | N | N/A |
| ldap-replica1003 | ganeti1017 | Infrastructure Foundations | Y | Complete |
| logstash1032 | ganeti1013 | Observability | N | N/A |
| ml-etcd1001 | ganeti1014 | ML team | N | N/A |
| ml-serve-ctrl1001 | ganeti1017 | ML team ?? | N | N/A |
| otrs1001 | ganeti1014 | Service Operations ?? | N | N/A |
| schema1003 | ganeti1015 | Analytics SREs & Service Operations | N | N/A |
| zookeeper-test1002 | ganeti1013 | Analytics ?? | N | N/A |
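
To double-check the VM placement above shortly before the window, the mapping can be re-derived on the eqiad Ganeti cluster master; a sketch, with the master host and output columns being assumptions:

```
# Sketch: confirm which Ganeti node each VM is on; run on the eqiad Ganeti cluster master.
# pnode/status are standard gnt-instance output fields; the grep range matches the hosts in the table.
sudo gnt-instance list -o name,pnode,status | grep -E 'ganeti101[3-8]'
```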

I have listed the teams, and subscribed relevant individuals to this task, based on the server names and the info here: https://wikitech.wikimedia.org/wiki/SRE/Infrastructure_naming_conventions. Don't hesitate to add people I may have missed, or to remove yourself from the task if you do not need to be involved.

Please update the tables if action needs to be taken for any servers/VMs, list the current status of any required action, and set the status to 'Complete' once the work has been done.
Days Before:
  • Prepare config changes (netops)
1h Before Window:
  • Confirm switches are in a healthy state (snapshot MAC and ARP tables, port status, buffer usage; see the sketch after this checklist) (netops)
  • Warn people of the upcoming maintenance (netops)
After The Change:
  • Confirm switches are in a healthy state (snapshot MAC and ARP tables, port status, buffer usage, validate against prior values). (netops)
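
A rough sketch of the pre/post health snapshot described in the checklist, run from a management host against the row B switch stack. The asw2-b-eqiad name and the exact Junos commands are assumptions and may differ per platform:

```
# Sketch only: snapshot switch state before the change, repeat afterwards and diff the files.
# "asw2-b-eqiad" is an assumed hostname for the row B switch stack; adjust to the real device.
SW=asw2-b-eqiad
for cmd in "show ethernet-switching table" \
           "show arp no-resolve" \
           "show interfaces terse" \
           "show class-of-service shared-buffer"; do
  ssh "$SW" "$cmd" > "/tmp/rowB-pre-$(echo "$cmd" | tr ' ' '-').txt"
done
```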

Event Timeline

Looking at Ganeti VMs, they fall into four categories:

SPOF, will need a maintenance window declared:

  • otrs1001
  • an-tool1008
  • an-tool1009

Will need to be depooled (or cordoned, or failed over to the secondary instance); a sketch follows this list:

  • ldap-replica1001
  • ldap-replica1003
  • kubernetes1015
  • schema1003
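
For the depool/cordon category, a sketch of the likely steps for kubernetes1015 and schema1003 (flags and wrappers are assumptions; the ldap-replica failover is service-specific and not shown):

```
# Sketch only; exact flags depend on the kubectl version and local wrappers.

# kubernetes1015: cordon and drain so pods reschedule off the row B node, uncordon afterwards.
kubectl cordon kubernetes1015.eqiad.wmnet
kubectl drain kubernetes1015.eqiad.wmnet --ignore-daemonsets   # older kubectl may also need --delete-local-data
# ... maintenance window ...
kubectl uncordon kubernetes1015.eqiad.wmnet

# schema1003: depool from its service endpoint with the host-local wrapper, pool again afterwards.
sudo depool    # run on schema1003; `sudo pool` to revert
```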

Some temporary unavailability is fine:

  • debmonitor1002
  • kafka-test1006
  • kafka-test1007
  • kafka-test1008
  • kafka-test1009
  • kafka-test1010
  • dragonfly-supernode1001
  • an-test-ui1001
  • an-airflow1002
  • failoid1002
  • kubetcd1006
  • ml-etcd1001
  • ml-serve-ctrl1001
  • zookeeper-test1002
  • kubestagemaster1001
  • kubestagetcd1005

A failover/depool is only needed if Grafana/Logstash must remain available during the maintenance:

  • logstash1032

@cmooney Do the cloudsw switches get impacted by row B updates?

ema updated the task description.

@Gehel I'm not sure who to tag in for cloudelastic here. With 2 of them in this row, redundancy may need attention. WMCS is considering an entire cloud freeze or shutdown during this, so keeping services up is not a worry; I just want to make sure the state would be safe if two of the three are down briefly.

Ryan should be around tomorrow to double check, but cloudelastic should be resilient to a row failure. Worst case the service will be down for the duration of the outage, but if WMCS is frozen, that should not be an issue.

Change 707221 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Switchover m1-master from dbproxy1014 to dbproxy1012

https://gerrit.wikimedia.org/r/707221

Change 707221 merged by Marostegui:

[operations/dns@master] wmnet: Switchover m1-master from dbproxy1014 to dbproxy1012

https://gerrit.wikimedia.org/r/707221

m1-master.eqiad.wmnet switched over to dbproxy1012 which is on row A. Once this row is done, we need to revert that.
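
A quick way to confirm which proxy m1-master points at once the DNS change has been deployed (assuming m1-master.eqiad.wmnet is a CNAME to the active proxy, as the patch subject suggests):

```
# Sketch: verify the m1-master CNAME from a production host using the internal resolvers.
dig +short CNAME m1-master.eqiad.wmnet
# expected during the maintenance: dbproxy1012.eqiad.wmnet.  (row A)
# expected after the revert:       dbproxy1014.eqiad.wmnet.
```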

The Cloud team has decided we have too much in this row, and since an intentional cloud freeze could itself cause breakage, we are going to try to let it run during the change. 🎲

Legoktm triaged this task as Medium priority. Jul 26 2021, 11:50 PM
cmooney updated the task description.
cmooney updated the task description.

Mentioned in SAL (#wikimedia-operations) [2021-07-27T14:33:56Z] <mmandere> depool cp10[79-82].eqiad.wmnet - T286061

Icinga downtime set by mmandere@cumin1001 for 1:00:00 4 host(s) and their services with reason: Eqiad row B maintenance

cp[1079-1082].eqiad.wmnet

Icinga downtime set by mmandere@cumin1001 for 1:00:00 1 host(s) and their services with reason: Eqiad row B maintenance

authdns1001.wikimedia.org

Icinga downtime set by mmandere@cumin1001 for 1:00:00 1 host(s) and their services with reason: Eqiad row B maintenance

lvs1014.eqiad.wmnet
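
The downtimes above match what the standard downtime cookbook logs when run from cumin1001; a sketch, with the cookbook name and flags being assumptions:

```
# Sketch only: set a 1h Icinga downtime for the affected hosts from cumin1001.
sudo cookbook sre.hosts.downtime --hours 1 -r "Eqiad row B maintenance" \
  'cp[1079-1082].eqiad.wmnet,authdns1001.wikimedia.org,lvs1014.eqiad.wmnet'
```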

Mentioned in SAL (#wikimedia-operations) [2021-07-27T15:09:11Z] <mmandere> pool cp10[79-82].eqiad.wmnet - T286061

Mentioned in SAL (#wikimedia-operations) [2021-07-27T15:11:23Z] <mmandere> pool authdns1001.wikimedia.org - T286061

Mentioned in SAL (#wikimedia-operations) [2021-07-27T15:11:55Z] <marostegui> Move m1-master from dbproxy1012 to dbproxy1014 T286061

> m1-master.eqiad.wmnet switched over to dbproxy1012 which is on row A. Once this row is done, we need to revert that.

This has been reverted: https://gerrit.wikimedia.org/r/c/operations/dns/+/708214

Mentioned in SAL (#wikimedia-operations) [2021-07-27T15:17:07Z] <mmandere> pool lvs1014.eqiad.wmnet - T286061