
Switch buffer re-partition - Eqiad Row B
Open, Needs Triage · Public

Description

Planned for Tuesday, July 27th at 15:00 UTC (08:00 PDT / 11:00 EDT / 17:00 CEST)

Netops plan to adjust the buffer memory configuration of all switches in Eqiad Row B, to address tail drops observed on some of the devices, which are causing throughput issues.

This is an intrusive change: the Juniper documentation says it will bring all traffic on the row to a complete stop while the switches reconfigure themselves. Experience from the same change on Row D indicates that the interruption is very brief (no ping loss was observed), so the expected interruption is under 1 second, with no interface state changes or similar. As always, there is a small chance that something goes wrong (for instance, we hit a bug) and networking on the row is disrupted for longer. TL;DR: the interruption should be short enough that nobody notices, but the row should be considered "at risk".
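
For context, the change is a class-of-service shared-buffer adjustment on the Juniper switches. The sketch below shows the general shape of such a change on QFX-family devices; the partition name and percentages are placeholders, not the actual values netops will apply:

```
# Illustrative Junos set commands only -- the real partitions and
# percentages are netops' call; shown to convey the shape of the change:
set class-of-service shared-buffer ingress buffer-partition lossy percent 80
set class-of-service shared-buffer egress buffer-partition lossy percent 80
commit
```

Per the vendor documentation referenced above, committing a shared-buffer change is what triggers the momentary forwarding stop.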

Service owners may want to take some proactive steps in advance to fail over / depool systems, to mitigate this elevated risk.
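
For hosts that support it, depooling is a single command on the host itself, the same `depool` command the table below references for the cp hosts. A minimal sketch:

```
# On an affected host, ahead of the window:
sudo depool   # drain the host from its load-balanced services
# After the window, once the host looks healthy again:
sudo pool     # return it to service
```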

The complete list of servers in this row can be found here:

https://netbox.wikimedia.org/dcim/devices/?q=&rack_group_id=6&status=active&role=server

Summarizing the physical servers / types:

| Server Name / Prefix | Count | Relevant Team | Action Required | Action Status |
| --- | --- | --- | --- | --- |
| mw | 39 | Service Operations | N | N/A |
| db | 21 | Data Persistence | N; consider failing over to db1117 if the maintenance takes longer than expected | N/A |
| an-worker | 14 | Analytics SREs | | |
| cloudvirt | 12 | WMCS | | |
| elastic | 9 | Search Platform SREs | | |
| ms-be | 9 | Data Persistence (Media Storage) | N | |
| ganeti | 6 | Infrastructure Foundations | N | N/A |
| analytics | 5 | Analytics SREs | | |
| wtp | 5 | Service Operations | N | N/A |
| cp | 4 | Traffic | Depool the individual hosts with the depool command | |
| es | 4 | Data Persistence | N | N/A |
| restbase | 4 | Service Operations | N | N/A |
| cloudcephmon | 3 | WMCS | | |
| cloudcephosd | 3 | WMCS | | |
| cloudvirt-wdqs | 3 | WMCS | | |
| kubernetes | 3 | Service Operations | N | N/A |
| wdqs | 3 | Search Platform SREs | | |
| clouddb | 2 | WMCS, with support from DBAs | | |
| cloudelastic | 2 | WMCS / Search Platform SREs | N | N/A |
| cloudnet | 2 | WMCS | | |
| dbproxy | 2 | Data Persistence | dbproxy1014 requires action, dbproxy1015 doesn't | Done |
| druid | 2 | Analytics | | |
| mc | 2 | Service Operations | N | N/A |
| ores | 2 | Machine Learning SREs | | |
| snapshot | 2 | Service Operations & Platform Engineering | | |
| thumbor | 2 | Service Operations (& Performance) | N | N/A |
| an-conf1001 | 1 | Analytics | | |
| an-coord1001 | 1 | Analytics SREs | | |
| an-launcher1002 | 1 | Analytics SREs | | |
| an-master1002 | 1 | Analytics SREs | | |
| an-presto1004 | 1 | Analytics SREs | | |
| aqs1008 | 1 | Analytics SREs | | |
| atlas-eqiad | 1 | Infrastructure Foundations | | |
| authdns1001 | 1 | Traffic | Manual depool needed | |
| backup1003 | 1 | Data Persistence | Heads up to Jaime beforehand | |
| cloudcontrol1004 | 1 | WMCS | | |
| cloudservices1003 | 1 | WMCS | | |
| conf1005 | 1 | Service Operations | | |
| dbprov1002 | 1 | Data Persistence | N | N/A |
| dumpsdata1001 | 1 | Service Operations & Platform Engineering | | |
| gerrit1001 | 1 | Service Operations (Supportive Services) & Release Engineering | | |
| graphite1004 | 1 | Observability | N | |
| kafka-jumbo1003 | 1 | Analytics SREs & Infrastructure Foundations | | |
| kafka-main1002 | 1 | Analytics | | |
| kubestage1002 | 1 | Service Operations | N | N/A |
| labweb1001 | 1 | WMCS | | |
| logstash1011 | 1 | Observability | N | |
| lvs1014 | 1 | Traffic | Fail over to secondary (lvs1016 in row D) by stopping pybal with puppet disabled | |
| maps1002 | 1 | ?? | | |
| mwmaint1002 | 1 | Service Operations | N | N/A |
| pc1008 | 1 | SRE Data Persistence (DBAs), with support from Platform and Performance | N | N/A |
| prometheus1004 | 1 | Observability | N | |
| puppetmaster1001 | 1 | Infrastructure Foundations | Disable puppet fleet-wide | |
| rdb1009 | 1 | Service Operations | | |
| relforge1004 | 1 | Search Platform SREs | | |
| restbase-dev1005 | 1 | Platform Engineering ?? | | |
| stat1007 | 1 | Analytics SREs | | |
| thanos-be1002 | 1 | Observability | N | |
| thanos-fe1002 | 1 | Observability | N | |
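
Two of the actions above are worth spelling out. A sketch, assuming the standard WMF disable-puppet/enable-puppet wrappers and the cumin 'A:all' host alias; the reason string is a placeholder:

```
# lvs1014: fail over to lvs1016 (row D) by stopping pybal with
# puppet disabled, so the agent does not restart it mid-window:
sudo disable-puppet "eqiad row B switch maintenance"
sudo systemctl stop pybal.service
# ...after the window...
sudo enable-puppet "eqiad row B switch maintenance"  # next puppet run restarts pybal

# puppetmaster1001: disable puppet fleet-wide, run from a cumin master:
sudo cumin 'A:all' 'disable-puppet "eqiad row B switch maintenance"'
```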

VMs in this row are as follows:

| VM Name | Ganeti Host | Team | Action Required | Action Status |
| --- | --- | --- | --- | --- |
| an-airflow1002 | ganeti1018 | Analytics SREs | | |
| an-test-ui1001 | ganeti1015 | Analytics | | |
| an-tool1008 | ganeti1015 | Analytics SREs | | |
| an-tool1009 | ganeti1016 | Analytics SREs | | |
| debmonitor1002 | ganeti1014 | Infrastructure Foundations | N | N/A |
| dragonfly-supernode1001 | ganeti1017 | Service Operations | | |
| failoid1002 | ganeti1018 | Infrastructure Foundations | N | N/A |
| kafka-test1006 | ganeti1014 | Analytics ? | | |
| kafka-test1007 | ganeti1013 | Analytics ? | | |
| kafka-test1008 | ganeti1016 | Analytics ? | | |
| kafka-test1009 | ganeti1013 | Analytics ? | | |
| kafka-test1010 | ganeti1015 | Analytics ? | | |
| kubernetes1015 | ganeti1015 | Service Operations | N | N/A |
| kubestagemaster1001 | ganeti1014 | Service Operations | N | N/A |
| kubestagetcd1005 | ganeti1013 | Service Operations | N | N/A |
| kubetcd1006 | ganeti1013 | Service Operations | N | N/A |
| ldap-replica1003 | ganeti1017 | Infrastructure Foundations | Y | |
| logstash1032 | ganeti1013 | Observability | N | |
| ml-etcd1001 | ganeti1014 | ML team | | |
| ml-serve-ctrl1001 | ganeti1017 | ML team ?? | | |
| otrs1001 | ganeti1014 | Service Operations ?? | | |
| schema1003 | ganeti1015 | Analytics SREs & Service Operations | | |
| zookeeper-test1002 | ganeti1013 | Analytics ?? | | |

I have listed the teams, and subscribed relevant individuals to this task, based on the server names and the info here: https://wikitech.wikimedia.org/wiki/SRE/Infrastructure_naming_conventions. Don't hesitate to add people I may have missed, or to remove yourself from the task if you do not need to be involved.

Kindly update the tables if action needs to be taken for any of the servers/VMs. Please also list the current status of the action, if one is required, and set the status to 'Complete' once the work has been done.
Days Before:
  • Prepare config changes (netops)
1h Before Window:
  • Confirm switches are in a healthy state (snapshot MAC and ARP tables, port status, buffer usage) (netops)
  • Warn people of the upcoming maintenance (netops)
After The Change:
  • Confirm switches are in a healthy state (snapshot MAC and ARP tables, port status, buffer usage, validate against prior values). (netops)
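
The snapshots can be plain captures of the relevant operational tables on each switch; a sketch of the Junos show commands involved (the buffer command in particular varies by platform):

```
show ethernet-switching table         # MAC table
show arp no-resolve                   # ARP table
show interfaces terse                 # port/link status
show class-of-service shared-buffer   # buffer partitioning/allocation
```

Saving the output before and after makes the "validate against prior values" step a simple diff.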

Event Timeline

Looking at the Ganeti VMs, they fall into four categories:

SPOF, will need a maintenance window declared:

  • otrs1001
  • an-tool1008
  • an-tool1009

Will need to be depooled (or cordoned, or failed over to the secondary instance); a cordon sketch for kubernetes1015 follows this list:

  • ldap-replica1001
  • ldap-replica1003
  • kubernetes1015
  • schema1003
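
For kubernetes1015 specifically, the cordon amounts to something like the following; the FQDN node name and kubectl flags are assumptions, not taken from this task:

```
# Mark the node unschedulable and evict its pods ahead of the window:
kubectl cordon kubernetes1015.eqiad.wmnet
kubectl drain kubernetes1015.eqiad.wmnet --ignore-daemonsets
# After the window:
kubectl uncordon kubernetes1015.eqiad.wmnet
```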

Some temporary unavailability is fine:

  • debmonitor1002
  • kafka-test1006
  • kafka-test1007
  • kafka-test1008
  • kafka-test1009
  • kafka-test1010
  • dragonfly-supernode1001
  • an-test-ui1001
  • an-airflow1002
  • failoid1002
  • kubetcd1006
  • ml-etcd1001
  • ml-serve-ctrl1001
  • zookeeper-test1002
  • kubestagemaster1001
  • kubestagetcd1005

A failover/depool is needed in case Grafana/Logstash must be available during the maintenance:

  • logstash1032

@cmooney Do the cloudsw switches get impacted by row B updates?

ema updated the task description.

@Gehel I'm not sure who to tag in for cloudelastic here. If there are 2 of them in this row, that could be something that requires attention for redundancy. WMCS is considering an entire cloud freeze or shutdown during this, so there are no worries about keeping services up. I just wanted to make sure the state would be safe if two of the three are down briefly.

Ryan should be around tomorrow to double check, but cloudelastic should be resilient to a row failure. Worst case the service will be down for the duration of the outage, but if WMCS is frozen, that should not be an issue.

Change 707221 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Switchover m1-master from dbproxy1014 to dbproxy1012

https://gerrit.wikimedia.org/r/707221

Change 707221 merged by Marostegui:

[operations/dns@master] wmnet: Switchover m1-master from dbproxy1014 to dbproxy1012

https://gerrit.wikimedia.org/r/707221

m1-master.eqiad.wmnet has been switched over to dbproxy1012, which is in row A. Once this row's maintenance is done, we need to revert that.
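
For reference, the switchover itself is a one-record change in operations/dns; roughly the following, with the record shape and TTL assumed rather than copied from the repo:

```
; templates/wmnet (illustrative)
; before: m1-master  1H  IN CNAME  dbproxy1014.eqiad.wmnet.
m1-master    1H  IN CNAME  dbproxy1012.eqiad.wmnet.
```

Reverting after the maintenance is the same change with the two hostnames swapped.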