
Switch buffer re-partition - Eqiad Row A
Closed, ResolvedPublic

Description

Planned for Thurs July 29th at 15:00 UTC (08:00 PDT / 11:00 EDT / 17:00 CEST)

Completed on schedule - no issues to report

Netops plan to adjust the buffer memory configuration for all switches in Eqiad Row A, to address tail drops observed on some of the devices, which are causing throughput issues.

This is an intrusive change: the Juniper documentation says it will bring all traffic on the row to a complete stop while the switches reconfigure themselves. Experience from the change to Row D indicates that this interruption is very brief (no ping loss was observed), so the expected interruption is less than 1 second, and no interface state changes or similar should occur. As always, there is a very small chance that something goes wrong (we hit a bug, etc.) and networking on the row is disrupted for longer. TL;DR: the interruption should be short enough that nobody notices, but the row should be considered "at risk".

Service owners may want to take some proactive steps in advance to fail over / depool systems, to address this elevated risk.
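
As a purely illustrative aid (not part of the plan), a minimal sketch of the kind of pre-emptive depool/repool a service owner might script, assuming SSH access and the conftool-backed `depool`/`pool` wrappers available on production hosts; the host list is an example only:

```python
#!/usr/bin/env python3
"""Illustrative only: pre-emptively depool a set of hosts, then repool them.

Assumes SSH access and that each host has the conftool-backed `depool` and
`pool` wrappers; the host list below is an example, not a recommendation.
"""
import subprocess

EXAMPLE_HOSTS = [
    "cp1075.eqiad.wmnet",
    "cp1076.eqiad.wmnet",
    "cp1077.eqiad.wmnet",
    "cp1078.eqiad.wmnet",
]

def run_on_hosts(command: str) -> None:
    """Run a single command on every host, stopping on the first failure."""
    for host in EXAMPLE_HOSTS:
        subprocess.run(["ssh", host, "sudo", command], check=True)

if __name__ == "__main__":
    run_on_hosts("depool")  # before the maintenance window
    input("Press Enter once the switch change is confirmed healthy...")
    run_on_hosts("pool")    # repool afterwards
```

Teams with their own failover tooling would of course use that instead.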

The complete list of servers in this row can be found here:

https://netbox.wikimedia.org/dcim/devices/?q=&rack_group_id=5&status=active&role=server
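
For convenience, a hedged sketch of pulling the same list via the NetBox REST API, assuming the API accepts the same filter parameters as the UI query above and that a read-only API token is available in NETBOX_TOKEN:

```python
#!/usr/bin/env python3
"""Illustrative sketch: fetch the same Row A server list via NetBox's REST API.

Assumes the API accepts the same filters as the UI query above (rack_group_id,
status, role) and that NETBOX_TOKEN contains a valid read-only API token.
"""
import os
import requests

API_URL = "https://netbox.wikimedia.org/api/dcim/devices/"
PARAMS = {"rack_group_id": 5, "status": "active", "role": "server", "limit": 500}
HEADERS = {"Authorization": f"Token {os.environ['NETBOX_TOKEN']}"}

response = requests.get(API_URL, params=PARAMS, headers=HEADERS, timeout=30)
response.raise_for_status()

# Print one device name per line, e.g. to feed a downtime or depool script.
for device in response.json()["results"]:
    print(device["name"])
```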

Summarizing the physical servers / types:

| Server Name / Prefix | Count | Relevant Team | Action Required | Action Status |
| --- | --- | --- | --- | --- |
| mw | 28 | Service Operations | N | N/A |
| db | 23 | Data Persistence | Y: T284622 T286042 | Done |
| an-worker | 14 | Analytics SREs | N | N/A |
| ms-be | 10 | Data Persistence (Media Storage) | N | N/A |
| elastic | 9 | Search Platform SREs | N | N/A |
| wtp | 6 | Service Operations | N | N/A |
| analytics | 5 | Analytics SREs | N | N/A |
| es | 5 | Data Persistence | N | N/A |
| mc | 5 | Service Operations | N | N/A |
| cp | 4 | Traffic | Depool the individual hosts with the depool command | Complete |
| ganeti | 4 | Infrastructure Foundations | N | N/A |
| kubernetes | 4 | Service Operations | N | N/A |
| restbase | 4 | PET @hnowlan | N | N/A |
| clouddb | 3 | WMCS and Analytics, with support from DBAs | N | N/A |
| wdqs | 3 | Search Platform SREs | N | N/A |
| an-presto | 2 | Analytics SREs | N | N/A |
| aqs | 2 | Analytics SREs | N | N/A |
| cloudelastic | 2 | WMCS | N | N/A |
| dbproxy | 2 | Data Persistence | Y: dbproxy1012 and dbproxy1013 need failover to other hosts | dbproxy1012: DONE; dbproxy1013: DONE |
| druid / an-druid | 3 | Analytics SREs | N | N/A |
| kafka-jumbo | 2 | Analytics SREs | N | N/A |
| ms-fe | 2 | Data Persistence (Media Storage) | N | N/A |
| ores | 2 | Machine Learning SREs | N | N/A |
| stat | 2 | Analytics SREs | N | N/A |
| an-master1001 | 1 | Analytics SREs | N | N/A |
| an-test-master1001 | 1 | Analytics SREs | N | N/A |
| an-test-worker1001 | 1 | Analytics SREs | N | N/A |
| cloudcontrol1003 | 1 | WMCS | N | N/A |
| cloudmetrics1002 | 1 | WMCS | N | N/A |
| cloudservices1004 | 1 | WMCS | N | N/A |
| conf1004 | 1 | Service Operations | N (Check after) | N/A |
| contint1001 | 1 | Service Operations | N | N/A |
| dbprov1001 | 1 | Data Persistence | N | N/A |
| dbstore1003 | 1 | Analytics SREs & Data Persistence | N | N/A |
| dns1001 | 1 | Traffic | | Complete |
| htmldumper1001 | 1 | | N | N/A |
| kafka-main1001 | 1 | Analytics | N | N/A |
| krb1001 | 1 | Infrastructure Foundations & Analytics SREs | N | N/A |
| kubestage1001 | 1 | Service Operations | N | N/A |
| labstore1006 | 1 | WMCS | N | N/A |
| logstash1010 | 1 | Observability | N | N/A |
| lvs1013 | 1 | Traffic | Failover to secondary (lvs1016 in row D) by stopping pybal with puppet disabled | Complete |
| maps1001 | 1 | PET @hnowlan | N | N/A |
| mc-gp1001 | 1 | Service Operations | N | N/A |
| ms-backup1001 | 1 | Data Persistence | N | N/A |
| netmon1002 | 1 | Infrastructure Foundations | N | N/A |
| pc1007 | 1 | SRE Data Persistence (DBAs), with support from Platform and Performance | N | N/A |
| pki1001 | 1 | Infrastructure Foundations | N | N/A |
| prometheus1003 | 1 | Observability | N | N/A |
| rdb1005 | 1 | Service Operations | N | N/A |
| relforge1003 | 1 | Search Platform SREs | N | N/A |
| restbase-dev1004 | 1 | Platform Engineering | N | N/A |
| sessionstore1001 | 1 | Service Operations | N | N/A |
| sodium | 1 | Infrastructure Foundations | N | N/A |
| thanos-be1001 | 1 | Observability | N | N/A |
| thanos-fe1001 | 1 | Observability | N | N/A |

And the list of VMs running on Ganeti hosts in row A:

| VM Name | Ganeti Host | Team | Action Required | Action Status |
| --- | --- | --- | --- | --- |
| apt1001 | ganeti1007 | Infrastructure Foundations | N | N/A |
| archiva1002 | ganeti1005 | Analytics SREs | N | N/A |
| d-i-test | ganeti1005 | Infrastructure Foundations | N | N/A |
| doc1001 | ganeti1005 | Service Operations | N | N/A |
| gitlab1001 | ganeti1007 | Service Operations | N | N/A |
| grafana1002 | ganeti1007 | Observability | N | N/A |
| idp1001 | ganeti1005 | Infrastructure Foundations | Y (ensure that 2001 is the active) | idp2001 is currently the active host |
| install1003 | ganeti1006 | Infrastructure Foundations | N | N/A |
| kafkamon1001 | ganeti1006 | Analytics SREs & Infrastructure Foundations | N | N/A |
| kafkamon1002 | ganeti1005 | Analytics SREs & Infrastructure Foundations | N | N/A |
| kubemaster1001 | ganeti1005 | Service Operations ?? | N | N/A |
| kubernetes1005 | ganeti1006 | Service Operations | N | N/A |
| kubestagetcd1004 | ganeti1005 | Service Operations | N | N/A |
| kubetcd1005 | ganeti1008 | Service Operations | N | N/A |
| ldap-corp1001 | ganeti1007 | Infrastructure Foundations | N | N/A |
| lists1001 | ganeti1007 | Service Operations and @Ladsgroup | Y | Done, maintenance window announced |
| logstash1007 | ganeti1006 | Observability | N | N/A |
| logstash1008 | ganeti1005 | Observability | N | N/A |
| logstash1023 | ganeti1005 | Observability | N | N/A |
| logstash1024 | ganeti1005 | Observability | N | N/A |
| moscovium | ganeti1006 | | N | N/A |
| mwdebug1002 | ganeti1006 | Service Operations | N | N/A |
| ncredir1002 | ganeti1005 | Traffic | N | N/A |
| netbox1001 | ganeti1005 | Infrastructure Foundations | N | N/A |
| netboxdb1001 | ganeti1005 | Infrastructure Foundations | N | N/A |
| orespoolcounter1003 | ganeti1005 | Machine Learning SREs | N | N/A |
| people1003 | ganeti1008 | Service Operations | N | N/A |
| planet1002 | ganeti1006 | Service Operations | N | N/A |
| poolcounter1004 | ganeti1005 | Service Operations | N | N/A |
| registry1003 | ganeti1005 | Service Operations | N | N/A |
| urldownloader1001 | ganeti1007 | Infrastructure Foundations | N | N/A |
| webperf1001 | ganeti1007 | Performance & Service Operations | N | N/A |
| webperf1002 | ganeti1006 | Performance & Service Operations | N | N/A |

I have listed the teams and subscribed relevant individuals to this task, based on the server names and the info here: https://wikitech.wikimedia.org/wiki/SRE/Infrastructure_naming_conventions. Don't hesitate to add people I may have missed, or to remove yourself from the task if you do not need to be involved.

Kindly update the tables above if action needs to be taken for any of the servers/VMs. Please also list the current status of any required action, and set the status to 'Complete' once the work has been done.
Days Before:
  • Prepare config changes (netops)
1h Before Window:
  • Confirm switches are in a healthy state (snapshot MAC and ARP tables, port status, buffer usage) (netops)
  • Warn people of the upcoming maintenance (netops)
After The Change:
  • Confirm switches are in a healthy state (snapshot MAC and ARP tables, port status, buffer usage, validate against prior values; see the sketch after this list). (netops)
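
A rough sketch of what that snapshot step could look like, assuming SSH access to the row's switch stack (the device name below is a placeholder) and the standard Junos show commands for the MAC table, ARP table and port status; buffer usage is omitted:

```python
#!/usr/bin/env python3
"""Illustrative sketch: snapshot switch state before/after the change for diffing.

The device name is a placeholder and SSH access is assumed; the `show` commands
are the standard Junos ones for the MAC table, ARP table and port status.
"""
import subprocess
import sys
from pathlib import Path

DEVICE = "asw2-a-eqiad.example"  # placeholder management name for the row's switches
COMMANDS = {
    "mac-table": "show ethernet-switching table",
    "arp-table": "show arp no-resolve",
    "ports": "show interfaces terse",
}

def snapshot(label: str) -> None:
    """Save each command's output under snapshot-<label>/ for later comparison."""
    outdir = Path(f"snapshot-{label}")
    outdir.mkdir(exist_ok=True)
    for name, command in COMMANDS.items():
        result = subprocess.run(["ssh", DEVICE, command],
                                capture_output=True, text=True, check=True)
        (outdir / f"{name}.txt").write_text(result.stdout)

if __name__ == "__main__":
    # e.g. run with "before" an hour ahead and "after" once done,
    # then: diff -ru snapshot-before snapshot-after
    snapshot(sys.argv[1] if len(sys.argv) > 1 else "before")
```

Diffing the before/after directories then gives a quick confirmation that nothing moved or flapped.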

Event Timeline

@Marostegui "as the standby host is on row A too" - that sounds like a SPOF to me and should be moved to a different row.
Due to the virtual-chassis nature of our network rows, we should assume that a whole row can fail at any time.

We try to do that as much as we can, but it is hard to keep all the variables in place, especially with all the HW replacements we've had lately.

@Marostegui Our only real option to test is on the new switches due to be installed under T277340. We are working with DC-Ops to try to get them installed as early as we can, and hope to have them in early in the week before this is planned. But we are dependent on DC-Ops to get it over the line, and they're flat out right now.

Ok, I think what we can do from our side is to get the replacement hosts ready, but without failing anything over to them; we can just have them ready in case the maintenance takes longer than a few seconds.

Speaking on behalf of:

dbprov1001
ms-backup1001
db1116

That could cause ongoing backup runs to fail, but that is "normal" (network outages happen) and they will just be detected and corrected afterwards. Only if the outage lasted days would we take action to "depool" them.

One thing we could do is mass-downtime hosts by row in advance, to prevent monitoring overload.

cmooney updated the task description.

@Andrew Just a heads up that cloudcontrol1003 is in the list. It might be fine and catch up on its own, but it could also crash RabbitMQ queues and generally require restarts afterwards, I think. I don't know if we need a separate task for that.

Legoktm added a subscriber: Ladsgroup.
Legoktm subscribed.

lists1001 is a SPOF currently, we'll probably just announce a downtime when we get closer to the actual time

@Ottomata One of the clouddbs is clouddb1021, FYI. I understand you likely won't be using it that late in the month, but I wanted to make sure your group noticed.

I can draft an announcement for the downtime of lists.wikimedia.org; maybe we can use the time to increase its capacity (more CPU?). I'll leave that to Legoktm.

The impacted clouddbs will be clouddb1013, clouddb1014 and clouddb1021.

I believe interrupting traffic on 2 of the 4 "web" replicas will be easier to absorb than interrupting the "analytics" replicas, based on query volume and scope.

Looking at Ganeti VMs, they broadly fall under three/four categories:

SPOF, will need a maintenance window declared:

  • archiva1002
  • lists1001

Will need to be depooled (or cordoned, or failed over to the secondary instance):

  • idp1001
  • kubemaster1001
  • kubernetes1005
  • ldap-corp1001
  • urldownloader1001
  • ncredir1002
  • registry1003

Some temporary unavailability is fine:

  • apt1001
  • d-i-test
  • doc1001
  • gitlab1001
  • kafkamon1001
  • kafkamon1002
  • logstash1007
  • logstash1008
  • install1003
  • kubestagetcd1004
  • kubetcd1005
  • moscovium
  • mwdebug1002
  • people1003
  • planet1002
  • webperf1001
  • webperf1002
  • netbox1001
  • netboxdb1001
  • poolcounter1004
  • orespoolcounter1003

A failover/depool is needed in case Grafana/Logstash are needed during the maintenance:

  • grafana1002
  • logstash1023
  • logstash1024

Looking at Ganeti VMs, they broadly fall under three/four categories:

SPOF, will need a maintenance window declared:

  • archiva1002

It should be fine to skip the maintenance window for a few seconds of unavailability in the Archiva use case (to ease Netops' work).

cmooney updated the task description.
Legoktm triaged this task as Medium priority. Jul 26 2021, 11:53 PM

Change 708646 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Failover m2-master to dbproxy1015

https://gerrit.wikimedia.org/r/708646

Change 708646 merged by Marostegui:

[operations/dns@master] wmnet: Failover m2-master to dbproxy1015

https://gerrit.wikimedia.org/r/708646

m2-master failed over from dbproxy1013 to dbproxy1015. Once the maintenance is done we need to revert this.
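
For anyone double-checking, a tiny sketch to confirm where the alias currently points, run from a host that can resolve the internal zone (the record name is taken from the patch subject; that it is a CNAME is an assumption):

```python
#!/usr/bin/env python3
"""Illustrative check: confirm which dbproxy the m2-master alias resolves to.

The record name is taken from the patch subject; that it is a CNAME in the
eqiad wmnet zone is an assumption.
"""
import socket

ALIAS = "m2-master.eqiad.wmnet"
canonical, _aliases, addresses = socket.gethostbyname_ex(ALIAS)
print(f"{ALIAS} -> {canonical} ({', '.join(addresses)})")
```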

cmooney updated the task description.

Mentioned in SAL (#wikimedia-operations) [2021-07-29T14:35:46Z] <mmandere> depool cp107[5-8].eqiad.wmnet - T286032

Icinga downtime set by mmandere@cumin1001 for 1:00:00 4 host(s) and their services with reason: Eqiad row A maintenance

cp[1075-1078].eqiad.wmnet

Icinga downtime set by mmandere@cumin1001 for 1:00:00 1 host(s) and their services with reason: Eqiad row A maintenance

dns1001.wikimedia.org

Icinga downtime set by mmandere@cumin1001 for 1:00:00 1 host(s) and their services with reason: Eqiad row A maintenance

lvs1013.eqiad.wmnet

m2-master failed over from dbproxy1013 to dbproxy1015. Once the maintenance is done we need to revert this.

This has been reverted via https://gerrit.wikimedia.org/r/c/operations/dns/+/708641

Mentioned in SAL (#wikimedia-operations) [2021-07-29T15:07:15Z] <mmandere> pool cp107[5-8].eqiad.wmnet - T286032

Mentioned in SAL (#wikimedia-operations) [2021-07-29T15:09:13Z] <mmandere> pool dns1001.wikimedia.org - T286032

Mentioned in SAL (#wikimedia-operations) [2021-07-29T15:11:57Z] <mmandere> pool lvs1013.eqiad.wmnet - T286032

cmooney updated the task description.