
Switch buffer re-partition - Eqiad Row C
Closed, Resolved · Public

Description

Planned for Thurs July 22nd at 15:00 UTC (08:00 PDT / 11:00 EDT / 17:00 CEST)

Completed on schedule, no issues to report.

Netops plan to adjust the buffer memory configuration for all switches in Eqiad Row C, to address tail drops observed on some of the devices, which is causing throughput issues.
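For context, this is the kind of shared-buffer repartition that Junos supports on QFX-series switches. The snippet below is illustrative only, not the actual change applied; the hierarchy is the standard `class-of-service shared-buffer` one, and the percentages are made-up placeholders:

```
# Illustrative only - not the actual change; on QFX the shared packet
# buffer is repartitioned under class-of-service, e.g.:
set class-of-service shared-buffer ingress buffer-partition lossy percent 80
set class-of-service shared-buffer egress buffer-partition lossy percent 80
```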

This is an intrusive change: the Juniper documentation says it will bring all traffic on the row to a complete stop for a short time while the switches reconfigure themselves. Experience from the equivalent change to Row D indicates that this interruption is very brief (no ping loss was observed), so the expected interruption is less than 1 second, with no interface state changes or similar. As always, there is a very small chance that something goes wrong (we hit a bug, etc.) and networking on the row is disrupted for longer. TL;DR: the interruption should be short enough that nobody notices, but the row should be considered "at risk".

Service owners may want to take some proactive steps in advance to fail over / depool systems, to mitigate this higher risk.

The complete list of servers in this row can be found here:

https://netbox.wikimedia.org/dcim/devices/?q=&rack_group_id=7&status=active&role=server
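For scripted checks, the same filter can be reproduced against the Netbox REST API. A minimal sketch, assuming the standard Netbox API path and a placeholder token; only the query parameters from the UI link above are taken from this task:

```python
from urllib.parse import urlencode

# Build the API equivalent of the UI query above (sketch only:
# rack_group_id=7 comes from the UI link; the token is a placeholder).
BASE = "https://netbox.wikimedia.org/api/dcim/devices/"
params = {"rack_group_id": 7, "status": "active", "role": "server"}
url = BASE + "?" + urlencode(params)
print(url)

# An actual fetch would look something like:
#   import requests
#   r = requests.get(url, headers={"Authorization": "Token <placeholder>"})
#   names = sorted(d["name"] for d in r.json()["results"])
```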

Summary of hosts by type here:

Server Name / Prefix | Count | Relevant Team | Action Required | Action Status
mw | 39 | Service Operations | N | N/A
db | 23 | Data Persistence | N | N/A
an-worker | 16 | Analytics SREs | N | N/A
elastic | 9 | Search Platform SREs | N | N/A
ms-be | 7 | Data Persistence (Media Storage) | N | N/A
wtp | 6 | Service Operations | N | N/A
analytics | 5 | Analytics SREs | N | N/A
mc | 5 | Service Operations | N | N/A
cp | 4 | Traffic | Depool the individual hosts with the depool command | Complete
dbproxy | 4 | Data Persistence | dbproxy1018 and dbproxy1019 owned by cloud-services-team; dbproxy1020 requires action after row D is done, dbproxy1021 doesn't | dbproxy1020 has been depooled by DBA
ganeti | 4 | Infrastructure Foundations | N | N/A
es | 3 | Data Persistence | N | N/A
kafka-jumbo | 3 | Analytics SREs & Infrastructure Foundations | N | N/A
kubernetes | 3 | Service Operations | N | N/A
clouddb | 2 | WMCS, with support from DBAs | Y |
labstore | 2 | WMCS | Y |
ms-fe | 2 | Data Persistence (Media Storage) | N | N/A
ores | 2 | Machine Learning SREs | N | N/A
wdqs | 2 | Search Platform SREs | N | N/A
alert1001 | 1 | Observability | N | N/A
an-conf1002 | 1 | Analytics | N | N/A
an-druid1002 | 1 | Analytics | N | N/A
an-test-master1002 | 1 | Analytics | N | N/A
an-test-worker1002 | 1 | Analytics | N | N/A
aqs1005 | 1 | Analytics SREs | N | N/A
backup1002 | 1 | Data Persistence | Heads up to Jaime before | N/A
cloudcontrol1005 | 1 | WMCS | |
cloudelastic1003 | 1 | WMCS | |
cloudmetrics1001 | 1 | WMCS | N | N/A
cumin1001 | 1 | Infrastructure Foundations | Y - tell other SREs to use cumin2002 instead that day; announce it a few days earlier, as some cookbooks take days to run | Complete
dbprov1003 | 1 | Data Persistence | N | N/A
dbstore1005 | 1 | Analytics SREs & Data Persistence | N | N/A
druid1002 | 1 | Analytics | N | N/A
dumpsdata1003 | 1 | Service Operations & Platform Engineering | Y |
kafka-main1003 | 1 | SRE | Y - Keith to depool in advance | Complete (re-pooled)
lvs1015 | 1 | Traffic | Failover to secondary (lvs1016 in row D) by stopping pybal with puppet disabled | Complete
maps1003 | 1 | | N | N/A
mc-gp1002 | 1 | Service Operations | N | N/A
ms-backup1002 | 1 | Data Persistence (?) | N | N/A
mwlog1002 | 1 | Service Operations | N | N/A
pc1009 | 1 | SRE Data Persistence (DBAs), with support from Platform and Performance | N | N/A
sessionstore1002 | 1 | Service Operations | N | N/A
thanos-be1003 | 1 | Observability | N | N/A
thanos-fe1003 | 1 | Observability | N | N/A
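The depool/failover actions in the table (the cp and lvs1015 rows) amount to roughly the following console steps. This is an illustrative sketch, not the exact commands run; the disable-puppet reason string is made up:

```
# On each cp host: remove it from service ahead of the window
sudo depool

# On lvs1015: disable Puppet so pybal stays down, then stop it;
# the secondary (lvs1016 in row D) takes over
sudo disable-puppet "eqiad row C maintenance - T286065"
sudo systemctl stop pybal

# After the window: re-enable Puppet (which restarts pybal) and re-pool
sudo enable-puppet "eqiad row C maintenance - T286065"
sudo pool
```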

VMs on this row are as follows:

VM Name | Ganeti Host | Team | Action Required | Action Status
acmechief-test1001 | ganeti1009 | Traffic | N | N/A
acmechief1001 | ganeti1009 | Traffic | N | N/A
an-airflow1001 | ganeti1010 | Analytics SREs | N | N/A
an-tool1005 | ganeti1009 | Analytics SREs | N | N/A
an-tool1007 | ganeti1010 | Analytics SREs | N | N/A
doc1002 | ganeti1009 | | N | N/A
doh1001 | ganeti1011 | Traffic | N | N/A
etherpad1002 | ganeti1009 | Service Operations | N | N/A
flowspec1001 | ganeti1009 | Infrastructure Foundations | N | N/A
idp-test1001 | ganeti1010 | Infrastructure Foundations | N | N/A
kubemaster1002 | ganeti1010 | Service Operations | N | N/A
kubernetes1006 | ganeti1009 | Service Operations | N | N/A
kubestagetcd1006 | ganeti1012 | Service Operations | N | N/A
kubetcd1004 | ganeti1010 | Service Operations | N | N/A
logstash1009 | ganeti1010 | Observability | N | N/A
logstash1025 | ganeti1009 | Observability | N | N/A
matomo1002 | ganeti1009 | Analytics | N | N/A
miscweb1002 | ganeti1009 | Service Operations | N | N/A
ml-etcd1002 | ganeti1012 | ML team | N | N/A
mwdebug1001 | ganeti1010 | Service Operations | N | N/A
mx1001 | ganeti1009 | Infrastructure Foundations | N | N/A
ncredir1001 | ganeti1009 | Traffic | N | N/A
netflow1001 | ganeti1012 | Infrastructure Foundations | N | N/A
orespoolcounter1004 | ganeti1010 | Machine Learning SREs | N | N/A
ping1001 | ganeti1009 | Infrastructure Foundations | N | N/A
poolcounter1005 | ganeti1010 | Service Operations | N | N/A
puppetboard1001 | ganeti1010 | Infrastructure Foundations | N | N/A
puppetdb1002 | ganeti1012 | Infrastructure Foundations | Y (disable Puppet fleet-wide during maintenance) | Complete
registry1004 | ganeti1009 | Service Operations | N | N/A
rpki1001 | ganeti1009 | Infrastructure Foundations | N | N/A
seaborgium | ganeti1010 | Infrastructure Foundations | N | N/A
urldownloader1002 | ganeti1010 | Infrastructure Foundations | N | N/A

I have listed the teams, and subscribed relevant individuals to this task, based mostly on the server names and the info here: https://wikitech.wikimedia.org/wiki/SRE/Infrastructure_naming_conventions. Don't hesitate to add people I may have missed, or to remove yourself from the task if you do not need to be involved.

Kindly update the tables if action needs to be taken for any servers/VMs. Please also note the current status of any required action, and set it to 'Complete' once the work has been done.
Days Before:
  • Prepare config changes (netops)
  • Email ops list as a reminder
1h Before Window:
  • Confirm switches are in a healthy state (snapshot MAC and ARP tables, port status, buffer usage) (netops)
  • Warn people of the upcoming maintenance on IRC (netops)
After The Change:
  • Confirm switches are in a healthy state (snapshot MAC and ARP tables, port status, buffer usage, validate against prior values). (netops)
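The snapshot itself can be taken with standard Junos operational commands, for example (illustrative; the exact queue/buffer command varies by platform, and the file paths are placeholders):

```
show ethernet-switching table | save /var/tmp/mac-before.txt
show arp no-resolve | save /var/tmp/arp-before.txt
show interfaces terse | save /var/tmp/ports-before.txt
show interfaces queue | save /var/tmp/queues-before.txt
```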

Event Timeline

Marostegui added a project: Analytics.
Marostegui added a subscriber: razzi.

@Bstorm I assume we are ok with having a glitch on clouddb1017 and 1018?
@razzi dbstore1005 is on this rack.

Looking at the Ganeti VMs, they fall into four categories:

SPOF, will need a maintenance window declared:

  • an-tool1005
  • an-tool1007
  • etherpad1002 (or don't care, given that it's best-effort)
  • matomo1002

Will need to be depooled (or cordoned, or failed over to the secondary instance):

  • acmechief1001
  • doh1001
  • kubemaster1002
  • kubernetes1006
  • poolcounter1005
  • ping1001
  • puppetdb1002 (we could stop Puppet through the maintenance window)
  • orespoolcounter1004
  • ncredir1001
  • registry1004
  • rpki1001
  • urldownloader1002

Some temporary unavailability is fine:

  • acmechief-test1001
  • doc1002
  • miscweb1002
  • ml-etcd1002
  • mwdebug1001
  • mx1001
  • an-airflow1001
  • flowspec1001
  • idp-test1001
  • kubestagetcd1006
  • kubetcd1004
  • netflow1001
  • puppetboard1001
  • seaborgium

A failover/depool is needed in case Grafana/Logstash must be available during the maintenance:

  • logstash1009
  • logstash1025

@aborrero does cloudgw require manual failover?

It doesn't require manual failover, but we could depool (or force a failover of) the server beforehand to soften any potential operational impact.

cmooney updated the task description.

@Bstorm / @aborrero as mentioned on IRC I messed up with the list of servers here, inadvertently including those in the row connected to cloudsw1-c8-eqiad, which will not be affected by this.

I've removed the entries for cloudcephosd / cloudvirt / cloudgw1001 from the above list now to correct this. Apologies for the confusion, but hope that makes things a little simpler for you.

Still need to confirm the window with Advancement, but it is looking ok right now. There will be some work on the FR-Tech side to ensure donors aren't impacted by any possible extended downtime.

> @Bstorm / @aborrero as mentioned on IRC I messed up with the list of servers here, inadvertently including those in the row connected to cloudsw1-c8-eqiad, which will not be affected by this.
>
> I've removed the entries for cloudcephosd / cloudvirt / cloudgw1001 from the above list now to correct this. Apologies for the confusion, but hope that makes things a little simpler for you.

That does!

@Dwisehaupt I have to apologize for a newbie error I made here. The FR-Tech servers' interfaces don't show what they're connected to in Netbox. I incorrectly assumed they were connected to the same ASWs as the other devices in the row (albeit on separate vlans behind the PFWs).

Faidon spotted my error and put me right: the hosts below are actually connected to fasw-c1a-eqiad and fasw-c1b-eqiad, and thus won't be affected by this change. I've removed them from the list now. My apologies for the confusion, and thanks for taking the time to review.

civi1001
fran1001
frauth1001
frban1001
frbast1001
frdata1002
frdb1002
frdb1003
frdb1004
frdev1001
frlog1001
frmon1001
frmx1001
frnetmon1001
frpig1001
frpm1001
frqueue1003
frqueue1004
pay-lvs1001
pay-lvs1002
payments1001
payments1005
payments1007
payments1008
ema updated the task description.

@cmooney Not a problem. I have updated the task to remove the pre/post tasks we were looking at. Always good for us to think about failure scenarios anyway. :)

Switching cloudmetrics to just eat the brief outage. I don't think it will be a big deal. We can just check it after.

The WMCS-owned dbproxy1018 and 1019 make up the entire cluster, so that's just an outage no matter what for wikireplicas. It will just need downtime, communication and checks afterward.

cmooney updated the task description.

Change 705384 had a related patch set uploaded (by ArielGlenn; author: ArielGlenn):

[operations/puppet@production] swap roles of dumpsdata1001, 1003 in prep for T286065

https://gerrit.wikimedia.org/r/705384

Change 705384 merged by ArielGlenn:

[operations/puppet@production] swap roles of dumpsdata1001, 1003 in prep for T286065

https://gerrit.wikimedia.org/r/705384

Change 705789 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Switch m3-master

https://gerrit.wikimedia.org/r/705789

Change 705789 merged by Marostegui:

[operations/dns@master] wmnet: Switch m3-master

https://gerrit.wikimedia.org/r/705789

I have switched m3-master from dbproxy1020 to dbproxy1016: https://gerrit.wikimedia.org/r/705789

cmooney updated the task description.

Mentioned in SAL (#wikimedia-operations) [2021-07-22T14:37:42Z] <mmandere> depool cp108[3-6].eqiad.wmnet - T286065

Icinga downtime set by mmandere@cumin2002 for 1:00:00 4 host(s) and their services with reason: Eqiad row C maintenance

cp[1083-1086].eqiad.wmnet

Icinga downtime set by mmandere@cumin2002 for 1:00:00 1 host(s) and their services with reason: Eqiad row C maintenance

lvs1015.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2021-07-22T15:11:20Z] <mmandere> pool cp108[3-6].eqiad.wmnet - T286065

All went very well with the change. This time I ran a rapid ping from the CR to see whether any packet loss was observed; some loss was detected, but it was extremely brief:

cmooney@re0.cr1-eqiad> ping 10.64.32.162 rapid count 100000   
PING 10.64.32.162 (10.64.32.162): 56 data bytes
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
<#####  output removed #####>
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
--- 10.64.32.162 ping statistics ---
100000 packets transmitted, 99998 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.262/0.503/35.295/0.519 ms
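For the record, the loss implied by those counters is tiny; Junos rounds the displayed percentage down to 0%, but the raw numbers above give:

```python
# Loss implied by the ping statistics above.
transmitted = 100_000
received = 99_998

lost = transmitted - received
loss_pct = 100 * lost / transmitted
print(f"{lost} packets lost ({loss_pct:.3f}%)")  # 2 packets lost (0.002%)
```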

Resolving task.

Change 708724 had a related patch set uploaded (by ArielGlenn; author: ArielGlenn):

[operations/puppet@production] Swap dumpsdata roles back now that network maintenance is complete

https://gerrit.wikimedia.org/r/708724

Change 708724 merged by ArielGlenn:

[operations/puppet@production] Swap dumpsdata roles back now that network maintenance is complete

https://gerrit.wikimedia.org/r/708724