
Switch buffer re-partition - Eqiad Row D
Closed, Resolved (Public)

Description

Planned for Tues July 20th at 15:00 UTC (08:00 PDT / 11:00 EDT / 17:00 CEST)

UPDATE: Work completed without incident at the scheduled time. Resolving task.

Netops plan to adjust the buffer memory configuration for all switches in Eqiad Row D, to address tail drops observed on some of the devices, which are causing throughput issues.

This is an intrusive change, and will bring all traffic on the row to a complete stop for a short time while the switches reconfigure themselves.

All services should have row redundancy, but we may want to take some proactive steps in advance, such as depooling servers, to make things go smoothly.
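
For illustration, depooling a host ahead of the window would roughly look like the following; this is a minimal sketch assuming conftool-managed services, and the hostname is just an example from this row:

```
# Run on the host itself: marks it as not pooled in conftool/etcd so
# PyBal/LVS stops sending it traffic (re-pool afterwards with `pool`).
sudo depool

# Or drive it from a cumin/conftool host for a specific service host
# (hostname below is an example only):
sudo confctl select 'name=cp1087.eqiad.wmnet' set/pooled=no
# ...and after the window:
sudo confctl select 'name=cp1087.eqiad.wmnet' set/pooled=yes
```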

The exact duration of the impact is unknown at this time; we hope to test on some real switches before the date to get a firm indication. The best estimate is that it will be on the order of seconds, and certainly no longer than a minute, but we should plan for up to a 5-minute interruption and be aware, as always, that there is a small chance something will go wrong and cause a longer disturbance.

The complete list of servers in this row can be found here:

https://netbox.wikimedia.org/dcim/devices/?q=&rack_group_id=8&status=active&role=server

Summary of server types in the row as follows:

| Server Name / Prefix | Count | Relevant Team | Action Required | Action Status |
| --- | --- | --- | --- | --- |
| mw | 36 | Service Operations | N | N/A |
| db | 20 | Data Persistence | N | N/A |
| an-worker | 14 | Analytics SREs | N | N/A |
| ms-be | 9 | Data Persistence (Media Storage) | N | N/A |
| elastic | 8 | Search Platform SREs | N | N/A |
| wtp | 6 | Service Operations | N | N/A |
| analytics | 4 | Analytics SREs | N | N/A |
| cp | 4 | Traffic | Depool the individual hosts with the depool command | Complete |
| ganeti | 4 | Infrastructure Foundations | N | N/A |
| mc | 4 | Service Operations | N | N/A |
| restbase | 4 | Service Operations | N | N/A |
| druid | 3 | Analytics | N | N/A |
| es | 3 | Data Persistence | N | N/A |
| kafka-jumbo | 3 | Analytics SREs & Infrastructure Foundations | N | N/A |
| kubernetes | 3 | Service Operations | N | N/A |
| ores | 3 | Machine Learning SREs | N | N/A |
| an-presto | 2 | Analytics | N | N/A |
| aqs | 2 | Analytics SREs | N | N/A |
| clouddb | 2 | WMCS, with support from DBAs | Y | Done |
| cloudstore | 2 | WMCS | N | N/A |
| dbproxy | 2 | Data Persistence | N | N/A |
| kafka-main | 2 | Analytics | Y - Keith to depool in advance | Complete |
| rdb | 2 | Service Operations | N - but we want to do a quick check afterwards | N/A |
| stat | 2 | Analytics SREs | N | N/A |
| thumbor | 2 | Service Operations (& Performance) | N | N/A |
| wdqs | 2 | Search Platform SREs | N | N/A |
| an-conf1003 | 1 | Analytics | N | N/A |
| an-test-coord1001 | 1 | Analytics | N | N/A |
| an-test-worker1003 | 1 | Analytics | N | N/A |
| backup1001 | 1 | Data Persistence | Heads up to Jaime before | Complete |
| bast1003 | 1 | Infrastructure Foundations | Tell people in advance to use a different bastion | Complete |
| centrallog1001 | 1 | Observability | N | N/A |
| cloudelastic1004 | 1 | Search Platform | N | N/A |
| conf1006 | 1 | Service Operations | Y - advise traffic team afterwards so they can restart any PyBal instances connected to this | To be done after |
| dbstore1007 | 1 | Analytics SREs & Data Persistence | N | N/A |
| dns1002 | 1 | Traffic | Y - depool ahead of change | Complete |
| dumpsdata1002 | 1 | Service Operations & Platform Engineering | N | N/A |
| flerovium | 1 | Analytics | N | N/A |
| labstore1007 | 1 | WMCS | N | N/A |
| labweb1002 | 1 | WMCS | N | N/A |
| logstash1012 | 1 | Observability | N | N/A |
| lvs1016 | 1 | Traffic | N | N/A |
| maps1004 | 1 |  | N | N/A |
| mc-gp1003 | 1 | Service Operations | N | N/A |
| pc1010 | 1 | SRE Data Persistence (DBAs), with support from Platform and Performance | N | N/A |
| puppetmaster1002 | 1 | Infrastructure Foundations | Disable puppet fleet wide (see the sketch after this table) | Complete |
| sessionstore1003 | 1 | Service Operations | N | N/A |
| snapshot1009 | 1 | Service Operations & Platform Engineering | Y | Complete |
| thanos-be1004 | 1 | Observability | N | N/A |
| thorium | 1 | Analytics | N | N/A |
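
For the fleet-wide puppet disable noted against puppetmaster1002 above, a minimal sketch using cumin and the puppet wrapper scripts might look like the following; the cumin alias and reason string are illustrative:

```
# From a cumin host: disable puppet everywhere with a reason pointing at this task.
sudo cumin 'A:all' 'disable-puppet "eqiad row D switch maintenance - T286069"'

# Once the switches are confirmed healthy, re-enable with the same reason string.
sudo cumin 'A:all' 'enable-puppet "eqiad row D switch maintenance - T286069"'
```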

The VMs on this row are as follows:

| VM Name | Ganeti Host | Team | Action Required | Action Status |
| --- | --- | --- | --- | --- |
| an-airflow1003 | ganeti1021 | Analytics SREs | N | N/A |
| an-test-druid1001 | ganeti1020 | Analytics | N | N/A |
| an-test-presto1001 | ganeti1020 | Analytics | N | N/A |
| aphlict1001 | ganeti1020 | Service Operations | N | N/A |
| chartmuseum1001 | ganeti1019 | Service Operations | N | N/A |
| cuminunpriv1001 | ganeti1020 | Infrastructure Foundations | N | N/A |
| dbmonitor1002 | ganeti1019 | Data Persistence | N | N/A |
| dborch1001 | ganeti1019 | Data Persistence | N | N/A |
| doh1002 | ganeti1022 | Traffic | N | N/A |
| eventlog1003 | ganeti1022 | Analytics SREs | N | N/A |
| irc1001 | ganeti1019 | Infra Foundations | Y (failover to irc2001) | Complete |
| kubernetes1016 | ganeti1021 | Service Operations | N | N/A |
| ldap-replica1004 | ganeti1021 | Infrastructure Foundations | Y | Complete |
| logstash1030 | ganeti1020 | Observability | N | N/A |
| logstash1031 | ganeti1019 | Observability | N | N/A |
| ml-etcd1003 | ganeti1019 | ML team | N | N/A |
| ml-serve-ctrl1002 | ganeti1020 | ML team | N | N/A |
| puppetboard1002 | ganeti1021 | Infrastructure Foundations | N | N/A |
| releases1002 | ganeti1019 | Service Operations | N | N/A |
| schema1004 | ganeti1019 | Analytics SREs & Service Operations | N | N/A |
| search-loader1001 | ganeti1020 | Search Platform SREs | N | N/A |
| testreduce1001 | ganeti1019 | Service Operations | N | N/A |
| xhgui1001 | ganeti1020 | Performance | N | N/A |

I have listed the teams and subscribed relevant individuals to this task, based mostly on the server names and the info here: https://wikitech.wikimedia.org/wiki/SRE/Infrastructure_naming_conventions. Don't hesitate to add people I may have missed, or to remove yourself from the task if you do not need to be involved.

Please update the tables if action needs to be taken for any servers/VMs, list the current status of any required action, and set the status to 'Complete' once the work has been done.
Days Before:
  • Prepare config changes (netops)
1h Before Window:
  • Confirm switches are in a healthy state (snapshot MAC and ARP tables, port status, buffer usage) (netops)
  • Warn people of the upcoming maintenance (netops)
After The Change:
  • Confirm switches are in a healthy state (snapshot MAC and ARP tables, port status, buffer usage, validate against prior values). (netops)
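
As a rough illustration of those pre/post snapshots, something like the script below could be run against each member switch. The switch name, SSH access, and output paths are assumptions; the show commands are the standard Junos ones for the MAC table, ARP, port status, and queue/buffer drops.

```
#!/bin/bash
# Capture a labelled snapshot of basic switch state so the pre- and post-change
# outputs can be diffed. Usage: ./switch-snapshot.sh asw2-d-eqiad pre
SWITCH="$1"      # switch hostname (example only)
PHASE="$2"       # "pre" or "post"
OUTDIR="/tmp/${SWITCH}-${PHASE}"
mkdir -p "$OUTDIR"

for CMD in "show ethernet-switching table" \
           "show arp no-resolve" \
           "show interfaces terse" \
           "show interfaces queue"; do
    ssh "$SWITCH" "$CMD" > "${OUTDIR}/$(echo "$CMD" | tr ' ' '_').txt"
done

# After both phases have been captured:
#   diff -r /tmp/asw2-d-eqiad-pre /tmp/asw2-d-eqiad-post
```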

Event Timeline

Marostegui added a subscriber: razzi.

@Bstorm clouddb1019 and clouddb1020 are on this rack.
@razzi dbstore1007 is on this rack.

cmooney added a subscriber: aborrero.

@Bstorm / @aborrero as mentioned on IRC, I messed up the list of servers here, inadvertently including those in the row connected to cloudsw1-d5-eqiad, which will not be affected by this.

I've removed the entries for cloudcephosd / cloudvirt / cloudgw1002 from the above list to correct this. Apologies for the confusion, but I hope that makes things a little simpler for you.

The cloudstore servers are both in the same rack. They are a cluster, so it will simply be offline. We will make a task to verify that it comes back up and is replicating again afterward.

The clouddb servers will best be failed over, I think.

I tend to feel that Search Platform might have more input if anything needs to be done for cloudelastic @cmooney. WMCS only keeps them physically running.

cmooney updated the task description.

Change 705000 had a related patch set uploaded (by Holger Knust; author: Holger Knust):

[operations/puppet@production] Swap dumper (snapshot1009) and testbed (1013) in preparation for T286069

https://gerrit.wikimedia.org/r/705000

Change 705000 merged by ArielGlenn:

[operations/puppet@production] Swap dumper (snapshot1009) and testbed (1013) in preparation for T286069

https://gerrit.wikimedia.org/r/705000

Mentioned in SAL (#wikimedia-operations) [2021-07-20T14:36:58Z] <vgutierrez> depool cp[1087-1090].eqiad.wmnet - T286069

Icinga downtime set by vgutierrez@cumin1001 for 1:00:00 4 host(s) and their services with reason: eqiad row D maintenance

cp[1087-1090].eqiad.wmnet

Icinga downtime set by vgutierrez@cumin1001 for 1:00:00 1 host(s) and their services with reason: eqiad row D maintenance

dns1002.wikimedia.org

Icinga downtime set by vgutierrez@cumin1001 for 1:00:00 1 host(s) and their services with reason: eqiad row D maintenance

lvs1016.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2021-07-20T15:21:10Z] <vgutierrez> pool cp[1087-1090].eqiad.wmnet - T286069

All work is complete with no signs of any issues; I saw no ping loss on 16 pings towards 2 hosts connected off each member switch.

Very happy all went well; thanks for all the effort to get this over the line. It is a very sensitive change, so precaution was best, and we should still be cautious for the remainder. I will confirm all required post-actions are complete and then set the task status to resolved.

cmooney updated the task description.

I just tried to run puppet on an-coord1001 but got:

Notice: Skipping run of Puppet configuration client; administratively disabled (Reason: 'preform puppetdb maintance - T286069');

I enabled puppet, but it failed with:

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Failed to execute '/pdb/cmd/v1?checksum=a149dd01eba418d65e8a6493a405732b81876ad1&version=5&certname=an-coord1001.eqiad.wmnet&command=replace_facts&producer-timestamp=2021-08-31T14:20:48.985Z' on at least 1 of the following 'server_urls': https://puppetdb1002.eqiad.wmnet

I disabled puppet again with the same message.

Oh, sorry, @jbond is doing some maintenance and referenced the wrong Phab ticket. Ignore ^