
Switch buffer re-partition - Eqiad Row D
Closed, Resolved (Public)

Description

Planned for Tues July 20th at 15:00 UTC (08:00 PDT / 11:00 EDT / 17:00 CEST)

UPDATE: Work completed without incident at the scheduled time. Resolving task.

Netops plan to adjust the buffer memory configuration for all switches in Eqiad Row D, to address tail drops observed on some of the devices, which are causing throughput issues.

This is an intrusive change, and will bring all traffic on the row to a complete stop for a short time while the switches reconfigure themselves.

All services should have row redundancy, but we may want to take some proactive steps in advance, such as depooling servers, to make things go smoothly.
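
For illustration, depooling a host ahead of the window would roughly look like the following; this is a minimal sketch assuming conftool-managed services, and the hostname is just an example from this row:

```
# Run on the host itself: marks it as not pooled in conftool/etcd so
# PyBal/LVS stops sending it traffic (re-pool afterwards with `pool`).
sudo depool

# Or drive it from a cumin/conftool host for a specific service host
# (hostname below is an example only):
sudo confctl select 'name=cp1087.eqiad.wmnet' set/pooled=no
# ...and after the window:
sudo confctl select 'name=cp1087.eqiad.wmnet' set/pooled=yes
```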

The exact duration of the impact is unknown at this time; we hope to test on some real switches before the date to get a firm indication. The best estimate is that it will be on the order of seconds, and certainly no longer than a minute, but we should plan for up to a 5-minute interruption and be aware, as always, that there is a small chance something will go wrong and cause a longer disturbance.

The complete list of servers in this row can be found here:

https://netbox.wikimedia.org/dcim/devices/?q=&rack_group_id=8&status=active&role=server

Summary of server types in the row as follows:

| Server Name / Prefix | Count | Relevant Team | Action Required | Action Status |
| --- | --- | --- | --- | --- |
| mw | 36 | Service Operations | N | N/A |
| db | 20 | Data Persistence | N | N/A |
| an-worker | 14 | Analytics SREs | N | N/A |
| ms-be | 9 | Data Persistence (Media Storage) | N | N/A |
| elastic | 8 | Search Platform SREs | N | N/A |
| wtp | 6 | Service Operations | N | N/A |
| analytics | 4 | Analytics SREs | N | N/A |
| cp | 4 | Traffic | Depool the individual hosts with the depool command | Complete |
| ganeti | 4 | Infrastructure Foundations | N | N/A |
| mc | 4 | Service Operations | N | N/A |
| restbase | 4 | Service Operations | N | N/A |
| druid | 3 | Analytics | N | N/A |
| es | 3 | Data Persistence | N | N/A |
| kafka-jumbo | 3 | Analytics SREs & Infrastructure Foundations | N | N/A |
| kubernetes | 3 | Service Operations | N | N/A |
| ores | 3 | Machine Learning SREs | N | N/A |
| an-presto | 2 | Analytics | N | N/A |
| aqs | 2 | Analytics SREs | N | N/A |
| clouddb | 2 | WMCS, with support from DBAs | Y | Done |
| cloudstore | 2 | WMCS | N | N/A |
| dbproxy | 2 | Data Persistence | N | N/A |
| kafka-main | 2 | Analytics | Y - Keith to depool in advance | Complete |
| rdb | 2 | Service Operations | N - but we want to do a quick check afterwards | N/A |
| stat | 2 | Analytics SREs | N | N/A |
| thumbor | 2 | Service Operations (& Performance) | N | N/A |
| wdqs | 2 | Search Platform SREs | N | N/A |
| an-conf1003 | 1 | Analytics | N | N/A |
| an-test-coord1001 | 1 | Analytics | N | N/A |
| an-test-worker1003 | 1 | Analytics | N | N/A |
| backup1001 | 1 | Data Persistence | Heads up to Jaime before | Complete |
| bast1003 | 1 | Infrastructure Foundations | Tell people in advance to use a different bastion | Complete |
| centrallog1001 | 1 | Observability | N | N/A |
| cloudelastic1004 | 1 | Search Platform | N | N/A |
| conf1006 | 1 | Service Operations | Y - advise traffic team afterwards so they can restart any PyBal instances connected to this | To be done after |
| dbstore1007 | 1 | Analytics SREs & Data Persistence | N | N/A |
| dns1002 | 1 | Traffic | Y - depool ahead of change | Complete |
| dumpsdata1002 | 1 | Service Operations & Platform Engineering | N | N/A |
| flerovium | 1 | Analytics | N | N/A |
| labstore1007 | 1 | WMCS | N | N/A |
| labweb1002 | 1 | WMCS | N | N/A |
| logstash1012 | 1 | Observability | N | N/A |
| lvs1016 | 1 | Traffic | N | N/A |
| maps1004 | 1 |  | N | N/A |
| mc-gp1003 | 1 | Service Operations | N | N/A |
| pc1010 | 1 | SRE Data Persistence (DBAs), with support from Platform and Performance | N | N/A |
| puppetmaster1002 | 1 | Infrastructure Foundations | Disable puppet fleet wide (see the sketch after this table) | Complete |
| sessionstore1003 | 1 | Service Operations | N | N/A |
| snapshot1009 | 1 | Service Operations & Platform Engineering | Y | Complete |
| thanos-be1004 | 1 | Observability | N | N/A |
| thorium | 1 | Analytics | N | N/A |
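
For the fleet-wide puppet disable noted against puppetmaster1002 above, a minimal sketch using cumin and the puppet wrapper scripts might look like the following; the cumin alias and reason string are illustrative:

```
# From a cumin host: disable puppet everywhere with a reason pointing at this task.
sudo cumin 'A:all' 'disable-puppet "eqiad row D switch maintenance - T286069"'

# Once the switches are confirmed healthy, re-enable with the same reason string.
sudo cumin 'A:all' 'enable-puppet "eqiad row D switch maintenance - T286069"'
```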

The VMs on this row are as follows:

| VM Name | Ganeti Host | Team | Action Required | Action Status |
| --- | --- | --- | --- | --- |
| an-airflow1003 | ganeti1021 | Analytics SREs | N | N/A |
| an-test-druid1001 | ganeti1020 | Analytics | N | N/A |
| an-test-presto1001 | ganeti1020 | Analytics | N | N/A |
| aphlict1001 | ganeti1020 | Service Operations | N | N/A |
| chartmuseum1001 | ganeti1019 | Service Operations | N | N/A |
| cuminunpriv1001 | ganeti1020 | Infrastructure Foundations | N | N/A |
| dbmonitor1002 | ganeti1019 | Data Persistence | N | N/A |
| dborch1001 | ganeti1019 | Data Persistence | N | N/A |
| doh1002 | ganeti1022 | Traffic | N | N/A |
| eventlog1003 | ganeti1022 | Analytics SREs | N | N/A |
| irc1001 | ganeti1019 | Infra Foundations | Y (failover to irc2001) | Complete |
| kubernetes1016 | ganeti1021 | Service Operations | N | N/A |
| ldap-replica1004 | ganeti1021 | Infrastructure Foundations | Y | Complete |
| logstash1030 | ganeti1020 | Observability | N | N/A |
| logstash1031 | ganeti1019 | Observability | N | N/A |
| ml-etcd1003 | ganeti1019 | ML team | N | N/A |
| ml-serve-ctrl1002 | ganeti1020 | ML team | N | N/A |
| puppetboard1002 | ganeti1021 | Infrastructure Foundations | N | N/A |
| releases1002 | ganeti1019 | Service Operations | N | N/A |
| schema1004 | ganeti1019 | Analytics SREs & Service Operations | N | N/A |
| search-loader1001 | ganeti1020 | Search Platform SREs | N | N/A |
| testreduce1001 | ganeti1019 | Service Operations | N | N/A |
| xhgui1001 | ganeti1020 | Performance | N | N/A |

I have listed the teams and subscribed relevant individuals to this task, based mostly on the server names and the info here: https://wikitech.wikimedia.org/wiki/SRE/Infrastructure_naming_conventions. Don't hesitate to add people I may have missed, or to remove yourself from the task if you do not need to be involved.

Please update the tables if action needs to be taken for any servers/VMs, list the current status of any required action, and set the status to 'Complete' once the work has been done.
Days Before:
  • Prepare config changes (netops)
1h Before Window:
  • Confirm switches are in a healthy state (snapshot MAC and ARP tables, port status, buffer usage) (netops)
  • Warn people of the upcoming maintenance (netops)
After The Change:
  • Confirm switches are in a healthy state (snapshot MAC and ARP tables, port status, buffer usage, validate against prior values). (netops)
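
As a rough illustration of those pre/post snapshots, something like the script below could be run against each member switch. The switch name, SSH access, and output paths are assumptions; the show commands are the standard Junos ones for the MAC table, ARP, port status, and queue/buffer drops.

```
#!/bin/bash
# Capture a labelled snapshot of basic switch state so the pre- and post-change
# outputs can be diffed. Usage: ./switch-snapshot.sh asw2-d-eqiad pre
SWITCH="$1"      # switch hostname (example only)
PHASE="$2"       # "pre" or "post"
OUTDIR="/tmp/${SWITCH}-${PHASE}"
mkdir -p "$OUTDIR"

for CMD in "show ethernet-switching table" \
           "show arp no-resolve" \
           "show interfaces terse" \
           "show interfaces queue"; do
    ssh "$SWITCH" "$CMD" > "${OUTDIR}/$(echo "$CMD" | tr ' ' '_').txt"
done

# After both phases have been captured:
#   diff -r /tmp/asw2-d-eqiad-pre /tmp/asw2-d-eqiad-post
```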

Event Timeline

Marostegui added a subscriber: razzi.

@Bstorm clouddb1019 and clouddb1020 are on this rack.
@razzi dbstore1007 is on this rack.

cmooney added a subscriber: aborrero.

@Bstorm / @aborrero as mentioned on IRC, I messed up the list of servers here, inadvertently including those in the row connected to cloudsw1-d5-eqiad, which will not be affected by this.

I've removed the entries for cloudcephosd / cloudvirt / cloudgw1002 from the above list to correct this. Apologies for the confusion, but I hope that makes things a little simpler for you.

The cloudstore servers are both in the same rack. They are a cluster, so it will simply be offline. We will make a task to verify that it comes back up and is replicating again afterward.

The clouddb servers will best be failed over, I think.

I tend to feel that Search Platform might have more input if anything needs to be done for cloudelastic @cmooney. WMCS only keeps them physically running.

cmooney updated the task description.

Change 705000 had a related patch set uploaded (by Holger Knust; author: Holger Knust):

[operations/puppet@production] Swap dumper (snapshot1009) and testbed (1013) in preparation for T286069

https://gerrit.wikimedia.org/r/705000

Change 705000 merged by ArielGlenn:

[operations/puppet@production] Swap dumper (snapshot1009) and testbed (1013) in preparation for T286069

https://gerrit.wikimedia.org/r/705000

Mentioned in SAL (#wikimedia-operations) [2021-07-20T14:36:58Z] <vgutierrez> depool cp[1087-1090].eqiad.wmnet - T286069

Icinga downtime set by vgutierrez@cumin1001 for 1:00:00 4 host(s) and their services with reason: eqiad row D maintenance

cp[1087-1090].eqiad.wmnet

Icinga downtime set by vgutierrez@cumin1001 for 1:00:00 1 host(s) and their services with reason: eqiad row D maintenance

dns1002.wikimedia.org

Icinga downtime set by vgutierrez@cumin1001 for 1:00:00 1 host(s) and their services with reason: eqiad row D maintenance

lvs1016.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2021-07-20T15:21:10Z] <vgutierrez> pool cp[1087-1090].eqiad.wmnet - T286069

All work is complete with no signs of any issues; I saw no ping loss on 16 pings towards 2 hosts connected off each member switch.

Very happy all went well; thanks for all the effort to get this over the line. It is a very sensitive change, so precaution was best, and we should still be cautious for the remainder. I will confirm all required post-actions are complete and then set the task status to resolved.

cmooney updated the task description.

I just tried to run puppet on an-coord1001 but got:

Notice: Skipping run of Puppet configuration client; administratively disabled (Reason: 'preform puppetdb maintance - T286069');

I enabled puppet, but it failed with:

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Failed to execute '/pdb/cmd/v1?checksum=a149dd01eba418d65e8a6493a405732b81876ad1&version=5&certname=an-coord1001.eqiad.wmnet&command=replace_facts&producer-timestamp=2021-08-31T14:20:48.985Z' on at least 1 of the following 'server_urls': https://puppetdb1002.eqiad.wmnet

I disabled puppet again with the same message.

Oh, sorry, @jbond is doing some maintenance and referenced the wrong Phab ticket. Ignore ^