
Switch buffer re-partition - Eqiad Row C
Closed, Resolved · Public

Description

Planned for Thurs July 22nd at 15:00 UTC (08:00 PDT / 11:00 EDT / 17:00 CEST)

Completed on schedule, no issues to report.

Netops plan to adjust the buffer memory configuration for all switches in Eqiad Row C, to address tail drops observed on some of the devices, which is causing throughput issues.
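For context, this is the kind of shared-buffer repartition that Junos supports on QFX-series switches. The snippet below is illustrative only, not the actual change applied; the hierarchy is the standard `class-of-service shared-buffer` one, and the percentages are made-up placeholders:

```
# Illustrative only - not the actual change; on QFX the shared packet
# buffer is repartitioned under class-of-service, e.g.:
set class-of-service shared-buffer ingress buffer-partition lossy percent 80
set class-of-service shared-buffer egress buffer-partition lossy percent 80
```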

This is an intrusive change: the Juniper documentation says it will bring all traffic on the row to a complete stop for a short time while the switches reconfigure themselves. Experience from the equivalent change to Row D indicates that this interruption is very brief (no ping loss was observed), so the expected interruption is less than 1 second, with no interface state changes or similar. As always, there is a very small chance that something goes wrong (we hit a bug, etc.) and networking on the row is disrupted for longer. TL;DR: the interruption should be short enough that nobody notices, but the row should be considered "at risk".

Service owners may want to take some proactive steps in advance to fail over / depool systems, to mitigate this higher risk.

The complete list of servers in this row can be found here:

https://netbox.wikimedia.org/dcim/devices/?q=&rack_group_id=7&status=active&role=server
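For scripted checks, the same filter can be reproduced against the Netbox REST API. A minimal sketch, assuming the standard Netbox API path and a placeholder token; only the query parameters from the UI link above are taken from this task:

```python
from urllib.parse import urlencode

# Build the API equivalent of the UI query above (sketch only:
# rack_group_id=7 comes from the UI link; the token is a placeholder).
BASE = "https://netbox.wikimedia.org/api/dcim/devices/"
params = {"rack_group_id": 7, "status": "active", "role": "server"}
url = BASE + "?" + urlencode(params)
print(url)

# An actual fetch would look something like:
#   import requests
#   r = requests.get(url, headers={"Authorization": "Token <placeholder>"})
#   names = sorted(d["name"] for d in r.json()["results"])
```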

Summary of hosts by type here:

Server Name / Prefix | Count | Relevant Team | Action Required | Action Status
mw | 39 | Service Operations | N | N/A
db | 23 | Data Persistence | N | N/A
an-worker | 16 | Analytics SREs | N | N/A
elastic | 9 | Search Platform SREs | N | N/A
ms-be | 7 | Data Persistence (Media Storage) | N | N/A
wtp | 6 | Service Operations | N | N/A
analytics | 5 | Analytics SREs | N | N/A
mc | 5 | Service Operations | N | N/A
cp | 4 | Traffic | Depool the individual hosts with the depool command | Complete
dbproxy | 4 | Data Persistence | dbproxy1018 and dbproxy1019 owned by cloud-services-team; dbproxy1020 requires action after row D is done, dbproxy1021 doesn't | dbproxy1020 has been depooled by DBA
ganeti | 4 | Infrastructure Foundations | N | N/A
es | 3 | Data Persistence | N | N/A
kafka-jumbo | 3 | Analytics SREs & Infrastructure Foundations | N | N/A
kubernetes | 3 | Service Operations | N | N/A
clouddb | 2 | WMCS, with support from DBAs | Y |
labstore | 2 | WMCS | Y |
ms-fe | 2 | Data Persistence (Media Storage) | N | N/A
ores | 2 | Machine Learning SREs | N | N/A
wdqs | 2 | Search Platform SREs | N | N/A
alert1001 | 1 | Observability | N | N/A
an-conf1002 | 1 | Analytics | N | N/A
an-druid1002 | 1 | Analytics | N | N/A
an-test-master1002 | 1 | Analytics | N | N/A
an-test-worker1002 | 1 | Analytics | N | N/A
aqs1005 | 1 | Analytics SREs | N | N/A
backup1002 | 1 | Data Persistence | Heads up to Jaime before | N/A
cloudcontrol1005 | 1 | WMCS | |
cloudelastic1003 | 1 | WMCS | |
cloudmetrics1001 | 1 | WMCS | N | N/A
cumin1001 | 1 | Infrastructure Foundations | Y - tell other SREs to use cumin2002 instead that day; announce it a few days earlier, as some cookbooks take days to run | Complete
dbprov1003 | 1 | Data Persistence | N | N/A
dbstore1005 | 1 | Analytics SREs & Data Persistence | N | N/A
druid1002 | 1 | Analytics | N | N/A
dumpsdata1003 | 1 | Service Operations & Platform Engineering | Y |
kafka-main1003 | 1 | SRE | Y - Keith to depool in advance | Complete (re-pooled)
lvs1015 | 1 | Traffic | Failover to secondary (lvs1016 in row D) by stopping pybal with puppet disabled | Complete
maps1003 | 1 | | N | N/A
mc-gp1002 | 1 | Service Operations | N | N/A
ms-backup1002 | 1 | Data Persistence (?) | N | N/A
mwlog1002 | 1 | Service Operations | N | N/A
pc1009 | 1 | SRE Data Persistence (DBAs), with support from Platform and Performance | N | N/A
sessionstore1002 | 1 | Service Operations | N | N/A
thanos-be1003 | 1 | Observability | N | N/A
thanos-fe1003 | 1 | Observability | N | N/A
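The depool/failover actions in the table (the cp and lvs1015 rows) amount to roughly the following console steps. This is an illustrative sketch, not the exact commands run; the disable-puppet reason string is made up:

```
# On each cp host: remove it from service ahead of the window
sudo depool

# On lvs1015: disable Puppet so pybal stays down, then stop it;
# the secondary (lvs1016 in row D) takes over
sudo disable-puppet "eqiad row C maintenance - T286065"
sudo systemctl stop pybal

# After the window: re-enable Puppet (which restarts pybal) and re-pool
sudo enable-puppet "eqiad row C maintenance - T286065"
sudo pool
```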

VMs on this row are as follows:

VM Name | Ganeti Host | Team | Action Required | Action Status
acmechief-test1001 | ganeti1009 | Traffic | N | N/A
acmechief1001 | ganeti1009 | Traffic | N | N/A
an-airflow1001 | ganeti1010 | Analytics SREs | N | N/A
an-tool1005 | ganeti1009 | Analytics SREs | N | N/A
an-tool1007 | ganeti1010 | Analytics SREs | N | N/A
doc1002 | ganeti1009 | | N | N/A
doh1001 | ganeti1011 | Traffic | N | N/A
etherpad1002 | ganeti1009 | Service Operations | N | N/A
flowspec1001 | ganeti1009 | Infrastructure Foundations | N | N/A
idp-test1001 | ganeti1010 | Infrastructure Foundations | N | N/A
kubemaster1002 | ganeti1010 | Service Operations | N | N/A
kubernetes1006 | ganeti1009 | Service Operations | N | N/A
kubestagetcd1006 | ganeti1012 | Service Operations | N | N/A
kubetcd1004 | ganeti1010 | Service Operations | N | N/A
logstash1009 | ganeti1010 | Observability | N | N/A
logstash1025 | ganeti1009 | Observability | N | N/A
matomo1002 | ganeti1009 | Analytics | N | N/A
miscweb1002 | ganeti1009 | Service Operations | N | N/A
ml-etcd1002 | ganeti1012 | ML team | N | N/A
mwdebug1001 | ganeti1010 | Service Operations | N | N/A
mx1001 | ganeti1009 | Infrastructure Foundations | N | N/A
ncredir1001 | ganeti1009 | Traffic | N | N/A
netflow1001 | ganeti1012 | Infrastructure Foundations | N | N/A
orespoolcounter1004 | ganeti1010 | Machine Learning SREs | N | N/A
ping1001 | ganeti1009 | Infrastructure Foundations | N | N/A
poolcounter1005 | ganeti1010 | Service Operations | N | N/A
puppetboard1001 | ganeti1010 | Infrastructure Foundations | N | N/A
puppetdb1002 | ganeti1012 | Infrastructure Foundations | Y (disable Puppet fleet-wide during maintenance) | Complete
registry1004 | ganeti1009 | Service Operations | N | N/A
rpki1001 | ganeti1009 | Infrastructure Foundations | N | N/A
seaborgium | ganeti1010 | Infrastructure Foundations | N | N/A
urldownloader1002 | ganeti1010 | Infrastructure Foundations | N | N/A

I have listed the teams, and subscribed relevant individuals to this task, based mostly on the server names and the info here: https://wikitech.wikimedia.org/wiki/SRE/Infrastructure_naming_conventions. Don't hesitate to add people I may have missed, or to remove yourself from the task if you do not need to be involved.

Kindly update the tables if action needs to be taken for any servers/VMs. Please also note the current status of any required action, and set it to 'Complete' once the work has been done.
Days Before:
  • Prepare config changes (netops)
  • Email ops list as a reminder
1h Before Window:
  • Confirm switches are in a healthy state (snapshot MAC and ARP tables, port status, buffer usage) (netops)
  • Warn people of the upcoming maintenance on IRC (netops)
After The Change:
  • Confirm switches are in a healthy state (snapshot MAC and ARP tables, port status, buffer usage, validate against prior values). (netops)
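The snapshot itself can be taken with standard Junos operational commands, for example (illustrative; the exact queue/buffer command varies by platform, and the file paths are placeholders):

```
show ethernet-switching table | save /var/tmp/mac-before.txt
show arp no-resolve | save /var/tmp/arp-before.txt
show interfaces terse | save /var/tmp/ports-before.txt
show interfaces queue | save /var/tmp/queues-before.txt
```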

Event Timeline

Marostegui added a project: Analytics.
Marostegui added a subscriber: razzi.

@Bstorm I assume we are ok with having a glitch on clouddb1017 and 1018?
@razzi dbstore1005 is on this rack.

Looking at the Ganeti VMs, they fall into four categories:

SPOF, will need a maintenance window declared:

  • an-tool1005
  • an-tool1007
  • etherpad1002 (or don't care, given that it's best-effort)
  • matomo1002

Will need to be depooled (or cordoned, or failed over to the secondary instance):

  • acmechief1001
  • doh1001
  • kubemaster1002
  • kubernetes1006
  • poolcounter1005
  • ping1001
  • puppetdb1002 (we could stop Puppet through the maintenance window)
  • orespoolcounter1004
  • ncredir1001
  • registry1004
  • rpki1001
  • urldownloader1002

Some temporary unavailability is fine:

  • acmechief-test1001
  • doc1002
  • miscweb1002
  • ml-etcd1002
  • mwdebug1001
  • mx1001
  • an-airflow1001
  • flowspec1001
  • idp-test1001
  • kubestagetcd1006
  • kubetcd1004
  • netflow1001
  • puppetboard1001
  • seaborgium

A failover/depool is needed in case Grafana/Logstash must be available during the maintenance:

  • logstash1009
  • logstash1025

@aborrero does cloudgw require manual failover?

It doesn't require manual failover, but we could depool (or force a failover of) the server beforehand to soften any potential operational impact.

cmooney updated the task description.

@Bstorm / @aborrero as mentioned on IRC I messed up with the list of servers here, inadvertently including those in the row connected to cloudsw1-c8-eqiad, which will not be affected by this.

I've removed the entries for cloudcephosd / cloudvirt / cloudgw1001 from the above list now to correct this. Apologies for the confusion, but hope that makes things a little simpler for you.

Still need to confirm the window with Advancement, but it is looking ok right now. There will be some work on the FR-Tech side to ensure donors aren't impacted by any possible extended downtime.

> @Bstorm / @aborrero as mentioned on IRC I messed up with the list of servers here, inadvertently including those in the row connected to cloudsw1-c8-eqiad, which will not be affected by this.
>
> I've removed the entries for cloudcephosd / cloudvirt / cloudgw1001 from the above list now to correct this. Apologies for the confusion, but hope that makes things a little simpler for you.

That does!

@Dwisehaupt I have to apologize for a newbie error I made here. The FR-Tech servers' interfaces don't show what they're connected to in Netbox. I incorrectly assumed they were connected to the same ASWs as the other devices in the row (albeit on separate vlans behind the PFWs).

Faidon spotted my error and put me right: the hosts below are actually connected to fasw-c1a-eqiad and fasw-c1b-eqiad, and thus won't be affected by this change. I've removed them from the list now. My apologies for the confusion, and thanks for taking the time to review.

civi1001
fran1001
frauth1001
frban1001
frbast1001
frdata1002
frdb1002
frdb1003
frdb1004
frdev1001
frlog1001
frmon1001
frmx1001
frnetmon1001
frpig1001
frpm1001
frqueue1003
frqueue1004
pay-lvs1001
pay-lvs1002
payments1001
payments1005
payments1007
payments1008
ema updated the task description.

@cmooney Not a problem. I have updated the task to remove the pre/post tasks we were looking at. Always good for us to think about failure scenarios anyway. :)

Switching cloudmetrics to just eat the brief outage. I don't think it will be a big deal. We can just check it after.

The WMCS-owned dbproxy1018 and 1019 make up the entire cluster, so that's just an outage no matter what for wikireplicas. It will just need downtime, communication and checks afterward.

cmooney updated the task description.

Change 705384 had a related patch set uploaded (by ArielGlenn; author: ArielGlenn):

[operations/puppet@production] swap roles of dumpsdata1001, 1003 in prep for T286065

https://gerrit.wikimedia.org/r/705384

Change 705384 merged by ArielGlenn:

[operations/puppet@production] swap roles of dumpsdata1001, 1003 in prep for T286065

https://gerrit.wikimedia.org/r/705384

Change 705789 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Switch m3-master

https://gerrit.wikimedia.org/r/705789

Change 705789 merged by Marostegui:

[operations/dns@master] wmnet: Switch m3-master

https://gerrit.wikimedia.org/r/705789

I have switched m3-master from dbproxy1020 to dbproxy1016: https://gerrit.wikimedia.org/r/705789

cmooney updated the task description.

Mentioned in SAL (#wikimedia-operations) [2021-07-22T14:37:42Z] <mmandere> depool cp108[3-6].eqiad.wmnet - T286065

Icinga downtime set by mmandere@cumin2002 for 1:00:00 4 host(s) and their services with reason: Eqiad row C maintenance

cp[1083-1086].eqiad.wmnet

Icinga downtime set by mmandere@cumin2002 for 1:00:00 1 host(s) and their services with reason: Eqiad row C maintenance

lvs1015.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2021-07-22T15:11:20Z] <mmandere> pool cp108[3-6].eqiad.wmnet - T286065

All went very well with the change. This time I ran a rapid ping from the CR to see whether any packet loss was observed; some loss was detected, but it was extremely brief:

cmooney@re0.cr1-eqiad> ping 10.64.32.162 rapid count 100000   
PING 10.64.32.162 (10.64.32.162): 56 data bytes
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
<#####  output removed #####>
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
--- 10.64.32.162 ping statistics ---
100000 packets transmitted, 99998 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.262/0.503/35.295/0.519 ms
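For the record, the loss implied by those counters is tiny; Junos rounds the displayed percentage down to 0%, but the raw numbers above give:

```python
# Loss implied by the ping statistics above.
transmitted = 100_000
received = 99_998

lost = transmitted - received
loss_pct = 100 * lost / transmitted
print(f"{lost} packets lost ({loss_pct:.3f}%)")  # 2 packets lost (0.002%)
```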

Resolving task.

Change 708724 had a related patch set uploaded (by ArielGlenn; author: ArielGlenn):

[operations/puppet@production] Swap dumpsdata roles back now that network maintenance is complete

https://gerrit.wikimedia.org/r/708724

Change 708724 merged by ArielGlenn:

[operations/puppet@production] Swap dumpsdata roles back now that network maintenance is complete

https://gerrit.wikimedia.org/r/708724