Adjust egress buffer allocations on ToR switches
Closed, Resolved (Public)

Assigned To

Authored By

	cmooney
	Jun 8 2021, 6:40 PM

Description

Background

While investigating poor performance for backup traffic between eqiad and codfw, we discovered that packets were being dropped by asw devices in eqiad (see T274234).

Average usage is well within link capacity, but it is likely we are seeing microbursts as described here:

https://kb.juniper.net/InfoCenter/index?page=content&id=KB36095

Mitigation - Buffer Allocation

By default the switches reserve 50% of buffer memory for a "lossless" traffic class we don't use. It seems we can re-partition the space to dedicate the majority of it to best-effort instead.

https://www.juniper.net/documentation/us/en/software/junos/traffic-mgmt-qfx/topics/example/cos-shared-buffer-allocation-lossy-ucast-qfx-series-configuring.html
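
To illustrate why a larger best-effort partition can help even when average usage is well within capacity, here is a minimal Python sketch of a tail-drop queue fed by bursty traffic. All numbers (burst size, drain rate, an 80-packet pool split 50% versus 95%) are hypothetical and chosen only for illustration. They are not measurements from these switches, and real shared-buffer behaviour on QFX/EX hardware is more complex than a single FIFO.

```python
def simulate_drops(arrivals, drain_per_tick, buffer_size):
    """Return the number of packets dropped by a tail-drop FIFO queue."""
    queue = 0
    dropped = 0
    for arriving in arrivals:
        # Packets that do not fit in the remaining buffer space are dropped.
        space = buffer_size - queue
        accepted = min(arriving, space)
        dropped += arriving - accepted
        queue += accepted
        # The egress port drains at a fixed rate every tick.
        queue = max(0, queue - drain_per_tick)
    return dropped


# Hypothetical traffic: a 60-packet burst every 10 ticks, idle otherwise.
# Average offered load is 6 packets/tick, below the 10 packets/tick drain rate.
arrivals = ([60] + [0] * 9) * 100

# Hypothetical 80-packet pool: 50% (40 packets) vs 95% (76 packets) for best-effort.
drops_half = simulate_drops(arrivals, drain_per_tick=10, buffer_size=40)
drops_most = simulate_drops(arrivals, drain_per_tick=10, buffer_size=76)
print(drops_half, drops_most)
```

With these assumed values the average offered load is 60% of the drain rate, yet the smaller buffer drops 20 packets from every burst while the larger one absorbs each burst in full.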

This change should be rolled out to all EX/QFX switches across the network. Scheduled as follows:

Eqiad - Completed without issue.

Row	Date	Time	Task	Status
D	~~Tues July 20th 2021~~	~~15:00 UTC (08:00 PDT / 11:00 EDT / 17:00 CEST)~~	~~T286069~~	Complete - No Issues
C	~~Thurs July 22nd 2021~~	~~15:00 UTC (08:00 PDT / 11:00 EDT / 17:00 CEST)~~	~~T286065~~	Complete - No Issues
B	~~Tues July 27th 2021~~	~~15:00 UTC (08:00 PDT / 11:00 EDT / 17:00 CEST)~~	~~T286061~~	Complete - No Issues
A	~~Thurs July 29th 2021~~	~~15:00 UTC (08:00 PDT / 11:00 EDT / 17:00 CEST)~~	~~T286032~~	Complete - No Issues

CODFW - TBC

To be scheduled following DC switchover back to eqiad in September.

CloudSW - Eqiad

Outbound drops of the same kind are also seen at a non-trivial rate across multiple interfaces of the "cloudsw" devices connecting WMCS endpoints in eqiad. The following two switches require the same change:

Switch	Date	Time	Task	Status
~~cloudsw1-c8-eqiad~~	~~Thurs Aug 5th 2021~~	~~10:00 UTC~~	T288036
~~cloudsw1-d5-eqiad~~	~~Thurs Aug 5th 2021~~	~~11:00 UTC~~	T288037

Details

Subject	Repo	Branch	Lines +/-
Do not push class-of-service buffer partition to ex4300	operations/homer/public	master	+1 -1
Remove optional var to set COS buffers for QFX/EX switches	operations/homer/public	master	+1 -30
Adding flag for asw2-a-eqiad and asw2-b-eqiad to configure class-of-service shared buffer config. This will keep it in line with config added manually under T286061 and T286032	operations/homer/public	master	+3 -0
Adding flag for asw2-c-eqiad to configure class-of-service shared buffer config. This will keep it in line with config added manually in T286065.	operations/homer/public	master	+2 -0
Previous change had incorrect file extention on the 'class-of-service' config template. Correcting here.	operations/homer/public	master	+0 -0
Adding 'quality-of-service' template for use on QFX/EX series switches.	operations/homer/public	master	+39 -1


Related Objects

Status	Assigned	Task
Resolved	cmooney	T291627 Packet Drops on Eqiad ASW -> CR uplinks
Resolved	cmooney	T284592 Adjust egress buffer allocations on ToR switches
Resolved	None	T286032 Switch buffer re-partition - Eqiad Row A
Resolved	• Kormat	T284622 Rename dbstore1004 to db1183 and place it on m5
Resolved	Jclark-ctr	T286468 Relabel dbstore1004 to db1183
Resolved	• Marostegui	T286042 Move db1124 and db1125 to misc services temporarily
Resolved	• Marostegui	T286329 Move db1124 and db1125 back to test-cluster section
Declined	None	T286063 Fail over dumps web services to labstore1007 prior to July 20th network disruption
Declined	None	T286064 Fail over clouddb1013 and clouddb1014 prior to network disruption on Row A
Resolved	None	T286061 Switch buffer re-partition - Eqiad Row B
Declined	None	T286615 Widespread cloud ceph and hypervisor issues possible with reconfiguration of Eqiad Row B
Declined	None	T286616 Fail over clouddb1015 and clouddb1016 for network switch changes
Resolved	None	T286065 Switch buffer re-partition - Eqiad Row C
Resolved	aborrero	T286601 Stop some services before and healthcheck labstore1004/5 following row C network change
Resolved	None	T286613 check for ldap issues regarding seaborgium network blip for row C configuration change
Resolved	aborrero	T286614 Communicate wikireplicas outage and healthcheck the system after Eqiad Row C network changes
Resolved	None	T286069 Switch buffer re-partition - Eqiad Row D
Resolved	Andrew	T286598 Fail over clouddb1019, clouddb1020 for switch changes
Resolved	• Bstorm	T286599 Downtime? and healthcheck cloudstore1008/9 following row D network change
Resolved	aborrero	T286600 failover cloud NFS from labstore1007 to labstore1006
Resolved	None	T288036 Switch buffer re-partition - cloudsw1-c8-eqiad
Resolved	None	T288037 Switch buffer re-partition - cloudsw1-d5-eqiad

Event Timeline


Restricted Application added a subscriber: Aklapper. Jun 8 2021, 6:40 PM

ayounsi mentioned this in T277340: (Need By: TBD) rack/setup/install (2) new 10G switches. Jun 10 2021, 8:43 AM

Maintenance_bot added a project: SRE. Jun 10 2021, 8:45 AM

jbond triaged this task as Medium priority. Jun 21 2021, 2:40 PM

Aklapper added a project: Infrastructure-Foundations. Jun 21 2021, 8:59 PM

LSobanski subscribed. Jun 24 2021, 10:50 AM

Change 701499 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Adding 'quality-of-service' template for use on QFX/EX series switches.

https://gerrit.wikimedia.org/r/701499

gerritbot added a project: Patch-For-Review. Jun 25 2021, 10:02 AM

cmooney updated the task description. Jul 2 2021, 10:11 AM

cmooney updated the task description. Jul 2 2021, 11:41 AM

cmooney updated the task description.

ArielGlenn subscribed. Jul 2 2021, 3:04 PM

cmooney updated the task description. Jul 2 2021, 3:40 PM

cmooney updated the task description. Jul 2 2021, 4:28 PM

cmooney updated the task description. Jul 2 2021, 4:55 PM

What is the expected length of service interruption for any of these days? I'm looking at the impact on the dumpsdata/snapshot hosts, and depending on the length of the outage, we might be able to get by with some minor shuffling of hosts around.

I’ve no reason to think anything other than what’s in the child tasks at this point. Having put out feelers externally, I’ve got some anecdotal reports that a few seconds is correct to expect.

The exact duration of the impact is unknown at this time - we hope to be able to test on some real switches before the date and get a firm indication. Best estimate is it will be in the order of seconds, certainly no longer than a minute, but we should plan for up to a 5-minute interruption, and be aware as always that there is a small potential something will go wrong and cause a longer disturbance.

In T284592#7198669, @cmooney wrote:

I’ve no reason to think anything other than what’s in the child tasks at this point. Having put out feelers externally, I’ve got some anecdotal reports that a few seconds is correct to expect.

The exact duration of the impact is unknown at this time - we hope to be able to test on some real switches before the date and get a firm indication. Best estimate is it will be in the order of seconds, certainly no longer than a minute, but we should plan for up to a 5-minute interruption, and be aware as always that there is a small potential something will go wrong and cause a longer disturbance.

This is quite good, even taking into account the risk that something goes awry. Thanks!

@cmooney I haven't been able to get ahold of you this week, so I'm leaving the comment I left on IRC here:
My preferred order for the switch maintenance would be: row d, c, b, a (again, this is what would work best for DBAs, as it would give us more time to work on row a and row b replacements)

@Marostegui Ok thanks for the comments. I've not been feeling so good so hadn't been online.

Will review Monday against feedback from other teams but I'm sure we can accommodate. Also please advise if you expect timelines to be workable, or if they are a little tight we can look at pushing out so everyone has time to prepare.

I hope you get better soon. I am off next week but someone from the team will contact you next week.
From my point of view, I think if we follow that row order, we should be ok with the given dates.

cmooney updated the task description. Jul 12 2021, 8:04 AM

cmooney updated the task description.

With the new schedule I think I can swap one dumpsdata host and one snapshot host and avoid any impact whatsoever on XML/SQL dumps. This is great, thank you!

BTullis subscribed. Jul 12 2021, 9:58 AM

@cmooney should we send out an email about this to ops@ and possibly add those times/dates to the maintenance calendar? Thank you!

@jijiki thanks yes good suggestions both. I will send a mail to ops@ later today as a reminder for people to review.

In terms of the maintenance calendar, do you mean the "Ops vendor maintenance" one? Or something else? I tried to add it to that one but I don't think I have permissions, though I'm sure I can sort that out.

cmooney updated the task description. Jul 20 2021, 3:53 PM

Change 701499 merged by jenkins-bot:

[operations/homer/public@master] Adding 'quality-of-service' template for use on QFX/EX series switches.

https://gerrit.wikimedia.org/r/701499

jenkins-bot mentioned this in rOHPUec153cf088ca: Adding 'quality-of-service' template for use on QFX/EX series switches. Jul 20 2021, 4:35 PM

Change 705722 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Previous change had incorrect file extention on the 'class-of-service' config template. Correcting here.

https://gerrit.wikimedia.org/r/705722

Change 705722 merged by jenkins-bot:

[operations/homer/public@master] Previous change had incorrect file extention on the 'class-of-service' config template. Correcting here.

https://gerrit.wikimedia.org/r/705722

cmooney mentioned this in rOHPU37844befc068: Previous change had incorrect file extention on the 'class-of-service' config…. Jul 20 2021, 5:01 PM

Maintenance_bot removed a project: Patch-For-Review. Jul 20 2021, 5:10 PM

cmooney closed subtask T286069: Switch buffer re-partition - Eqiad Row D as Resolved. Jul 21 2021, 8:35 AM

RhinosF1 subscribed. Jul 21 2021, 8:38 AM

cmooney mentioned this in T274234: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002. Jul 22 2021, 9:25 AM

cmooney updated the task description. Jul 22 2021, 3:23 PM

Change 706568 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Adding flag for asw2-c-eqiad to configure class-of-service shared buffer config. This will keep it in line with config added manually in T286065.

https://gerrit.wikimedia.org/r/706568

gerritbot added a project: Patch-For-Review. Jul 22 2021, 4:17 PM

Change 706568 merged by jenkins-bot:

[operations/homer/public@master] Adding flag for asw2-c-eqiad to configure class-of-service shared buffer config. This will keep it in line with config added manually in T286065.

https://gerrit.wikimedia.org/r/706568

cmooney mentioned this in rOHPU29cc8cbafbbd: Adding flag for asw2-c-eqiad to configure class-of-service shared buffer. Jul 22 2021, 4:58 PM

cmooney closed subtask T286065: Switch buffer re-partition - Eqiad Row C as Resolved. Jul 22 2021, 5:02 PM

Maintenance_bot removed a project: Patch-For-Review. Jul 22 2021, 5:10 PM

cmooney updated the task description. Jul 27 2021, 3:19 PM

cmooney closed subtask T286061: Switch buffer re-partition - Eqiad Row B as Resolved. Jul 27 2021, 5:09 PM

Change 708784 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Adding flag for asw2-a-eqiad and asw2-b-eqiad to configure class-of-service shared buffer config. This will keep it in line with config added manually under T286061 and T286032

https://gerrit.wikimedia.org/r/708784

gerritbot added a project: Patch-For-Review. Jul 29 2021, 3:23 PM

cmooney updated the task description. Jul 29 2021, 3:27 PM

cmooney updated the task description.

cmooney closed subtask T286032: Switch buffer re-partition - Eqiad Row A as Resolved. Jul 30 2021, 8:12 AM

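Change 708784 merged by jenkins-bot:

[operations/homer/public@master] Adding flag for asw2-a-eqiad and asw2-b-eqiad to configure class-of-service shared buffer config. This will keep it in line with config added manually under T286061 and T286032

https://gerrit.wikimedia.org/r/708784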

cmooney mentioned this in rOHPU6b2fd5233929: Adding flag for asw2-a-eqiad and asw2-b-eqiad to configure class-of-service. Jul 30 2021, 8:41 AM

Mentioned in SAL (#wikimedia-operations) [2021-07-30T08:56:08Z] <topranks> running homer against asw2-a-eqiad and asw2-b-eqiad to bring homer in line with manual config added for buffer mem. T284592

Maintenance_bot removed a project: Patch-For-Review. Jul 30 2021, 9:10 AM

cmooney added a subtask: T288036: Switch buffer re-partition - cloudsw1-c8-eqiad. Aug 4 2021, 8:29 AM

cmooney added a subtask: T288037: Switch buffer re-partition - cloudsw1-d5-eqiad. Aug 4 2021, 8:32 AM

cmooney updated the task description. Aug 4 2021, 8:41 AM

cmooney updated the task description. Aug 5 2021, 9:59 AM

cmooney closed subtask T288036: Switch buffer re-partition - cloudsw1-c8-eqiad as Resolved. Aug 5 2021, 10:29 AM

cmooney closed subtask T288037: Switch buffer re-partition - cloudsw1-d5-eqiad as Resolved. Aug 5 2021, 11:23 AM

cmooney updated the task description. Aug 5 2021, 11:25 AM

ayounsi moved this task from Backlog to In Progress on the Infrastructure-Foundations board. Aug 12 2021, 9:23 AM

cmooney mentioned this in T291385: TCP retransmissions in eqiad and codfw. Sep 20 2021, 2:13 PM

ayounsi moved this task from In Progress to Up Next on the Infrastructure-Foundations board. Sep 21 2021, 1:11 PM

Ok, we're now about a week past the DC switchover back to eqiad, so we can draw some conclusions on the results of the changes in eqiad.

Overall there definitely seem to be fewer discards on the relevant links in the past week, compared to earlier in the year. In some cases the graphs are somewhat skewed by unrelated bursts of activity, but in general the pattern is fairly clear (comparing the last week with stats prior to July):

(Graphs for asw2-a-eqiad, asw2-b-eqiad, asw2-c-eqiad and asw2-d-eqiad were attached here as images.)

The original issue that made us look at this was low throughput on the regular transfers from eqiad backup hosts to those in codfw. Checking the most recent graphs, it seems speeds for these have not degraded back to the levels observed in May/June. This is despite link usage in eqiad returning to normal following last week's DC switch back, and the resulting increase in discards on the switch->cr links (albeit fewer than before, as noted above).

We could wait longer to let the new pattern emerge, but broadly I think we can say:

The change has definitely reduced the number of discards we observe day-to-day on these links.
The reduction has improved real-world performance as seen for the backup hosts.
There is still a significant number of drops, which will unquestionably be causing degraded performance for services.

I am creating a parent task for this, so we can track the overall issue of these drops, and any other actions we may wish to take to address the continuing problem.

cmooney mentioned this in T291627: Packet Drops on Eqiad ASW -> CR uplinks. Sep 23 2021, 12:02 PM

cmooney added a parent task: T291627: Packet Drops on Eqiad ASW -> CR uplinks.

ayounsi moved this task from Backlog to This quarter on the netops board. Jul 5 2022, 6:27 AM

joanna_borun moved this task from Up Next to Backlog on the Infrastructure-Foundations board. May 26 2023, 2:08 PM

Change 929689 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Remove optional var to set COS buffers for QFX/EX switches

https://gerrit.wikimedia.org/r/929689

gerritbot added a project: Patch-For-Review. Jun 13 2023, 12:10 PM

Mentioned in SAL (#wikimedia-operations) [2023-06-14T13:42:47Z] <topranks> adjusting port buffer partition asw-a-codfw T284592

Mentioned in SAL (#wikimedia-operations) [2023-06-14T13:46:01Z] <topranks> adjusting port buffer partition asw-b-codfw T284592

Mentioned in SAL (#wikimedia-operations) [2023-06-14T13:49:33Z] <topranks> adjusting port buffer partition asw-c-codfw T284592

Mentioned in SAL (#wikimedia-operations) [2023-06-14T13:53:11Z] <topranks> adjusting port buffer partition asw-d-codfw T284592

Mentioned in SAL (#wikimedia-operations) [2023-06-14T13:57:24Z] <topranks> adjusting port buffer partition asw2-ulsfo T284592

Mentioned in SAL (#wikimedia-operations) [2023-06-14T13:58:24Z] <topranks> adjusting port buffer partition asw1-eqsin T284592

Mentioned in SAL (#wikimedia-operations) [2023-06-14T13:59:38Z] <topranks> adjusting port buffer partition asw2-esams T284592

Change 929689 merged by jenkins-bot:

[operations/homer/public@master] Remove optional var to set COS buffers for QFX/EX switches

https://gerrit.wikimedia.org/r/929689

Maintenance_bot removed a project: Patch-For-Review. Jun 14 2023, 2:10 PM

Change is now live on all relevant Juniper devices.

Change 930754 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Do not push class-of-service buffer partition to ex4300

https://gerrit.wikimedia.org/r/930754

gerritbot added a project: Patch-For-Review. Jun 16 2023, 9:22 AM

Change 930754 merged by jenkins-bot:

[operations/homer/public@master] Do not push class-of-service buffer partition to ex4300

https://gerrit.wikimedia.org/r/930754

Maintenance_bot removed a project: Patch-For-Review. Jun 16 2023, 11:10 AM

	F34651578: image.png
	Sep 23 2021, 11:55 AM

	F34651563: image.png
	Sep 23 2021, 11:55 AM

	F34651564: image.png
	Sep 23 2021, 11:55 AM

	F34651559: image.png
	Sep 23 2021, 11:55 AM
