Adjust egress buffer allocations on ToR switches
Closed, ResolvedPublic

Assigned To
Authored By
cmooney
Jun 8 2021, 6:40 PM
Referenced Files
F34651578: image.png
Sep 23 2021, 11:55 AM
F34651563: image.png
Sep 23 2021, 11:55 AM
F34651564: image.png
Sep 23 2021, 11:55 AM
F34651559: image.png
Sep 23 2021, 11:55 AM
F34651560: image.png
Sep 23 2021, 11:55 AM

Description

Background

While investigating poor performance for backup traffic between eqiad and codfw, it was discovered that packets were being dropped by asw devices in eqiad (see T274234).

Average usage is well within link capacity, but it is likely we are seeing microbursts as described here:

https://kb.juniper.net/InfoCenter/index?page=content&id=KB36095
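To illustrate why an average well within link capacity can still coincide with drops, here is a minimal sketch of a tail-drop egress queue. Everything in it is an assumption made for illustration (the 10 Gbit/s link speed, the 1 ms tick, the traffic shapes and the buffer sizes); none of the figures are measurements from the asw devices.

```python
def simulate_egress_queue(arrivals, drain_per_tick, buffer_bytes):
    """Simulate a tail-drop FIFO egress queue.

    arrivals: bytes offered to the port in each tick
    drain_per_tick: bytes the link can transmit per tick
    buffer_bytes: buffer space available to the queue
    Returns (dropped_bytes, peak_queue_bytes).
    """
    queue = dropped = peak = 0
    for offered in arrivals:
        backlog = queue + offered
        backlog -= min(backlog, drain_per_tick)  # the link transmits what it can
        if backlog > buffer_bytes:               # the remainder must fit in the buffer
            dropped += backlog - buffer_bytes
            backlog = buffer_bytes
        queue = backlog
        peak = max(peak, queue)
    return dropped, peak


# Hypothetical figures: a 10 Gbit/s link observed in 1 ms ticks for one second.
DRAIN_PER_TICK = 1_250_000  # 10 Gbit/s is 1.25 MB per millisecond
TICKS = 1000

# Same total volume (40% average utilisation), two different traffic shapes.
smooth = [500_000] * TICKS
bursty = ([2_500_000] * 2 + [0] * 8) * (TICKS // 10)  # 2 ms bursts at twice line rate

for label, buffer_bytes in (("small buffer", 1_000_000), ("large buffer", 3_000_000)):
    print(label, "smooth:", simulate_egress_queue(smooth, DRAIN_PER_TICK, buffer_bytes))
    print(label, "bursty:", simulate_egress_queue(bursty, DRAIN_PER_TICK, buffer_bytes))
```

Both traffic shapes carry the same number of bytes per second, so a per-second or per-minute utilisation graph cannot tell them apart. In this toy model only the bursty shape overflows the smaller buffer, and a larger buffer absorbs the same bursts without loss.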

Mitigation - Buffer Allocation

By default the switches reserve 50% of buffer memory for a "lossless" traffic class we don't use. It seems we can re-partition the space to dedicate the majority of it to best-effort instead.

https://www.juniper.net/documentation/us/en/software/junos/traffic-mgmt-qfx/topics/example/cos-shared-buffer-allocation-lossy-ucast-qfx-series-configuring.html
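As a rough sketch of what the re-partitioning amounts to, the helper below validates a three-way egress split and renders Junos-style `set` lines. The statement syntax is modelled on the Juniper example linked above and should be checked against that documentation. The 5/75/20 split is an illustrative assumption, not the values deployed under this task, and the rule that the shares must total exactly 100 is likewise an assumption of this sketch.

```python
DEFAULT_LOSSLESS_PERCENT = 50  # default lossless reservation, per the description above


def render_egress_partition(lossless, best_effort, multicast):
    """Return Junos-style 'set' lines for an egress shared-buffer partition.

    The three shares are percentages and must total exactly 100
    (an assumption of this sketch).
    """
    shares = {
        "lossless": lossless,
        "best-effort": best_effort,
        "multicast": multicast,
    }
    for name, pct in shares.items():
        if not 0 <= pct <= 100:
            raise ValueError(f"{name} share must be between 0 and 100, got {pct}")
    total = sum(shares.values())
    if total != 100:
        raise ValueError(f"partition shares must total 100, got {total}")
    # Statement path assumed from the linked Juniper example; verify before use.
    prefix = "set class-of-service shared-buffer egress buffer-partition"
    return [f"{prefix} {name} percent {pct}" for name, pct in shares.items()]


# Hypothetical split that moves most of the unused lossless share to best-effort.
for line in render_egress_partition(lossless=5, best_effort=75, multicast=20):
    print(line)
```

Validating the split before rendering means a typo in one percentage is caught when the config is generated rather than when it is committed on a switch.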

This change should be rolled out to all EX/QFX switches across the network. Scheduled as follows:

Eqiad - Completed without issue.
| Row | Date | Time | Task | Status |
| --- | --- | --- | --- | --- |
| D | Tues July 20th 2021 | 15:00 UTC (08:00 PDT / 11:00 EDT / 17:00 CEST) | T286069 | Complete - No Issues |
| C | Thurs July 22nd 2021 | 15:00 UTC (08:00 PDT / 11:00 EDT / 17:00 CEST) | T286065 | Complete - No Issues |
| B | Tues July 27th 2021 | 15:00 UTC (08:00 PDT / 11:00 EDT / 17:00 CEST) | T286061 | Complete - No Issues |
| A | Thurs July 29th 2021 | 15:00 UTC (08:00 PDT / 11:00 EDT / 17:00 CEST) | T286032 | Complete - No Issues |

CODFW - TBC

To be scheduled following DC switchover back to eqiad in September.

CloudSW - Eqiad

Outbound drops are also significant across multiple interfaces of the "cloudsw" devices connecting WMCS endpoints in eqiad. The following two switches require the same change:

| Switch | Date | Time | Task | Status |
| --- | --- | --- | --- | --- |
| cloudsw1-c8-eqiad | Thurs Aug 5th 2021 | 10:00 UTC | T288036 | |
| cloudsw1-d5-eqiad | Thurs Aug 5th 2021 | 11:00 UTC | T288037 | |

Event Timeline

jbond triaged this task as Medium priority.Jun 21 2021, 2:40 PM

Change 701499 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Adding 'quality-of-service' template for use on QFX/EX series switches.

https://gerrit.wikimedia.org/r/701499

cmooney updated the task description. (Show Details)

What is the expected length of the service interruption on any of these days? I'm looking at the impact on the dumpsdata/snapshot hosts, and depending on the length of the outage, we might be able to get by with some minor shuffling of hosts.

I’ve no reason to think anything other than what’s in the child tasks at this point. Having put out feelers externally, I’ve got some anecdotal reports that a few seconds is correct to expect.

The exact duration of the impact is unknown at this time - we hope to be able to test on some real switches before the date and get a firm indication. Best estimate is it will be in the order of seconds, certainly no longer than a minute, but we should plan for up to a 5-minute interruption, and be aware as always that there is a small potential something will go wrong and cause a longer disturbance.

This is quite good, even taking into account the risk that something goes awry. Thanks!

@cmooney I haven't been able to get ahold of you this week, so leaving the comment I left on IRC here:
My preferred order for the switch maintenance would be: row d, c, b, a (again, this is what would work best for DBAs, as it would give us more time to work on row a and row b replacements)

@Marostegui Ok thanks for the comments. I've not been feeling so good so hadn't been online.

Will review Monday against feedback from other teams but I'm sure we can accommodate. Also please advise if you expect timelines to be workable, or if they are a little tight we can look at pushing out so everyone has time to prepare.

I hope you get better soon. I am off next week but someone from the team will contact you next week.
From my point of view, I think if we follow that row order, we should be ok with the given dates.

cmooney updated the task description. (Show Details)

With the new schedule I think I can swap one dumpsdata host and one snapshot host and avoid any impact whatsoever on XML/SQL dumps. This is great, thank you!

@cmooney should we send out an email about this to ops@ and possibly add those times/dates to the maintenance calendar? Thank you!

@jijiki thanks yes good suggestions both. I will send a mail to ops@ later today as a reminder for people to review.

In terms of the maintenance calendar, do you mean the "Ops vendor maintenance" one, or something else? I tried to add it to that one but I don't think I have permissions. I'm sure I can get that sorted though.

Change 701499 merged by jenkins-bot:

[operations/homer/public@master] Adding 'quality-of-service' template for use on QFX/EX series switches.

https://gerrit.wikimedia.org/r/701499

Change 705722 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Previous change had incorrect file extention on the 'class-of-service' config template. Correcting here.

https://gerrit.wikimedia.org/r/705722

Change 705722 merged by jenkins-bot:

[operations/homer/public@master] Previous change had incorrect file extention on the 'class-of-service' config template. Correcting here.

https://gerrit.wikimedia.org/r/705722

Change 706568 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Adding flag for asw2-c-eqiad to configure class-of-service shared buffer config. This will keep it in line with config added manually in T286065.

https://gerrit.wikimedia.org/r/706568

Change 706568 merged by jenkins-bot:

[operations/homer/public@master] Adding flag for asw2-c-eqiad to configure class-of-service shared buffer config. This will keep it in line with config added manually in T286065.

https://gerrit.wikimedia.org/r/706568

Change 708784 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Adding flag for asw2-a-eqiad and asw2-b-eqiad to configure class-of-service shared buffer config. This will keep it in line with config added manually under T286061 and T286032

https://gerrit.wikimedia.org/r/708784

cmooney updated the task description. (Show Details)

Change 708784 merged by jenkins-bot:

[operations/homer/public@master] Adding flag for asw2-a-eqiad and asw2-b-eqiad to configure class-of-service shared buffer config. This will keep it in line with config added manually under T286061 and T286032

https://gerrit.wikimedia.org/r/708784

Mentioned in SAL (#wikimedia-operations) [2021-07-30T08:56:08Z] <topranks> running homer against asw2-a-eqiad and asw2-b-eqiad to bring homer in line with manual config added for buffer mem. T284592

OK, we're about a week after the DC switchover back to eqiad, so we can draw some conclusions on the results of the changes in eqiad.

Overall there definitely seem to be fewer discards on the relevant links in the past week, compared to earlier in the year. In some cases the graphs are somewhat skewed by unrelated bursts of activity, but in general the pattern is fairly clear (comparing the last week with stats prior to July):

asw2-a-eqiad:

image.png (742×1 px, 116 KB)

asw2-b-eqiad:

image.png (724×1 px, 100 KB)

asw2-c-eqiad:

image.png (720×1 px, 371 KB)

asw2-d-eqiad:

image.png (724×1 px, 388 KB)

The original issue that made us look at this was low throughput on the regular transfers from eqiad backup hosts to those in codfw. Checking the most recent graphs, it seems speeds for these have not degraded back to the levels observed in May/June. This is despite link usage in eqiad returning to normal following last week's DC switch back, and the resulting increase in discards on the switch->cr links (albeit fewer than before, as shown above):

image.png (840×1 px, 115 KB)

We could wait longer to let the new pattern emerge more fully, but broadly I think we can say:

  • The change has definitely reduced the number of discards we observe day-to-day on these links.
  • The reduction has improved real-world performance as seen for the backup hosts.
  • There is still a significant number of drops, which will unquestionably be causing degraded performance for services.
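One way to put a number on the before/after comparison is to turn cumulative discard counter samples into an average rate and compare the two periods. The sketch below assumes samples of the form (unix time, cumulative discard counter) and ignores counter wraps and resets. The sample values are hypothetical and are not taken from the graphs above.

```python
def discard_rate(samples):
    """Average discards per second between the first and last sample.

    samples: list of (unix_time_seconds, cumulative_discard_counter) tuples,
    ordered by time. Counter wraps and resets are not handled.
    """
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (c1 - c0) / (t1 - t0)


def percent_reduction(before, after):
    """Percentage by which the 'after' rate is lower than the 'before' rate."""
    if before <= 0:
        raise ValueError("'before' rate must be positive")
    return 100.0 * (before - after) / before


# Hypothetical one-hour windows on the same interface, before and after the change.
before_samples = [(0, 1_000), (3_600, 37_000)]
after_samples = [(0, 50_000), (3_600, 64_400)]

before_rate = discard_rate(before_samples)
after_rate = discard_rate(after_samples)
print(before_rate, after_rate, percent_reduction(before_rate, after_rate))
```

Comparing like-for-like windows (same interface, similar time of day and traffic level) matters more than the arithmetic itself, since unrelated bursts of activity can skew either period.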

I am creating a parent task for this, so we can track the overall issue of these drops, and any other actions we may wish to take to address the continuing problem.

Change 929689 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Remove optional var to set COS buffers for QFX/EX switches

https://gerrit.wikimedia.org/r/929689

Mentioned in SAL (#wikimedia-operations) [2023-06-14T13:42:47Z] <topranks> adjusting port buffer partition asw-a-codfw T284592

Mentioned in SAL (#wikimedia-operations) [2023-06-14T13:46:01Z] <topranks> adjusting port buffer partition asw-b-codfw T284592

Mentioned in SAL (#wikimedia-operations) [2023-06-14T13:49:33Z] <topranks> adjusting port buffer partition asw-c-codfw T284592

Mentioned in SAL (#wikimedia-operations) [2023-06-14T13:53:11Z] <topranks> adjusting port buffer partition asw-d-codfw T284592

Mentioned in SAL (#wikimedia-operations) [2023-06-14T13:57:24Z] <topranks> adjusting port buffer partition asw2-ulsfo T284592

Mentioned in SAL (#wikimedia-operations) [2023-06-14T13:58:24Z] <topranks> adjusting port buffer partition asw1-eqsin T284592

Mentioned in SAL (#wikimedia-operations) [2023-06-14T13:59:38Z] <topranks> adjusting port buffer partition asw2-esams T284592

Change 929689 merged by jenkins-bot:

[operations/homer/public@master] Remove optional var to set COS buffers for QFX/EX switches

https://gerrit.wikimedia.org/r/929689

cmooney claimed this task.

Change is now live on all relevant Juniper devices.

Change 930754 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Do not push class-of-service buffer partition to ex4300

https://gerrit.wikimedia.org/r/930754

Change 930754 merged by jenkins-bot:

[operations/homer/public@master] Do not push class-of-service buffer partition to ex4300

https://gerrit.wikimedia.org/r/930754