
Adjust egress buffer allocations on ToR switches
Open, Medium, Public

Description

Background

While investigating poor performance for backup traffic between eqiad and codfw, it was discovered that packets were being dropped by asw devices in eqiad (see T274234).

Average usage is well within link capacity, but it is likely we are seeing microbursts as described here:

https://kb.juniper.net/InfoCenter/index?page=content&id=KB36095

Mitigation - Buffer Allocation

By default the switches reserve 50% of buffer memory for a "lossless" traffic class we don't use. It seems we can re-partition the space to dedicate the majority of it to best-effort instead.

https://www.juniper.net/documentation/us/en/software/junos/traffic-mgmt-qfx/topics/example/cos-shared-buffer-allocation-lossy-ucast-qfx-series-configuring.html
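As a rough sketch of what such a re-partitioning could look like in Junos `set` syntax: the percentages below follow Juniper's recommended example for mostly best-effort unicast networks (the page linked above) as best recalled, and are an assumption here. They are not necessarily the values that end up in the homer template, so check them against the linked page before use.

```
# Illustrative only: percentages assumed from Juniper's recommended example,
# not taken from the deployed configuration.

# Ingress: keep a small lossless reserve, no headroom, rest to best-effort (lossy)
set class-of-service shared-buffer ingress percent 100
set class-of-service shared-buffer ingress buffer-partition lossless percent 5
set class-of-service shared-buffer ingress buffer-partition lossless-headroom percent 0
set class-of-service shared-buffer ingress buffer-partition lossy percent 95

# Egress: majority to best-effort unicast, remainder to multicast
set class-of-service shared-buffer egress percent 100
set class-of-service shared-buffer egress buffer-partition lossless percent 5
set class-of-service shared-buffer egress buffer-partition lossy percent 75
set class-of-service shared-buffer egress buffer-partition multicast percent 20
```

After a commit, `show class-of-service shared-buffer` should report the resulting partition sizes, which gives a quick way to confirm the change took effect on a given switch.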

This change should be rolled out to all EX/QFX switches across the network. Provisionally scheduled as follows:

Row | Date | Time | Task | Status
D | Tues July 20th 2021 | 15:00 UTC (08:00 PDT / 11:00 EDT / 17:00 CEST) | T286069 | Complete - No Issues
C | Thurs July 22nd 2021 | 15:00 UTC (08:00 PDT / 11:00 EDT / 17:00 CEST) | T286065 | Complete - No Issues
B | Tues July 27th 2021 | 15:00 UTC (08:00 PDT / 11:00 EDT / 17:00 CEST) | T286061 |
A | Thurs July 29th 2021 | 15:00 UTC (08:00 PDT / 11:00 EDT / 17:00 CEST) | T286032 |

Event Timeline

jbond triaged this task as Medium priority. Jun 21 2021, 2:40 PM

Change 701499 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Adding 'quality-of-service' template for use on QFX/EX series switches.

https://gerrit.wikimedia.org/r/701499

cmooney updated the task description.

What is the expected length of service interruption on each of these days? I'm looking at the impact on the dumpsdata/snapshot hosts, and depending on the length of the outage, we might be able to get by with some minor shuffling of hosts.

I’ve no reason to think anything other than what’s in the child tasks at this point. Having put out feelers externally, I’ve got some anecdotal reports that a few seconds is correct to expect.

The exact duration of the impact is unknown at this time - we hope to be able to test on some real switches before the date and get a firm indication. Best estimate is it will be in the order of seconds, certainly no longer than a minute, but we should plan for up to a 5-minute interruption, and be aware as always that there is a small potential something will go wrong and cause a longer disturbance.

This is quite good, even taking into account the risk that something goes awry. Thanks!

@cmooney I haven't been able to get ahold of you this week, so I'm leaving the comment I left on IRC here:
My preferred order for the switch maintenance would be: row d, c, b, a (again, this is what would work best for the DBAs, as it would give us more time to work on row a and row b replacements)

@Marostegui Ok thanks for the comments. I've not been feeling so good so hadn't been online.

Will review Monday against feedback from other teams but I'm sure we can accommodate. Also please advise if you expect timelines to be workable, or if they are a little tight we can look at pushing out so everyone has time to prepare.

I hope you get better soon. I am off next week, but someone from the team will contact you then.
From my point of view, I think if we follow that row order, we should be ok with the given dates.

cmooney updated the task description.

With the new schedule I think I can swap one dumpsdata host and one snapshot host and avoid any impact whatsoever on XML/SQL dumps. This is great, thank you!

@cmooney should we send out an email about this to ops@ and possibly add those times/dates to the maintenance calendar? Thank you!

@jijiki thanks, yes, good suggestions both. I will send a mail to ops@ later today as a reminder for people to review.

In terms of the maintenance calendar, do you mean the "Ops vendor maintenance" one, or something else? I tried to add it to that one but I don't think I have permissions, though I'm sure I can sort that out.

Change 701499 merged by jenkins-bot:

[operations/homer/public@master] Adding 'quality-of-service' template for use on QFX/EX series switches.

https://gerrit.wikimedia.org/r/701499

Change 705722 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Previous change had incorrect file extention on the 'class-of-service' config template. Correcting here.

https://gerrit.wikimedia.org/r/705722

Change 705722 merged by jenkins-bot:

[operations/homer/public@master] Previous change had incorrect file extention on the 'class-of-service' config template. Correcting here.

https://gerrit.wikimedia.org/r/705722

Change 706568 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Adding flag for asw2-c-eqiad to configure class-of-service shared buffer config. This will keep it in line with config added manually in T286065.

https://gerrit.wikimedia.org/r/706568

Change 706568 merged by jenkins-bot:

[operations/homer/public@master] Adding flag for asw2-c-eqiad to configure class-of-service shared buffer config. This will keep it in line with config added manually in T286065.

https://gerrit.wikimedia.org/r/706568