GRE MTU mitigations - Tracking
Closed, Resolved · Public

Description

With the current GREs enabled in eqiad and esams, there have been numerous small issues with the reduced MTU of 1476: T218184, T232456, T232491, enwiki VP

We have a mitigation, but it is not without tradeoffs, and I don't feel comfortable turning it on globally for all hosts (or even all of eqiad + esams). We've been applying the mitigation manually, so it will naturally be lost on the next reboot. This task tracks all the mitigated hosts, so that we can later ensure it is all reverted if/when we have a better solution.

The fix is to change the advmss parameter on the default route. This command turns on the mitigation idempotently:

ip -4 route replace $(ip -4 route show default) advmss 1436

And to remove it without rebooting would be:

ip -4 route replace $(ip -4 route show default) advmss 0
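
For reference, the 1436 value follows from the tunnel MTU. Below is a minimal sketch of the arithmetic, assuming a 20-byte IPv4 header and a 20-byte TCP header with no options. The helper name `mss_for_mtu` is made up for illustration and is not part of any existing tooling.

```shell
# Hypothetical helper: derive the TCP MSS to advertise for a given path MTU.
# Assumes a 20-byte IPv4 header and a 20-byte TCP header (no options).
mss_for_mtu() {
    mtu=$1
    echo $((mtu - 20 - 20))
}

# GRE over IPv4 adds a 20-byte outer IP header plus a 4-byte basic GRE header,
# so a 1500-byte link leaves 1476 bytes for the tunnelled packet.
gre_mtu=$((1500 - 20 - 4))

mss_for_mtu "$gre_mtu"   # prints 1436
```

The result could then be passed as the `advmss` value in the `ip -4 route replace` command above instead of hard-coding it.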

The current list of mitigated hosts, AFAIK, is:

  • archiva1001
  • install1002
  • cp1xxx (all)
  • cp3xxx (all)
  • cp5xxx (all)
  • cobalt (gerrit)
  • gerrit2001
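
Because the mitigation is applied by hand, it may help to check which hosts currently carry it. This is a rough sketch, assuming `ip` prints the metric as `advmss <value>` on the route line. `route_has_advmss` is an illustrative name, not an existing tool, and how a route reset with `advmss 0` is displayed should be confirmed on a real host.

```shell
# Hypothetical helper: succeed if a route line carries an advmss setting.
# $1 is the text printed by `ip -4 route show default`.
route_has_advmss() {
    case " $1 " in
        *" advmss "*) return 0 ;;
        *)            return 1 ;;
    esac
}

# On a host, this could be used as:
#   if route_has_advmss "$(ip -4 route show default)"; then
#       echo "advmss set on default route"
#   else
#       echo "no advmss on default route"
#   fi
```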

Event Timeline

BBlack triaged this task as Normal priority. Sep 11 2019, 12:09 PM
BBlack created this task.
Restricted Application added a project: Operations. Sep 11 2019, 12:09 PM
Restricted Application added a subscriber: Aklapper.
BBlack updated the task description. Sep 11 2019, 12:10 PM
elukey added a subscriber: elukey. Sep 11 2019, 12:13 PM
jbond added a subscriber: jbond. Sep 11 2019, 4:13 PM

Mentioned in SAL (#wikimedia-operations) [2019-09-11T17:32:44Z] <bblack> enable GRE MTU mitigation on eqsin caches (cp5xxx) - T232602

BBlack updated the task description. Sep 11 2019, 5:33 PM
Dzahn added a subscriber: Dzahn. Sep 13 2019, 4:34 AM

Change 536401 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] puppetize setting advmss (MTU) size for GRE mitigations

https://gerrit.wikimedia.org/r/536401

Note that with new eqiad routing engines we can set the MSS at the router level (untested).
Advantages are: easier to deploy (one configuration change) and can be applied to external flows only, not all flows in/out of a server.
All DCs except esams should support it for now. (esams after the refresh).

Right, that would cover cases like install1002 and archiva (and probably many other minor cases we've missed that haven't set off big alarm bells), but we'll still need direct mitigation on the hosts where it matters for inbound (the cpNNNN, gerrit, etc., which probably also have a long tail of cases we haven't really noticed yet).

As discussed on IRC, this *should* work for inbound (clamping the SYNACK too), but to be tested.
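
The router-side approach amounts to rewriting the MSS option on SYN and SYN-ACK segments crossing the interface so that it never exceeds the configured value. Here is a toy model of that rule, with `clamp_mss` as an illustrative name. It is not router configuration, only a sketch of the expected behaviour.

```shell
# Hypothetical model of MSS clamping: the advertised MSS is lowered to the
# clamp value when it is larger, and left untouched when it is already smaller.
clamp_mss() {
    advertised=$1
    clamp=$2
    if [ "$advertised" -gt "$clamp" ]; then
        echo "$clamp"
    else
        echo "$advertised"
    fi
}

clamp_mss 1460 1436   # peer on a 1500-byte MTU advertises 1460; prints 1436
clamp_mss 1400 1436   # already below the clamp; prints 1400
```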

Change 536401 abandoned by Dzahn:
puppetize setting advmss (MTU) size for GRE tunnel mitigations

Reason:
per Brandon "hoping we don't need to go down this road and we'll find better ways to deal with this". If that turns out to be untrue we can restore it.

https://gerrit.wikimedia.org/r/536401

Mentioned in SAL (#wikimedia-operations) [2019-09-19T18:12:51Z] <XioNoX> add TCP-MSS 1436 to cr1-eqiad external interfaces - T232602

Mentioned in SAL (#wikimedia-operations) [2019-09-19T18:14:45Z] <XioNoX> add TCP-MSS 1436 to cr2-eqiad external interfaces - T232602

ayounsi added a comment. Edited Sep 19 2019, 6:30 PM
  • Setting tcp-mss on an interface causes all the BGP sessions going over that interface to bounce (reason: "Interface change for the peer-group").
  • As eqiad and codfw exchange a full view, some outbound eqiad traffic goes through codfw so we should clamp codfw/codfw too, but as it's very little traffic it might not be worth it. We might want to isolate eqiad/codfw more too later on.

@BBlack @faidon let me know when is a good time to remove that MSS hack on the routers.
To be done one router at a time with time in between for the sessions to re-establish. Will also drain NTT/Telia using BGP graceful shutdown beforehand.

CDanis added a subscriber: CDanis. Sep 25 2019, 6:48 PM

Mentioned in SAL (#wikimedia-operations) [2019-09-25T21:56:49Z] <bblack> remove GRE MTU hacks on eqsin caches (cp5xxx) - T232602

Mentioned in SAL (#wikimedia-operations) [2019-09-25T21:57:40Z] <bblack> remove GRE MTU hacks on esams caches (cp3xxx) - T232602

Mentioned in SAL (#wikimedia-operations) [2019-09-25T21:58:16Z] <bblack> remove GRE MTU hacks on eqiad caches (cp1xxx) - T232602

Mentioned in SAL (#wikimedia-operations) [2019-09-25T21:59:20Z] <bblack> remove GRE MTU hacks on archiva1001 gerrit2001 cobalt install1002 - T232602

@BBlack @faidon let me know when is a good time to remove that MSS hack on the routers.
To be done one router at a time with time in between for the sessions to re-establish. Will also drain NTT/Telia using BGP graceful shutdown beforehand.

We're good to go to do this, please proceed with all appropriate caution :)

Mentioned in SAL (#wikimedia-operations) [2019-09-27T05:23:57Z] <XioNoX> remove tcp-mss clamping from cr1-eqiad - T232602

Mentioned in SAL (#wikimedia-operations) [2019-09-27T05:30:26Z] <XioNoX> remove tcp-mss clamping from cr2-eqord - T232602

Mentioned in SAL (#wikimedia-operations) [2019-09-27T05:42:31Z] <XioNoX> remove tcp-mss clamping from cr2-eqiad - T232602

BBlack closed this task as Resolved. Sep 27 2019, 4:50 PM