GRE MTU mitigations - Tracking
Open, Normal, Public

Description

With GRE tunnels currently enabled in eqiad and esams, there have been numerous small issues caused by the reduced MTU of 1476: T218184, T232456, T232491, enwiki VP

We have a mitigation, but it is not without tradeoffs, and I don't feel comfortable turning it on globally for all hosts (or even for all of eqiad + esams). We've been applying the mitigation manually, so it will naturally be lost on the next reboot. This task tracks all the mitigated hosts, so that we can later ensure it is all reverted if/when we have a better solution.

The mitigation is to set the advmss parameter (the TCP MSS the host advertises) on the default route. This command turns on the mitigation idempotently:

ip -4 route replace $(ip -4 route show default) advmss 1436
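
For reference, the value 1436 follows from the tunnel MTU quoted above: 1476 bytes minus 20 bytes of IPv4 header and 20 bytes of TCP header, assuming neither header carries options. A minimal sketch of that arithmetic; the function name `mss_for_mtu` is made up for illustration and is not part of any deployed tooling:

```shell
# Hypothetical helper, for illustration only.
# Prints the TCP MSS that fits in a given IPv4 MTU, assuming a 20-byte
# IPv4 header and a 20-byte TCP header (no IP or TCP options).
mss_for_mtu() {
    echo $(( $1 - 20 - 20 ))
}

mss_for_mtu 1476   # prints 1436
```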

And to remove it without rebooting would be:

ip -4 route replace $(ip -4 route show default) advmss 0
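
Because the mitigation is applied by hand and disappears on reboot, it can help to check whether a given host still has it. A rough sketch, assuming a single IPv4 default route and that `ip` prints the option as `advmss <value>` on the route line; `advmss_of_route` is a hypothetical helper name:

```shell
# Hypothetical helper: prints the advmss value found in one route line,
# or 0 if the line carries no advmss option.
advmss_of_route() {
    # Intentionally unquoted: split the route line into words.
    set -- $1
    while [ "$#" -gt 1 ]; do
        if [ "$1" = "advmss" ]; then
            echo "$2"
            return 0
        fi
        shift
    done
    echo 0
}

# Usage on a live host (assumes exactly one IPv4 default route):
#   advmss_of_route "$(ip -4 route show default)"
```

A result of 1436 would mean the mitigation is in place; 0 would mean it was never applied, was removed, or was lost on reboot.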

The current list of mitigated hosts, AFAIK, is:

  • archiva1001
  • install1002
  • cp1xxx (all)
  • cp3xxx (all)
  • cp5xxx (all)
  • cobalt (gerrit)
  • gerrit2001

Event Timeline

BBlack triaged this task as Normal priority. Wed, Sep 11, 12:09 PM
BBlack created this task.
Restricted Application added a project: Operations. Wed, Sep 11, 12:09 PM
Restricted Application added a subscriber: Aklapper.
BBlack updated the task description. Wed, Sep 11, 12:10 PM
elukey added a subscriber: elukey. Wed, Sep 11, 12:13 PM
jbond added a subscriber: jbond. Wed, Sep 11, 4:13 PM

Mentioned in SAL (#wikimedia-operations) [2019-09-11T17:32:44Z] <bblack> enable GRE MTU mitigation on eqsin caches (cp5xxx) - T232602

BBlack updated the task description. Wed, Sep 11, 5:33 PM
Dzahn added a subscriber: Dzahn. Fri, Sep 13, 4:34 AM

Change 536401 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] puppetize setting advmss (MTU) size for GRE mitigations

https://gerrit.wikimedia.org/r/536401

Note that with the new eqiad routing engines we can set the MSS at the router level (untested).
The advantages are that it is easier to deploy (one configuration change) and that it can be applied to external flows only, not to all flows in/out of a server.
All DCs except esams should support it for now (esams will after the refresh).

Right, that would cover cases like install1002 and archiva (and probably many other minor cases we've missed that haven't set off big alarm bells), but we'll still need direct mitigation on the hosts where it matters for inbound (the cpNNNN, gerrit, etc., which probably also have a long tail of cases we haven't really noticed yet).

As discussed on IRC, this *should* work for inbound too (by also clamping the SYNACK), but that remains to be tested.
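
For context, MSS clamping in general amounts to lowering the MSS option in passing TCP SYN and SYN-ACK segments when it exceeds a configured ceiling. The sketch below only models that rule in shell; it is not router configuration, and both the name `clamp_mss` and the 1436 ceiling (borrowed from the host-level mitigation in the description) are illustrative assumptions:

```shell
# Hypothetical model of the clamping rule, for illustration only.
# $1: MSS advertised in the SYN or SYN-ACK, $2: configured ceiling
clamp_mss() {
    if [ "$1" -gt "$2" ]; then
        echo "$2"   # advertised value is too large: lower it to the ceiling
    else
        echo "$1"   # advertised value already fits: leave it unchanged
    fi
}

clamp_mss 1460 1436   # prints 1436
clamp_mss 1400 1436   # prints 1400
```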

Change 536401 abandoned by Dzahn:
puppetize setting advmss (MTU) size for GRE tunnel mitigations

Reason:
per Brandon "hoping we don't need to go down this road and we'll find better ways to deal with this". If that turns out to be untrue we can restore it.

https://gerrit.wikimedia.org/r/536401