Add network-layer protections to avoid inadvertently lowering IRB MTU
Closed, Resolved, Public

Description

The recent cloud services outage was triggered by problems on the WMCS network storage (Ceph) clusters, caused by a lowered MTU on an L3 switch IRB interface.

While the particular issue looked more like a bug than what Juniper say ought to happen, it's clear there is a risk: if an MTU is not set on an L2 interface in Netbox, the switch port will not have one assigned. It then defaults to 1514, potentially causing issues with routing of jumbo frames.

Medium-term we should accelerate the work on T310590, to enforce a constraint that MTUs have to be set (or even set to a particular value) for L2 access/trunk ports. In the meantime it may be worth adding some other mitigations to prevent a misconfiguration similar to the one that caused the cloud outage.

Creating this task to track options and changes towards that end.
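
To illustrate the kind of constraint discussed above, here is a minimal Python sketch that flags L2 ports with no MTU set. It is not the T310590 implementation: the function name `l2_ports_missing_mtu` and the plain-dict interface shape (`name`, `mode`, `mtu`) are assumptions made for illustration. The mode strings mirror Netbox's 802.1Q mode values.

```python
# Hypothetical sketch: find L2 (access/trunk) ports that have no MTU set.
# The dict layout below is an assumption, not the real Netbox data model.

# Netbox 802.1Q modes that mark an interface as an L2 access or trunk port.
L2_MODES = {"access", "tagged", "tagged-all"}


def l2_ports_missing_mtu(interfaces):
    """Return the names of L2 ports whose MTU is unset (missing or None)."""
    return [
        iface["name"]
        for iface in interfaces
        if iface.get("mode") in L2_MODES and iface.get("mtu") is None
    ]


if __name__ == "__main__":
    ports = [
        {"name": "ge-0/0/1", "mode": "access", "mtu": None},
        {"name": "ge-0/0/2", "mode": "tagged", "mtu": 9192},
        {"name": "irb.100", "mode": None, "mtu": None},  # L3, ignored
    ]
    for name in l2_ports_missing_mtu(ports):
        print(f"L2 port {name} has no MTU set")
```

A check like this could run as a report or validation step before configuration is generated, so a missing value is caught before it reaches a switch.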

Event Timeline

cmooney triaged this task as Medium priority. Feb 15 2023, 9:58 PM
cmooney created this task.

Change 889635 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Default L2 interfaces to MTU 9212 if not set from Netbox

https://gerrit.wikimedia.org/r/889635

The above patch addresses the issue by ensuring Homer adds an MTU of 9192 on any L2 switch ports which don't have a specific MTU set in Netbox.

It's a no-op on our current switches, see P44676.

I took this approach because changing the puppet import script seemed trickier. I had some code working for that, but verifying that it would work in all scenarios looked difficult. Given that uncertainty, this approach seemed more straightforward, and it also catches *any* instance where an L2 port has no MTU set (not just cases where the puppet import script failed to set one).

Happy to discuss other options.
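
For readers who have not opened the patch, the fallback idea can be sketched in a few lines of Python. This is not Homer's actual code: the names `DEFAULT_L2_MTU` and `effective_mtu` and the dict-shaped interface are hypothetical. The default of 9192 follows the figure quoted in the comment above (the patch title says 9212), so treat the constant as a placeholder.

```python
# Hypothetical sketch of "use the Netbox MTU if set, otherwise a default".
# Not Homer's real code; names and data shape are illustrative assumptions.

DEFAULT_L2_MTU = 9192  # assumed placeholder, taken from the comment text


def effective_mtu(interface, default=DEFAULT_L2_MTU):
    """Return the MTU to render for an L2 port.

    An MTU explicitly set in the source of truth always wins; the default
    is applied only when the value is missing or None.
    """
    mtu = interface.get("mtu")
    return default if mtu is None else mtu


if __name__ == "__main__":
    print(effective_mtu({"name": "ge-0/0/1", "mtu": None}))
    print(effective_mtu({"name": "ge-0/0/2", "mtu": 1500}))
```

Because an explicitly set value is never overridden, a change of this shape is a no-op wherever every L2 port already has an MTU.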

cmooney updated the task description.

Change 889635 merged by jenkins-bot:

[operations/homer/public@master] Default L2 interfaces to MTU 9212 if not set from Netbox

https://gerrit.wikimedia.org/r/889635

Change 893690 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/software/netbox-extras@master] Set switch-side MTU to 9192 for discovered links from servers

https://gerrit.wikimedia.org/r/893690

Change 893690 merged by jenkins-bot:

[operations/software/netbox-extras@master] Set switch-side MTU to 9192 for discovered links from servers

https://gerrit.wikimedia.org/r/893690