
Anycast: consistent ICMP packet too big routing
Open, Low, Public

Description

Creating a dedicated task from T253666#6166521

Related: we have the issue of ICMP Packet-Too-Big (PTB) routing: AFAIK Juniper doesn't even try to route a PTB generated by an intermediate router to the same server as the primary traffic it references. This probably isn't a major issue for the authdns case, because (a) the client recursors should mostly be on server (rather than eyeball) networks with full MTU, and (b) the overwhelming majority of all traffic is UDP with packet sizes small enough to fit any reasonable network. However, it would be nice to be correct for edge cases like recursors in eyeball networks with MTU problems, and to future-proof against increasing TCP usage (for cookie init and other blind-injection avoidance, and also DoTLS and future DNSSEC packet size increases). Cloudflare's generic answer to this problem has been https://github.com/cloudflare/pmtud , but there might be different and/or simpler approaches we want to try as well.

Event Timeline

ayounsi created this task.

pmtud sends the packets to the broadcast MAC address, which means it only works within a single subnet, while we have hosts on different subnets (rows) in the core DCs.
However, and slightly related to what is discussed in https://github.com/cloudflare/pmtud/pull/3 , it might be possible to forward the packet to a multicast address instead (or via Kafka), with proper safeguards so we don't create a loop.
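
To make the multicast variant concrete, here is a rough Python sketch (not something we run; the group address, port and TTL are placeholders, and the receiving/re-injection side is omitted). One nice property is that the relayed copy travels as UDP rather than ICMP, so it can never match the raw ICMP receive socket again, which sidesteps the loop concern:

```
import socket

MCAST_GRP = "239.192.0.42"   # placeholder organisation-local multicast group
MCAST_PORT = 4821            # placeholder UDP port for relayed PTBs

# Raw socket receiving all ICMPv4 delivered to this host (needs CAP_NET_RAW).
icmp_rx = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_ICMP)

# Ordinary UDP socket used to republish matching packets to the group.
mcast_tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
mcast_tx.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 8)
mcast_tx.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_LOOP, 0)

while True:
    pkt, _src = icmp_rx.recvfrom(2048)
    ihl = (pkt[0] & 0x0F) * 4                  # IPv4 header length
    icmp_type, icmp_code = pkt[ihl], pkt[ihl + 1]
    if icmp_type == 3 and icmp_code == 4:      # "fragmentation needed" (PTB)
        # Republish the ICMP message over UDP; a receiver joining the group
        # would re-inject it locally if the embedded original datagram was
        # destined for one of its service IPs.
        mcast_tx.sendto(pkt[ihl:], (MCAST_GRP, MCAST_PORT))
```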

Juniper opened Enhancement Request ER-081995 to address it.
Not sure that will be useful to us by the time it's implemented, but at least it could help people in the future.

Some notes from last week's IRC chat:

  • It would be useful to count the ICMP PTB messages we receive (regardless of whether they arrive on the correct server or not) so we know:
    • How much it's impacting us.
    • How to fine-tune the MTU.
    • Unfortunately, "the standard linux snmp stats counters only give the ICMP Type, but not the subcode"
    • As a workaround, an iptables rule with a matching prometheus-exporter would do the job, even if it's overkill; a tcpdump exporter/wrapper is another option (see the sketch after this list).
  • Cloudflare's pmtud requires all the real servers to be in the same vlan (broadcast)
    • This one might be a good alternative, as it uses IPIP encapsulation (unicast)
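
As a rough illustration of the "tcpdump exporter/wrapper" idea, the counting core could be as simple as the Python sketch below, which tallies PTB messages per subcode ourselves since the kernel SNMP counters only expose the ICMP type (needs root / CAP_NET_RAW; exporting the counters is left out):

```
import socket
import threading

counters = {"icmp4_frag_needed": 0, "icmp6_packet_too_big": 0}

def count_v4():
    # Raw IPv4 ICMP sockets hand us the IP header too, so skip past it.
    s = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_ICMP)
    while True:
        pkt, _ = s.recvfrom(2048)
        ihl = (pkt[0] & 0x0F) * 4
        if pkt[ihl] == 3 and pkt[ihl + 1] == 4:    # type 3 / code 4
            counters["icmp4_frag_needed"] += 1

def count_v6():
    # Raw ICMPv6 sockets start at the ICMPv6 header; type 2 is Packet Too Big.
    s = socket.socket(socket.AF_INET6, socket.SOCK_RAW, socket.IPPROTO_ICMPV6)
    while True:
        pkt, _ = s.recvfrom(2048)
        if pkt[0] == 2:
            counters["icmp6_packet_too_big"] += 1

threading.Thread(target=count_v4, daemon=True).start()
count_v6()
```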

In case it is useful: a lighter-weight solution for Prometheus (but one that we have to maintain ourselves) would be to use node-exporter's textfile collector to periodically export the iptables counters for said rules.
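
Something along these lines, for example; this is only a sketch, and the chain name, rule comment and textfile path are assumptions rather than anything deployed:

```
#!/usr/bin/env python3
# Hypothetical timer job: read the packet counter of an iptables rule tagged
# with a comment, and write it in Prometheus text format for node-exporter's
# textfile collector.
import os
import subprocess

TEXTFILE = "/var/lib/prometheus/node.d/icmp_ptb.prom"   # assumed textfile dir
RULE_COMMENT = "count-icmp-ptb"                          # assumed rule comment

out = subprocess.check_output(["iptables", "-L", "INPUT", "-nvx"], text=True)

packets = 0
for line in out.splitlines():
    if RULE_COMMENT in line:
        packets = int(line.split()[0])   # first column of -nvx output is pkts
        break

tmp = TEXTFILE + ".tmp"
with open(tmp, "w") as f:
    f.write("# HELP icmp_ptb_packets_total ICMP fragmentation-needed packets matched\n")
    f.write("# TYPE icmp_ptb_packets_total counter\n")
    f.write(f"icmp_ptb_packets_total {packets}\n")
os.replace(tmp, TEXTFILE)   # atomic rename so node-exporter never reads a partial file
```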

Thanks @ayounsi, that exaring project looks to be a fairly sensible approach alright, albeit fairly new. Might be worth testing out.

I note our NSes seem to use an EDNS payload size of 1024? That is fairly conservative, and probably ensures we dodge a lot of potential MTU blackholes. We could consider clamping TCP responses with a similarly low MSS, to try to minimize problems (bear in mind many networks block ICMP, so PMTUD isn't always going to work, even if we can ensure these packets get to the right box). While inefficient, the number of large DNS responses is currently quite low, so proportionally this should remain a small amount of traffic.
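
For illustration only (not our DNS daemon's actual configuration; the MSS value and port below are arbitrary placeholders, and the usual alternative is an iptables TCPMSS rule), MSS clamping can be approximated at the socket level like this:

```
import socket

# Illustration only: clamp the MSS a TCP listener advertises so each segment
# stays well below common reduced path MTUs, even when ICMP PTB never reaches
# this box.
srv = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.setsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG, 1220)  # set before listen()
srv.bind(("::", 5353))
srv.listen()
# Connections accepted from this listener inherit the clamped MSS, so large
# responses are split into conservatively sized segments without depending
# on Path MTU Discovery working end to end.
```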

But firstly I think the suggestion to measure how much of this we see makes sense. My gut feeling is that an iptables rule will be more performant than anything with tcpdump, but I've not looked into it.