
MTU setting in IPv6 VMs causes issues with Docker
Closed, Resolved · Public

Description

IPv6-enabled VMs in Cloud VPS have mtu=1450 on their network interface. This is a problem if you run Docker in those VMs, because Docker assumes an MTU of 1500. This can lead to network errors like the ones seen in T405742: tofu-provisioning: Failed to install provider.

We should either:

  • find a way to raise the MTU to 1500 on all VMs
  • make sure that Docker uses the right MTU setting on all VMs

In gitlab-runners VMs this was fixed by modifying /etc/docker/daemon.json, see patches https://gerrit.wikimedia.org/r/c/operations/puppet/+/1196493 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1196929.
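For reference, the core of that fix is a single Docker daemon setting. A sketch of the resulting /etc/docker/daemon.json (the authoritative contents are in the Puppet patches above):

```json
{
  "mtu": 1450
}
```

With this in place, dockerd creates its default bridge network with a 1450-byte MTU instead of the hardcoded 1500.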

In other VMs, this is still an issue. For example, I verified I can reproduce the network errors in tools-harbor-2:

root@tools-harbor-2:~# docker run --rm -it docker-registry.wikimedia.org/bookworm

[...]

root@ca1585e47d54:/# TESTURL='https://github.com/terraform-provider-openstack/terraform-provider-openstack/releases/download/v3.3.2/terraform-provider-openstack_3.3.2_SHA256SUMS'
root@ca1585e47d54:/# for i in {1..50}; do curl $TESTURL -s -o /dev/null -L --connect-timeout 1 && echo -n '.' || echo -n 'F'; done; echo ''
...F.F..FF..F..FF..F.F....FF.FFFF..FFFF..F.F....F.

Running apt update and apt install to install curl also failed randomly a few times before I could execute the test above. I assume that was also caused by the MTU mismatch.

Details

Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Set MTU 1500 for VXLAN networks in eqiad1 | repos/cloud/cloud-vps/tofu-infra!282 | taavi | main-I67313abded6df9eb4a32abcdf0815a37faef0841 | main
Set MTU 1500 for VXLAN networks in codfw1dev | repos/cloud/cloud-vps/tofu-infra!280 | taavi | main-I6f157357ba569e5f30fd258dde19b00776b2a31f | main

Event Timeline


I'm +1 on getting docker's puppetization to do the right thing, in other words detecting the default route's MTU and setting that value as docker's default.

Docker actually used to auto-detect the default route's MTU, though that was removed in https://github.com/moby/moby/pull/18108 because it is a brittle mechanism in general. In our case, though, I think auto-detecting via Puppet will be fine, since we know and control the environment.
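A minimal sketch of that auto-detection (an assumed approach, not the actual Puppet code): read the default route's outgoing interface with iproute2, then that interface's MTU from sysfs. A Puppet fact exposing this value could then feed docker's configuration.

```shell
# Sketch: find the default route's interface and read its MTU.
# Falls back to lo if there is no default route (e.g. in a sandbox).
dev=$( (ip route show default 2>/dev/null || true) \
      | awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }')
mtu=$(cat "/sys/class/net/${dev:-lo}/mtu")
echo "default route dev=${dev:-lo} mtu=${mtu}"
```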

cc-ing @cmooney for networking considerations

Docker actually used to auto-detect default route MTU, though that was removed in https://github.com/moby/moby/pull/18108

It's interesting that in that issue they seem to think having a higher MTU in Docker should not cause issues: "The kernel performs path MTU discovery to resolve this exact situation". Why is this not working in our case?

A possible explanation is in this more recent discussion: https://github.com/moby/moby/issues/49398#issuecomment-2636987625. I wonder if the workaround mentioned there would work in our case:

Configure MSS-Clamping on VM1 (example) to fix the TCP MSS manually. This is a very common workaround for exactly the problem, that large parts of the Internet don't support PMTUD correctly. But this only helps for TCP traffic, not UDP.

A major downside of doing it via Puppet is that it won't have any effect on instances with non-Puppetized services (so basically anything not managed by WMF SREs).

fnegri triaged this task as Medium priority. · Oct 29 2025, 3:05 PM

A major downside of doing it via Puppet is that it won't have any effect on instances with non-Puppetized services (so basically anything not managed by WMF SREs).

Great point, I'm now convinced if we go down the MSS clamping route (hah!) we should be doing it ideally at the neutron level if at all possible (in other words, being able to tell neutron "wherever you are deploying this network, also deploy mss clamping")

I think this also applies to IPv4 on those networks? The problem isn't IPv6 as such, but the overhead of the VXLAN encapsulation used to tunnel packets from cloudvirt to cloudnet hosts. Right now those hosts have a 1500-byte MTU on their primary interface, so the additional bytes required by the VXLAN header have to eat into that, and thus reduce the MTU that can be offered to instances.

Arturo and I discussed this during the rollout, including the option of moving to jumbo-frame support on cloudvirt/cloudnet to allow the VXLAN-tunneled packets to be larger than 1500 bytes on the wire. After testing we found Neutron was correctly setting the MTU in instances to less than 1500, accounting for the VXLAN overhead, and tests went well, so we avoided the complexity of increasing the MTU on the physical host interfaces.
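For context, the 1450 figure matches the usual VXLAN-over-IPv4 overhead (assuming the standard 50-byte encapsulation):

```
  1500 bytes  physical MTU on the cloudvirt/cloudnet interface
 -  20 bytes  outer IPv4 header
 -   8 bytes  outer UDP header
 -   8 bytes  VXLAN header
 -  14 bytes  inner Ethernet header
 = 1450 bytes MTU Neutron can offer to instances
```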

It's interesting that in that issue they seem to think having a higher MTU in Docker should not cause issues: "The kernel performs path MTU discovery to resolve this exact situation". Why is this not working in our case?

Path MTU should indeed work. It relies on ICMP packets being generated by hosts which find their MTU lower than packets trying to be sent, and those ICMPs getting back to the sender so they can re-try with smaller ones.

In this case it would require either the cloudnet or cloudvirt hosts to send an ICMP "fragmentation needed" packet back to (for instance) apt.wikimedia.org when the HTTP server sends a packet bigger than the instance can receive. The ICMPs need to be allowed back and not dropped by any firewall/acl, but I don't believe that should be an issue for us (on the internet it can't be relied on for this reason).

Those ICMPs also need to work in the other direction: VMs should send an ICMP back to a Docker container over its veth link. But the problem is normally more evident inbound: when a small client request packet generates a big server response, it's the big response that doesn't get back.

A possible explanation is in this more recent discussion: https://github.com/moby/moby/issues/49398#issuecomment-2636987625. I wonder if the workaround mentioned there would work in our case:

Configure MSS-Clamping on VM1 (example) to fix the TCP MSS manually. This is a very common workaround for exactly the problem, that large parts of the Internet don't support PMTUD correctly. But this only helps for TCP traffic, not UDP.

MSS clamping is not perfect, but for the most part works well. Even if pmtud works it is preferable, as it starts the connection with a workable MTU instead of waiting for it to fail and then re-trying lower. You could absolutely insert a rule in nftables on the VM to ensure that SYN packets in either direction have their MSS capped.
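As a sketch (the table and chain names here are assumptions, not an existing ruleset), an nftables rule of this shape clamps the MSS on forwarded SYN packets to the route MTU:

```
# Hypothetical nftables ruleset: rewrite the TCP MSS option on SYN packets
# so each peer advertises a segment size that fits the actual path MTU.
table inet mangle {
    chain forward {
        type filter hook forward priority mangle; policy accept;
        tcp flags syn tcp option maxseg size set rt mtu
    }
}
```

`rt mtu` makes nftables use the routing table's MTU towards the packet's destination, so the same rule works whether the path MTU is 1450 or 1500.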

I would rate the options in terms of fixing it in this order:

  1. Move cloudvirt/cloudnet to support jumbo frames ensuring VMs can have a 1500 byte MTU
    • The potential problem here is if these hosts send 9000-byte frames to other things in our infra that get dropped
    • MSS clamping, path-mtu-discovery etc in theory can overcome these problems
    • It's a delicate thing to experiment with in production
  2. Have docker correctly set the MTU on the container-side of the veth pairs it creates, based on the VM interface/route MTU
  3. Use MSS clamping on the VM and work to ensure all ICMPs are allowed to support path-mtu-discovery
    • Keeps the mismatch, but hopefully those other tactics ensure it won't be an issue

I would rate the options in terms of fixing it in this order:

Sorry I should have said that's my rating in terms of effectiveness / best possible solution.

In terms of actions for the short/medium-term I think we're probably best investigating option 2. Is it possible to add an /etc/docker/daemon.json file to these VMs with "mtu": 1450 in it?

Thanks @cmooney for the detailed analysis.

I think this also applies to IPv4 on those networks? The problem isn't IPv6 as such, but due to the overhead the VXLAN encapsulation

Yes it does apply to IPv4 packets, but I can only reproduce this issue on IPv6-enabled VMs. Is it because we don't use VXLAN on the old IPv4-only VMs?

Is it possible to add an /etc/docker/daemon.json file to these VMs with "mtu": 1450 in it?

I think this can be done easily, but it has two downsides:

  • it will not fix puppet-less VMs, as @taavi mentioned above
  • it will not fix custom docker networks (docker network create), for those you need default-network-opts in daemon.json, which is only available in Docker >=v24.0 (so only on Debian Trixie)

Fixing this with /etc/docker/daemon.json is therefore incomplete, but we can still consider it.
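Putting the two pieces together, a daemon.json covering both the default bridge and newly-created networks could look like this (a sketch: the values assume the 1450-byte instance MTU, and the default-network-opts key is only honored on Docker versions that support it):

```json
{
  "mtu": 1450,
  "default-network-opts": {
    "bridge": {
      "com.docker.network.driver.mtu": "1450"
    }
  }
}
```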

You could absolutely insert a rule in nftables on the VM

Could this rule be at the cloudnet or cloudvirt level instead? I wonder if pmtud is enough inside Cloud VPS and we need MSS clamping only for packets coming from outside.

We can discuss this in the Network Sync meeting this Wed.

I can only reproduce this issue on IPv6-enabled VMs. Is it because we don't use VXLAN on the old IPv4-only VMs?

Yeah the legacy neutron network for instances is just plain-l2 and doesn't have this constraint. It bridges directly into the cloud-instances2-b-eqiad vlan on the switches.

All the newer networks are virtual (i.e. they only exist in OpenStack/Neutron) and the hypervisors tunnel the packets to each other using VXLAN (the switches don't know about them).

You could absolutely insert a rule in nftables on the VM

Could this rule be at the cloudnet or cloudvirt level instead? I wonder if pmtud is enough inside Cloud VPS and we need MSS clamping only for packets coming from outside.

I think it'd be needed on the cloudvirt at least, so that a flow to another VM (which may go direct cloudvirt -> cloudvirt) would work. What I don't know is if Neutron supports setting it up, or how the use of Open vSwitch (rather than a linux bridge) affects the ability to use nftables.

We can discuss this in the Network Sync meeting this Wed.

Yep let's do that.

  1. Move cloudvirt/cloudnet to support jumbo frames ensuring VMs can have a 1500 byte MTU

Mentioning cloudvirt + jumbo frames tracking task here for cross-linking T330075: [cloudvirt] Enable jumbo frames on cloud-hosts/cloud-private interfaces

Today I'm draining a cloudvirt and I see this error in the logs (along with a failed migration):

ERROR nova.virt.libvirt.driver [None req-f78d8ae1-85ce-4c5e-8c70-7478259f91b0 novaadmin admin - - default default] [instance: cbaa974c-03dc-4840-87e3-f6a76f8402c5] Live Migration failure: unsupported configuration: Target network card MTU 1500 does not match source 1450: libvirt.libvirtError: unsupported configuration: Target network card MTU 1500 does not match source 1450
2025-11-14 23:30:39.060 2370371 ERROR nova.virt.libvirt.driver [None req-f78d8ae1-85ce-4c5e-8c70-7478259f91b0 novaadmin admin - - default default] [instance: cbaa974c-03dc-4840-87e3-f6a76f8402c5] Migration operation has aborted

I was able to migrate a bunch of other things, though. Was there a specific window during which new VMs got the cursed MTU set?

As @taavi predicted, a reboot --hard of that server reset the MTU and allowed it to migrate. So that's good, and suggests that maybe we only need to reboot a select subset of VMs to get everyone on the same page mtu-wise.

sudo cumin --backend openstack "*" 'ip addr | grep "mtu 1450"'

Returns 139 matches:

T389375.appservers.eqiad1.wikimedia.cloud,abogott-test-instance.account-creation-assistance.eqiad1.wikimedia.cloud,abogott-testvm.wikicommunityhealth.eqiad1.wikimedia.cloud,accounts-appserver7.account-creation-assistance.eqiad1.wikimedia.cloud,backend.wikicommunityhealth.eqiad1.wikimedia.cloud,bastion-eqiad1-[5-6].bastion.eqiad1.wikimedia.cloud,buttercup.wikifunctions.eqiad1.wikimedia.cloud,canary[1068-1076]-1.cloudvirt-canary.eqiad1.wikimedia.cloud,canary[1040,1043,1045-1046,1065]-4.cloudvirt-canary.eqiad1.wikimedia.cloud,canary[1042,1044,1047]-3.cloudvirt-canary.eqiad1.wikimedia.cloud,canary1041-5.cloudvirt-canary.eqiad1.wikimedia.cloud,canasta-test.pluggableauth.eqiad1.wikimedia.cloud,chartmuseum-2.cloudinfra.eqiad1.wikimedia.cloud,ci2.mediawiki-quickstart.eqiad1.wikimedia.cloud,ci-components.mediawiki-quickstart.eqiad1.wikimedia.cloud,coder-env-1.mobileappsperformance.eqiad1.wikimedia.cloud,content-diff-index.wmf-research-tools.eqiad1.wikimedia.cloud,copypatrol-backend-prod-02.copypatrol.eqiad1.wikimedia.cloud,ctt-prv-04.wikitextexp.eqiad1.wikimedia.cloud,cvn-apache11.cvn.eqiad1.wikimedia.cloud,cvn-app[13-14].cvn.eqiad1.wikimedia.cloud,dcl-dev1.puppet-dev.eqiad1.wikimedia.cloud,dcl.swift.eqiad1.wikimedia.cloud,debian13-test.dwl.eqiad1.wikimedia.cloud,deep-dive.analytics.eqiad1.wikimedia.cloud,demo-wiki.wikispeech.eqiad1.wikimedia.cloud,deployment-poolcounter07.deployment-prep.eqiad1.wikimedia.cloud,docker-registry-01.cloudinfra.eqiad1.wikimedia.cloud,enc-[3-4].cloudinfra.eqiad1.wikimedia.cloud,filippo-centrallog-02.o11y.eqiad1.wikimedia.cloud,filippo-cloudcephosd-01.o11y.eqiad1.wikimedia.cloud,filippo-clouddumps-01.o11y.eqiad1.wikimedia.cloud,filippo-cloudgw-01.o11y.eqiad1.wikimedia.cloud,filippo-cloudvirt-[01-02].o11y.eqiad1.wikimedia.cloud,filippo-tom-k8s-control-01.testlabs.eqiad1.wikimedia.cloud,filippo-tom-k8s-worker-01.testlabs.eqiad1.wikimedia.cloud,filippo-tom-pki-01.testlabs.eqiad1.wikimedia.cloud,filippo-tom-puppet-01.testlabs.eqiad1.wikimedia.cloud,filippo-tom-puppetdb-01.testlabs.eqiad1.wikimedia.cloud,font-db.signwriting.eqiad1.wikimedia.cloud,generator01.dumpstorrents.eqiad1.wikimedia.cloud,gitlab-1002.devtools.eqiad1.wikimedia.cloud,gitlab-docker-runner-v2.analytics.eqiad1.wikimedia.cloud,gitlab-runner-[1007-1008].devtools.eqiad1.wikimedia.cloud,glamspore-prod-01.wikispore.eqiad1.wikimedia.cloud,journalist1.wmgmc-monitoring.eqiad1.wikimedia.cloud,k3s-envDB.catalyst.eqiad1.wikimedia.cloud,k3s-test.devtools.eqiad1.wikimedia.cloud,k3s-worker01.catalyst-dev.eqiad1.wikimedia.cloud,k3s-worker[01-02].catalyst.eqiad1.wikimedia.cloud,k3s.catalyst.eqiad1.wikimedia.cloud,k3s.wikifunctions.eqiad1.wikimedia.cloud,language-lab.language.eqiad1.wikimedia.cloud,logging-logstash-04.logging.eqiad1.wikimedia.cloud,lpl-cx-sx2.language.eqiad1.wikimedia.cloud,lpl-mleb-master.language.eqiad1.wikimedia.cloud,lpl-mleb-stable2.language.eqiad1.wikimedia.cloud,lpl-mleb-stable.language.eqiad1.wikimedia.cloud,lpl-recommend.language.eqiad1.wikimedia.cloud,lpl-services.language.eqiad1.wikimedia.cloud,mariadbcompiler-trixie.mariadbtest.eqiad1.wikimedia.cloud,mcrouterbuild.testlabs.eqiad1.wikimedia.cloud,mcroutertest-[1-3].testlabs.eqiad1.wikimedia.cloud,mediawiki2latex.collection-alt-renderer.eqiad1.wikimedia.cloud,metricsinfra-grafana-2.metricsinfra.eqiad1.wikimedia.cloud,metricsinfra-thanos-fe-2.metricsinfra.eqiad1.wikimedia.cloud,microk8s.zuul3.eqiad1.wikimedia.cloud,networktests-vxlan-dualstack.testlabs.eqiad1.wikimedia.cloud,networktests-vxlan-ipv4only-fip.testlabs.eqiad1.wikimedia.cloud,networktests-vxlan-ipv4only.testlabs.eqiad1.wikimedia.cloud,nfs-client-2.testlabs.eqiad1.wikimedia.cloud,ntp-[5-6].cloudinfra.eqiad1.wikimedia.cloud,octaviatest-[1-2].testlabs.eqiad1.wikimedia.cloud,phi-alert-01.o11y.eqiad1.wikimedia.cloud,phi-arclamp-01.o11y.eqiad1.wikimedia.cloud,phi-grafana-01.o11y.eqiad1.wikimedia.cloud,phi-kafka-01.o11y.eqiad1.wikimedia.cloud,phi-kafkamon-01.o11y.eqiad1.wikimedia.cloud,phi-lb-01.o11y.eqiad1.wikimedia.cloud,phi-mwlog-01.o11y.eqiad1.wikimedia.cloud,phi-pki-01.o11y.eqiad1.wikimedia.cloud,phi-prometheus-[01-02].o11y.eqiad1.wikimedia.cloud,phi-puppet-01.o11y.eqiad1.wikimedia.cloud,phi-puppetdb-01.o11y.eqiad1.wikimedia.cloud,phi-syslog-01.o11y.eqiad1.wikimedia.cloud,phi-webperf-01.o11y.eqiad1.wikimedia.cloud,pixel.pixel.eqiad1.wikimedia.cloud,player1.wmgmc-monitoring.eqiad1.wikimedia.cloud,pontoon-demo-pki-01.testlabs.eqiad1.wikimedia.cloud,pontoon-demo-puppet-01.testlabs.eqiad1.wikimedia.cloud,pontoon-demo-puppetdb-01.testlabs.eqiad1.wikimedia.cloud,pontoon-demo-tf-services-01.testlabs.eqiad1.wikimedia.cloud,press1.wmgmc-monitoring.eqiad1.wikimedia.cloud,prod0.hashtags.eqiad1.wikimedia.cloud,project-proxy-acme-chief-03.project-proxy.eqiad1.wikimedia.cloud,rn-hcptchprxy-pki-01.appservers.eqiad1.wikimedia.cloud,rn-hcptchprxy-puppet-01.appservers.eqiad1.wikimedia.cloud,rn-hcptchprxy-puppetdb-01.appservers.eqiad1.wikimedia.cloud,rn-hcptchprxy-urldownloader-[01-02].appservers.eqiad1.wikimedia.cloud,runner-[1031-1040].gitlab-runners.eqiad1.wikimedia.cloud,section-ranker.recommendation-api.eqiad1.wikimedia.cloud,semantic-search.recommendation-api.eqiad1.wikimedia.cloud,taxonbot4.dwl.eqiad1.wikimedia.cloud,tcp-proxy-test.devtools.eqiad1.wikimedia.cloud,testlabs-nfs-2.testlabs.eqiad1.wikimedia.cloud,tf-registry-3.tofu.eqiad1.wikimedia.cloud,tmp.entity-detection.eqiad1.wikimedia.cloud,tools-bastion-15.tools.eqiad1.wikimedia.cloud,tools-db-7.tools.eqiad1.wikimedia.cloud,tools-harbor-2.tools.eqiad1.wikimedia.cloud,tools-k8s-worker-[112-113].tools.eqiad1.wikimedia.cloud,tools-k8s-worker-nfs-[80-82].tools.eqiad1.wikimedia.cloud,tools-legacy-redirector-3.tools.eqiad1.wikimedia.cloud,tools-nfs-3.tools.eqiad1.wikimedia.cloud,tools-prometheus-[8-9].tools.eqiad1.wikimedia.cloud,toolsbeta-nfs-[4-5].toolsbeta.eqiad1.wikimedia.cloud,toolsbeta-prometheus-2.toolsbeta.eqiad1.wikimedia.cloud,toolsbeta-test-k8s-worker-nfs-11.toolsbeta.eqiad1.wikimedia.cloud,trixie.search.eqiad1.wikimedia.cloud,util-abogott-trixie.testlabs.eqiad1.wikimedia.cloud,uwl.wikicommunityhealth.eqiad1.wikimedia.cloud,voterlists-1.voterlists.eqiad1.wikimedia.cloud,wikiapiary.mwstake.eqiad1.wikimedia.cloud,wikibase-metadata.wikidata-dev.eqiad1.wikimedia.cloud,wikidata-reconciliation-trixie.wikidata-reconciliation.eqiad1.wikimedia.cloud,wikipeoplestats-db01.wikipeoplestats.eqiad1.wikimedia.cloud,wikistats-trixie.wikistats.eqiad1.wikimedia.cloud,wikiwho-dev.globaleducation.eqiad1.wikimedia.cloud,wsexport-app-prod01.wikisource.eqiad1.wikimedia.cloud,xtools-dev08.xtools.eqiad1.wikimedia.cloud,xtools-prod[14-15].xtools.eqiad1.wikimedia.cloud,zuul-bastion-01.zuul.eqiad1.wikimedia.cloud,zuul-haproxy-01.zuul.eqiad1.wikimedia.cloud,zuul-puppetserver-01.zuul.eqiad1.wikimedia.cloud

Today I'm draining a cloudvirt and I see this error in the logs (along with a failed migration):

ERROR nova.virt.libvirt.driver [None req-f78d8ae1-85ce-4c5e-8c70-7478259f91b0 novaadmin admin - - default default] [instance: cbaa974c-03dc-4840-87e3-f6a76f8402c5] Live Migration failure: unsupported configuration: Target network card MTU 1500 does not match source 1450: libvirt.libvirtError: unsupported configuration: Target network card MTU 1500 does not match source 1450
2025-11-14 23:30:39.060 2370371 ERROR nova.virt.libvirt.driver [None req-f78d8ae1-85ce-4c5e-8c70-7478259f91b0 novaadmin admin - - default default] [instance: cbaa974c-03dc-4840-87e3-f6a76f8402c5] Migration operation has aborted

I was able to migrate a bunch of other things, though. Was there a specific window during which new VMs got the cursed MTU set?

If another VM ends up in this state, could you please leave it there for me to have a look at? This does not match what I've seen when live-migrating the VMs with an MTU mismatch, and I wouldn't have expected reboot --hard to fix it (as that's a reboot, not a stop-and-start).

ec318e06-1ddc-4856-8e37-17a2a5aeb0b3 | tcp-proxy-test on cloudvirt1044 is showing the migration issue.

2025-11-17 14:29:24.723 2029263 ERROR nova.virt.libvirt.driver [None req-ed4b924f-fdeb-48fc-ad4b-4df1116196c2 novaadmin admin - - default default] [instance: ec318e06-1ddc-4856-8e37-17a2a5aeb0b3] Live Migration failure: unsupported configuration: Target network card MTU 1500 does not match source 1450: libvirt.libvirtError: unsupported configuration: Target network card MTU 1500 does not match source 1450

Ah, I knew I could not be the first one to stumble upon this! Thanks for the link @bd808.

My fix for my one VM use case was indeed to tell puppet to set mtu=1450 via:

docker::configuration::settings:
  data-root: /mnt/docker-scratch/docker
  mtu: 1450

Mentioned in SAL (#wikimedia-cloud) [2025-11-24T19:48:16Z] <JJMC89> copypatrol-backend-prod-02 hard reboot for T408543