Page MenuHomePhabricator

Some traffic still flowing to mw-api-int after the switchover
Closed, ResolvedPublic

Event Timeline

jijiki changed the task status from Open to In Progress.Mar 26 2026, 10:05 AM
jijiki triaged this task as High priority.

tagging as high, given it shouldn't have happened

We had a bit of a hard time finding source IPs but from access logs it's clear that it's all over the place.

kafka-logging1001:~$ kafkacat -C -b kafka-logging1001.eqiad.wmnet:9093 -t mediawiki.httpd.accesslog -X security.protocol=SSL -o -1   2>/dev/null |  grep -F '"mw-api-int"' |  grep wikikube-worker2 | jq . | grep user_agent

Shows requests from mobileapps, wikifeeds, termbox, ... a roll restart of wikifeeds made it switch to eqiad.

  • All nameservers, recursors and coredns instances which are all returning the eqiad ip for mw-api-int-ro.discovery.wmnet and mw-api-int.discovery.wmnet.
  • envoy versions are all 1.35.x

Checking what IP(s) envoy has looked up for the mw-api-int cluster revealed that not even all replicas of a deployment are affected (10.2.1.81 is mw-api-int.svc.codfw.wmnet.)

curl -s --unix-socket run/envoy/admin.sock http://localhost/clusters | grep 10.2.1.81

Those pods who where affected are showing dns resolution errors in envoy debug:

[2026-03-26 11:56:46.853][1][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:194] dns resolution for mw-api-int.discovery.wmnet failed with c-ares status 0x000c: "Timeout while contacting DNS servers"
[2026-03-26 11:56:46.853][1][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:294] DNS request timed out 4 times  
(for multiple other names as well)

Unfortunately we're dropping the corresponding cluster statistic during scraping:

envoy_cluster_update_attempt: Total attempted cluster membership updates by service discovery
envoy_cluster_update_success: Total successful cluster membership updates by service discovery
envoy_cluster_update_failure: Total failed cluster membership updates by service discovery
12:14:48.410659 eth0  Out IP (tos 0x0, ttl 64, id 9615, offset 0, flags [DF], proto UDP (17), length 168)
    10.67.172.131.41326 > 10.64.72.3.53: [bad udp cksum 0x09af -> 0xcc26!] 46089+ [1au] A? main-opentelemetry-collector.opentelemetry-collector.svc.cluster.local.mobileapps.svc.cluster.local. ar: . OPT UDPsize=1232 [COOKIE b25a53a7a90c5b4f] (140)
12:14:48.415252 eth0  Out IP (tos 0x0, ttl 64, id 23860, offset 0, flags [none], proto UDP (17), length 79)
    10.67.172.131.38667 > 10.64.72.3.53: [bad udp cksum 0x0956 -> 0x7aa9!] 7370+ [1au] A? www.ccc.de. ar: . OPT UDPsize=1232 [COOKIE bd4461312ad92f17] (51)
12:14:48.514672 eth0  In  IP (tos 0x0, ttl 59, id 42018, offset 0, flags [DF], proto UDP (17), length 105)
    10.64.72.3.53 > 10.67.172.131.38667: [udp sum ok] 7370 q: A? www.ccc.de. 1/0/1 www.ccc.de. A 185.106.84.100 ar: . OPT UDPsize=1232 [COOKIE bd4461312ad92f17] (77)
12:14:48.679662 eth0  Out IP (tos 0x0, ttl 64, id 9616, offset 0, flags [DF], proto UDP (17), length 168)
    10.67.172.131.41326 > 10.64.72.3.53: [bad udp cksum 0x09af -> 0xcc26!] 46089+ [1au] A? main-opentelemetry-collector.opentelemetry-collector.svc.cluster.local.mobileapps.svc.cluster.local. ar: . OPT UDPsize=1232 [COOKIE b25a53a7a90c5b4f] (140)
12:14:48.892106 eth0  Out IP (tos 0x0, ttl 64, id 9617, offset 0, flags [DF], proto UDP (17), length 122)
    10.67.172.131.41326 > 10.64.72.3.53: [bad udp cksum 0x0981 -> 0xb619!] 47251+ [1au] A? restbase.discovery.wmnet.mobileapps.svc.cluster.local. ar: . OPT UDPsize=1232 [COOKIE b25a53a7a90c5b4f] (94)

Not sure whats up with this wild domain appending, but when I query one of those manually (dig @10.64.72.3 eventgate-main.discovery.wmnet.mobileapps.svc.cluster.local. from the pod netns) coredns still responds:

12:17:18.097389 eth0  Out IP (tos 0x0, ttl 64, id 50332, offset 0, flags [none], proto UDP (17), length 128)
    10.67.172.131.34006 > 10.64.72.3.53: [bad udp cksum 0x0987 -> 0x8331!] 22071+ [1au] A? eventgate-main.discovery.wmnet.mobileapps.svc.cluster.local. ar: . OPT UDPsize=1232 [COOKIE 1596b8dbb45c161f] (100)
12:17:18.097682 eth0  In  IP (tos 0x0, ttl 59, id 23454, offset 0, flags [DF], proto UDP (17), length 221)
    10.64.72.3.53 > 10.67.172.131.34006: [udp sum ok] 22071 NXDomain*- q: A? eventgate-main.discovery.wmnet.mobileapps.svc.cluster.local. 0/1/1 ns: cluster.local. SOA ns.dns.cluster.local. hostmaster.cluster.local. 1774527435 7200 1800 86400 5 ar: . OPT UDPsize=1232 [COOKIE 1596b8dbb45c161f] (193)

Lookups from envoy right after fail again:

12:17:21.069132 eth0  Out IP (tos 0x0, ttl 64, id 9959, offset 0, flags [DF], proto UDP (17), length 128)
    10.67.172.131.41326 > 10.64.72.3.53: [bad udp cksum 0x0987 -> 0xe5f8!] 25991+ [1au] A? eventgate-main.discovery.wmnet.mobileapps.svc.cluster.local. ar: . OPT UDPsize=1232 [COOKIE b25a53a7a90c5b4f] (100)
12:17:21.319133 eth0  Out IP (tos 0x0, ttl 64, id 9960, offset 0, flags [DF], proto UDP (17), length 128)
    10.67.172.131.41326 > 10.64.72.3.53: [bad udp cksum 0x0987 -> 0xe5f8!] 25991+ [1au] A? eventgate-main.discovery.wmnet.mobileapps.svc.cluster.local. ar: . OPT UDPsize=1232 [COOKIE b25a53a7a90c5b4f] (100)
12:17:21.575557 eth0  Out IP (tos 0x0, ttl 64, id 9961, offset 0, flags [DF], proto UDP (17), length 128)
    10.67.172.131.41326 > 10.64.72.3.53: [bad udp cksum 0x0987 -> 0xe5f8!] 25991+ [1au] A? eventgate-main.discovery.wmnet.mobileapps.svc.cluster.local. ar: . OPT UDPsize=1232 [COOKIE b25a53a7a90c5b4f] (100)

Change #1261485 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] prometheus::k8s: Ingest envoy cluster_update metrics

https://gerrit.wikimedia.org/r/1261485

One thing that seemed odd is that envoy constantly re-uses the same sport (41326 in the above examples) to send the DNS querie. Since this was the only thing really standing out as a difference compared to requests send by dig we checked the conntrack table again which initially seemed innocent:

udp      17 29 src=10.67.172.131 dst=10.64.72.3 sport=41326 dport=53 [UNREPLIED] src=10.64.72.3 dst=10.67.172.131 sport=53 dport=41326 mark=0 use=1

Whats wrong with that is that on the source IP of the reply direction is the kube-dns service IP, not the IP of a coredns Pod. This led to the assumption that iptables DNAT (which kube-proxy iptables rules does to send traffic for a service IP to service endpoints) did not properly happen before the rule was created. Envoy on the other hand seems to constantly reuse the UDP socket, so that the packages always hit this conntrack entry and thus it's never cleared from the table.

To clear this situation I ran conntrack -D --dst 10.64.72.3 --reply-src 10.64.72.3 -p udp --dport 53 on all wikikube-workers in codfw.

The working theory (which we can't properly proof I suppose) is that the affected envoys did send DNS queries in the during T374366: Race condition in iptables rules during puppet runs on k8s nodes - thus while the DNAT rules where not in place. The fact that envoy sticks to the same UDP socket on timeout makes the situation stay like that until the pod is destroyed during deployments or maintenance. The latter is the assumed reason for this not happening in codfw since we recently rebootet all k8s workers and only added new ones (e.g. triggering ferm reloads) in eqiad.

I have added a patch to include (envoy-)cluster upgrade as well as envoy_dns metrics into our scraping config (we're basically dropping all metrics apart from those we deemed useful in the past). This should give us the opportunity to check for/alert on raising cluster update/dns errors.

envoy also improved the handling of UDP sockets in 1.36 and 1.37:

  • 1.36 added max_udp_channel_duration: The maximum duration for which a UDP channel will be kept alive before being refreshed.
  • 1.37 added reinit_channel_on_timeout: If true, reinitialize the c-ares channel when a DNS query fails with ARES_ETIMEOUT.

Many thanks to @cmooney for supporting this adventurous journey!

These cumin commands can be used to quickly check for this issue re-appearing:

sudo cumin 'A:wikikube-master-codfw or A:wikikube-worker-codfw' 'conntrack -L --dst 10.192.72.3 --reply-src 10.192.72.3 2>/dev/null |grep UNREPLIED || true'
sudo cumin 'A:wikikube-master-eqiad or A:wikikube-worker-eqiad' 'conntrack -L --dst 10.64.72.3 --reply-src 10.64.72.3 2>/dev/null |grep UNREPLIED || true'

I updated Switch_Datacenter to include this task, for the next switchover.

Thanks for the write-up @JMeybohm. Definitely an odd one.

Ideally we would work out the root cause, where there is obviously a time when envoy can send dns queries when the coredns iptables rules aren't present, creating an invalid conntrack without NAT. That short-term issue then becomes a problem because:

  • Envoy does not make a new dns query if it gets a timeout, instead it just re-transits the failed packet (thus same source port).
  • These retries match the existing (non-natted) conntrack, so they skip the NAT rules and fail.
  • Envoy keeps sending these retries, preventing the "bad" conntrack from expiring as it would otherwise.

If we can upgrade Envoy and enable reinit_channel_on_timeout it should prevent the initial problem becoming a longer term problem. Envoy will re-try on a new port, and get an answer, and the bad conntrack will expire normally within 30 seconds.

Unfortunately I see the "bad" conntrack entries again today in eqiad and codfw (and I don't think we've added nodes, haven't checked yet):

(1) wikikube-worker-exp2001.codfw.wmnet                                                                                                                                             
----- OUTPUT of 'conntrack -L --d...NREPLIED || true' -----                                                                                                                         
udp      17 29 src=10.194.240.38 dst=10.192.72.3 sport=35667 dport=53 [UNREPLIED] src=10.192.72.3 dst=10.194.240.38 sport=53 dport=35667 mark=0 use=1                               
udp      17 29 src=10.194.240.33 dst=10.192.72.3 sport=59719 dport=53 [UNREPLIED] src=10.192.72.3 dst=10.194.240.33 sport=53 dport=59719 mark=0 use=1  
(1) wikikube-worker-exp1001.eqiad.wmnet                                                                                                                                             
----- OUTPUT of 'conntrack -L --d...NREPLIED || true' -----                                                                                                                         
udp      17 29 src=10.67.169.99 dst=10.64.72.3 sport=51897 dport=53 [UNREPLIED] src=10.64.72.3 dst=10.67.169.99 sport=53 dport=51897 mark=0 use=1                                   
udp      17 29 src=10.67.169.101 dst=10.64.72.3 sport=51355 dport=53 [UNREPLIED] src=10.64.72.3 dst=10.67.169.101 sport=53 dport=51355 mark=0 use=1
JMeybohm changed the task status from In Progress to Stalled.Apr 8 2026, 1:16 PM

Unfortunately I see the "bad" conntrack entries again today in eqiad and codfw (and I don't think we've added nodes, haven't checked yet):

Those went away on their own...

Marking as stalled waiting for signoff from Observability-Metrics on the prometheus config change

Change #1261485 merged by JMeybohm:

[operations/puppet@production] prometheus::k8s: Ingest envoy cluster_update metrics

https://gerrit.wikimedia.org/r/1261485

I've deployed the prometheus config change so we should have some additional visibility something like this should happen again.