
Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup failures, often for our own hosts such as gerrit.wikimedia.org
Closed, Resolved · Public


Event Timeline

bd808 removed hashar as the assignee of this task.

Seen in:

git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
  cmdline: git remote prune --dry-run origin
  stderr: 'fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/DiscussionTools/': Could not resolve host: gerrit.wikimedia.org'
Build step 'Execute shell' marked build as failure

Previously: T374830: Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup failures, often for our own hosts such as gerrit.wikimedia.org, thought to be caused by T380844: 2024-11-26 Toolforge DNS incident.

bd808 renamed this task from Various CI jobs failing with: Could not resolve host: gerrit.wikimedia.org to Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup failures, often for our own hosts such as gerrit.wikimedia.org. (Nov 27 2024, 4:35 PM)

Noting that this is not specific to this project; on tools we can see failures spread throughout the day from apiserver pods on k8s:

root@tools-k8s-control-9:~# kubectl logs --timestamps -n kube-system pod/kube-apiserver-tools-k8s-control-7 | grep timeout
2024-11-26T19:26:14.837618565Z W1126 19:26:14.837267       1 logging.go:59] [core] [Channel #18526 SubChannel #18527] grpc: addrConn.createTransport failed to connect to {Addr: "tools-k8s-etcd-23.tools.eqiad1.wikimedia.cloud:2379", ServerName: "tools-k8s-etcd-23.tools.eqiad1.wikimedia.cloud", }. Err: connection error: desc = "transport: Error while dialing: dial tcp: lookup tools-k8s-etcd-23.tools.eqiad1.wikimedia.cloud on 172.20.255.1:53: read udp 172.16.0.144:54305->172.20.255.1:53: i/o timeout"
...
2024-11-27T15:19:17.460516016Z W1127 15:19:17.460205       1 logging.go:59] [core] [Channel #59647 SubChannel #59648] grpc: addrConn.createTransport failed to connect to {Addr: "tools-k8s-etcd-24.tools.eqiad1.wikimedia.cloud:2379", ServerName: "tools-k8s-etcd-24.tools.eqiad1.wikimedia.cloud", }. Err: connection error: desc = "transport: Error while dialing: dial tcp: lookup tools-k8s-etcd-24.tools.eqiad1.wikimedia.cloud on 172.20.255.1:53: read udp 172.16.0.144:57729->172.20.255.1:53: i/o timeout"
2024-11-27T15:29:19.548211117Z W1127 15:29:19.547805       1 logging.go:59] [core] [Channel #59987 SubChannel #59988] grpc: addrConn.createTransport failed to connect to {Addr: "tools-k8s-etcd-24.tools.eqiad1.wikimedia.cloud:2379", ServerName: "tools-k8s-etcd-24.tools.eqiad1.wikimedia.cloud", }. Err: connection error: desc = "transport: Error while dialing: dial tcp: lookup tools-k8s-etcd-24.tools.eqiad1.wikimedia.cloud on 172.20.255.1:53: read udp 172.16.0.144:56126->172.20.255.1:53: i/o timeout"
2024-11-27T15:35:20.845977552Z W1127 15:35:20.845689       1 logging.go:59] [core] [Channel #60194 SubChannel #60195] grpc: addrConn.createTransport failed to connect to {Addr: "tools-k8s-etcd-24.tools.eqiad1.wikimedia.cloud:2379", ServerName: "tools-k8s-etcd-24.tools.eqiad1.wikimedia.cloud", }. Err: connection error: desc = "transport: Error while dialing: dial tcp: lookup tools-k8s-etcd-24.tools.eqiad1.wikimedia.cloud on 172.20.255.1:53: read udp 172.16.0.144:39192->172.20.255.1:53: i/o timeout"

Also, having read the whole task, I'd bet the problem is first of all network instability, as @bd808 has suggested.

One thing you could try is to force DNS resolution on a few machines in that group to use TCP (declaring "options use-vc" in resolv.conf, IIRC).

Using TCP we should get both more failure tolerance and better debugging info, at the cost of slightly more expensive DNS queries. If, after the change has been running for a few days, the machines using TCP have a lower failure rate, it's highly probable the problem is indeed with the network.
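A rough sketch of what the suggestion above could look like on a test host (untested; it assumes resolv.conf on those machines is not regenerated by puppet or resolvconf, which would need checking first — the RESOLV variable is only there so the snippet can be tried against a copy of the file):

```shell
# Append "options use-vc" so the glibc stub resolver uses TCP for lookups.
# RESOLV is overridable so this can be exercised on a copy first.
RESOLV="${RESOLV:-/etc/resolv.conf}"
grep -q '^options use-vc' "$RESOLV" || echo 'options use-vc' >> "$RESOLV"
```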

This idea from @Joe about trying TCP lookups rather than the default UDP lookups seems like something worth testing to see if we can get any new insights into where in the resolver stack things are getting lost. Could that actually be a useful monitoring variation to build out for T380980: openstack: introduce additional DNS monitoring and alerting?

From T374830#10198891, I went on contint1002 and ran the oneliner:

grep -l 'Could not resolve host' /srv/jenkins/builds/*quibble*/*/log|xargs ls --full-time -rt|awk '{ print $6 " " $7 }'

That yields the timestamps of the builds that failed:

For T374830

Ran it on Nov 27 at 18:50 UTC.

$ grep -l 'Could not resolve host' /srv/jenkins/builds/*quibble*/*/log|xargs ls --full-time -rt|awk '{ print $6 " " $7 }'

2024-11-24 06:33:49.633911825
2024-11-24 12:04:04.975860750
2024-11-24 12:04:07.743879029
2024-11-24 12:31:42.226620670
2024-11-24 12:31:47.222653292
2024-11-24 13:41:22.958085722
2024-11-24 14:14:13.058996736
2024-11-24 14:15:20.471438631
2024-11-24 14:30:01.565239375
2024-11-24 16:19:03.968079948
2024-11-25 00:35:18.907376327
2024-11-25 01:02:23.938102583
2024-11-25 01:02:24.814108331
2024-11-25 01:02:28.890135069
2024-11-25 02:47:28.643381595
2024-11-25 04:00:41.904254105
2024-11-25 04:11:06.108309815
2024-11-25 07:15:58.901500108
2024-11-25 08:17:38.061940647
2024-11-25 08:17:57.882071466
2024-11-25 08:18:06.798130319
2024-11-25 08:18:32.306298701
2024-11-25 09:29:25.890192953
2024-11-25 09:29:39.994285245
2024-11-25 09:37:28.557355523
2024-11-25 09:41:27.206921228
2024-11-25 09:41:30.606943540
2024-11-25 09:43:32.487743408
2024-11-25 09:43:35.439762785
2024-11-25 09:43:36.971772839
2024-11-25 09:43:40.959799015
2024-11-25 09:43:41.167800381
2024-11-25 10:04:08.255847499
2024-11-25 10:04:11.611869525
2024-11-25 10:04:11.619869577
2024-11-25 10:04:12.331874251
2024-11-25 10:04:13.907884593
2024-11-25 10:04:15.455894752
2024-11-25 10:04:16.235899872
2024-11-25 10:04:16.347900607
2024-11-25 10:04:20.859930218
2024-11-25 10:04:20.971930954
2024-11-25 10:04:25.263959122
2024-11-25 10:29:07.405744440
2024-11-25 10:29:17.837813729
2024-11-25 10:40:16.694172922
2024-11-25 10:40:19.426190953
2024-11-25 10:40:21.786206529
2024-11-25 10:43:45.439550054
2024-11-25 10:43:50.051580468
2024-11-25 10:43:54.227608006
2024-11-25 10:44:01.447655619
2024-11-25 10:44:02.075659761
2024-11-25 10:44:27.019824245
2024-11-25 10:44:27.131824984
2024-11-25 10:44:27.599828070
2024-11-25 10:44:30.679848379
2024-11-25 10:44:39.951909516
2024-11-25 10:44:41.651920725
2024-11-25 10:44:41.763921464
2024-11-25 10:44:41.763921464
2024-11-25 10:44:46.087949974
2024-11-25 10:44:46.127950238
2024-11-25 10:44:57.252023587
2024-11-25 10:46:18.724560712
2024-11-25 10:46:21.188576954
2024-11-25 10:47:05.520869178
2024-11-25 10:47:05.684870259
2024-11-25 10:47:05.752870708
2024-11-25 10:47:07.388881491
2024-11-25 10:47:08.836891035
2024-11-25 10:47:10.768903770
2024-11-25 10:47:25.068998023
2024-11-25 10:47:26.953010440
2024-11-25 10:47:27.193012022
2024-11-25 10:47:28.577021144
2024-11-25 10:47:39.201091168
2024-11-25 10:47:43.861121881
2024-11-25 10:47:44.073123278
2024-11-25 10:47:44.265124544
2024-11-25 10:47:44.633126969
2024-11-25 10:47:44.685127312
2024-11-25 10:47:45.985135880
2024-11-25 10:47:46.409138675
2024-11-25 10:47:48.237150723
2024-11-25 10:48:05.645265455
2024-11-25 10:48:05.797266456
2024-11-25 10:48:06.061268196
2024-11-25 10:48:08.045281272
2024-11-25 10:48:09.365289971
2024-11-25 10:48:09.593291474
2024-11-25 10:48:16.557337371
2024-11-25 10:48:16.681338188
2024-11-25 10:48:33.197447034
2024-11-25 10:48:33.213447140
2024-11-25 10:48:33.617449802
2024-11-25 10:48:34.053452676
2024-11-25 10:48:34.245453941
2024-11-25 10:48:34.525455786
2024-11-25 10:48:34.633456498
2024-11-25 10:48:34.993458870
2024-11-25 10:48:35.557462588
2024-11-25 10:48:35.653463220
2024-11-25 10:50:34.422245847
2024-11-25 10:50:34.550246691
2024-11-25 10:50:36.722261002
2024-11-25 10:50:38.446272361
2024-11-25 10:50:39.074276500
2024-11-25 10:52:15.678912976
2024-11-25 10:52:15.682913003
2024-11-25 10:52:15.978914953
2024-11-25 10:52:16.110915822
2024-11-25 10:52:16.278916929
2024-11-25 12:31:54.122472952
2024-11-25 12:31:54.942478386
2024-11-25 12:31:59.902511246
2024-11-25 12:31:59.974511723
2024-11-25 12:32:05.806550359
2024-11-25 12:32:11.654589099
2024-11-25 12:32:11.790590001
2024-11-25 12:32:15.142612207
2024-11-25 12:32:15.286613161
2024-11-25 13:54:26.298967234
2024-11-25 17:13:44.337615206
2024-11-25 17:46:20.258821678
2024-11-25 20:30:11.583850904
2024-11-25 21:11:14.375423085
2024-11-25 21:15:21.101018011
2024-11-25 21:21:39.835468390
2024-11-25 21:21:41.459478901
2024-11-25 23:05:14.023499348
2024-11-25 23:05:18.867530670
2024-11-25 23:17:26.120228789
2024-11-25 23:42:08.314286267
2024-11-26 02:06:41.695413082
2024-11-26 03:40:28.101309494
2024-11-26 16:26:53.128824231
2024-11-26 19:32:46.681203506
2024-11-26 21:30:13.863148378
2024-11-26 21:30:31.515264450
2024-11-26 21:43:59.624570383
2024-11-26 21:44:00.148573821
2024-11-26 21:44:00.608576839
2024-11-26 21:44:01.452582376
2024-11-26 22:13:08.052037934
2024-11-26 22:13:10.408053339
2024-11-26 22:13:10.812055981
2024-11-26 22:13:11.148058177
2024-11-26 22:13:12.652068009
2024-11-27 09:07:05.030602092
2024-11-27 09:07:11.838646711
2024-11-27 12:27:58.201251718
2024-11-27 14:07:28.604812413
2024-11-27 14:07:35.128855561
2024-11-27 14:36:30.392146426
2024-11-27 14:36:35.856182326
2024-11-27 14:36:36.636187452
2024-11-27 15:10:40.545546018
2024-11-27 15:10:41.165550085
2024-11-27 15:10:43.069562574
2024-11-27 15:10:43.257563808
2024-11-27 15:10:44.833574146
2024-11-27 15:10:46.333583986
2024-11-27 15:10:48.105595610
2024-11-27 15:10:48.241596503
2024-11-27 15:16:51.955969741
2024-11-27 15:16:52.187971215
2024-11-27 15:16:53.363978679
2024-11-27 15:17:01.460030083
2024-11-27 15:19:10.864854537
2024-11-27 15:19:30.820982105
2024-11-27 15:19:35.345011040
2024-11-27 15:29:19.048775247
2024-11-27 15:29:22.492797581
2024-11-27 15:29:28.768838282
2024-11-27 15:35:27.023165871
2024-11-27 15:35:41.899262654
2024-11-27 15:35:41.899262654
2024-11-27 15:35:46.583293132
2024-11-27 15:35:51.035322101
2024-11-27 15:35:52.439331236
2024-11-27 15:35:56.567358098
2024-11-27 15:35:56.851359945
2024-11-27 15:35:57.175362055
2024-11-27 16:38:23.760332183
2024-11-27 16:38:24.012333912
2024-11-27 17:09:22.004778628
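As a follow-up, one way to look for patterns in the failure timestamps above is to bucket them per hour; a small sketch (assumes lines in "YYYY-MM-DD HH:MM:SS.nnn" form on stdin, nothing project-specific):

```shell
# Bucket "YYYY-MM-DD HH:MM:SS.nnn" lines by hour and print the busiest buckets,
# to check whether failures cluster in bursts or are evenly spread.
awk '{ split($2, t, ":"); print $1, t[1] ":00" }' | sort | uniq -c | sort -rn | head
```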

This happened recently to a user on a random cloud VPS instance and in that case /etc/resolv.conf simply did not have any nameserver line in it. Then we added that back manually (after comparing with another instance) and that fixed the issue there.

Of course the question remained what specifically removed that.
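A hypothetical spot check for the symptom described above (an instance whose resolv.conf lost its nameserver lines); the RESOLV variable is only there so it can be exercised against a test file:

```shell
# Warn when resolv.conf contains no nameserver entries at all, the state
# found on the affected instance above.
RESOLV="${RESOLV:-/etc/resolv.conf}"
grep -q '^nameserver' "$RESOLV" || echo "WARNING: no nameserver entries in $RESOLV" >&2
```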


Thanks @Dzahn! I believe this to be T379927 which is at least mostly fixed.

I ran

watch -n 0.1 -e dig gerrit.wikimedia.org

and

ping -D  172.20.255.1

in separate terminals at the same time.

The dig test timed out at Sat Nov 30 23:20:26 2024.

In the ping test, I see a disruption:

[1733008825.925129] 64 bytes from 172.20.255.1: icmp_seq=577 ttl=61 time=0.624 ms
[1733008826.948998] 64 bytes from 172.20.255.1: icmp_seq=578 ttl=61 time=0.381 ms
[1733008827.972982] 64 bytes from 172.20.255.1: icmp_seq=579 ttl=61 time=0.503 ms
[1733008829.848244] 64 bytes from 172.20.255.1: icmp_seq=580 ttl=61 time=852 ms
[1733008829.997627] 64 bytes from 172.20.255.1: icmp_seq=581 ttl=61 time=0.385 ms
[1733008831.835319] 64 bytes from 172.20.255.1: icmp_seq=582 ttl=61 time=823 ms
[1733008832.835334] 64 bytes from 172.20.255.1: icmp_seq=583 ttl=61 time=822 ms

Those don't align perfectly (the ping complaint is 3 seconds later than the dig failure), but it does suggest the network is failing intermittently, at least to that VIP.

Now I'm watching two dueling ping logs from the tools bastion:

ping -D  172.20.255.1 # the cloud-vps recursor

and

ping -D  208.80.154.224 #  en.wikipedia.org

They have synchronized bouts of 1000x slowdowns, although for some reason the slowdowns to the recursor vip are much more obvious.

172.20.255.1:

[1733010160.996819] 64 bytes from 172.20.255.1: icmp_seq=346 ttl=61 time=0.309 ms
[1733010162.020879] 64 bytes from 172.20.255.1: icmp_seq=347 ttl=61 time=0.372 ms
[1733010163.045158] 64 bytes from 172.20.255.1: icmp_seq=348 ttl=61 time=0.388 ms
[1733010164.068874] 64 bytes from 172.20.255.1: icmp_seq=349 ttl=61 time=0.288 ms
[1733010166.841273] 64 bytes from 172.20.255.1: icmp_seq=351 ttl=61 time=725 ms
[1733010167.828297] 64 bytes from 172.20.255.1: icmp_seq=352 ttl=61 time=710 ms
[1733010168.827217] 64 bytes from 172.20.255.1: icmp_seq=353 ttl=61 time=708 ms
[1733010169.845681] 64 bytes from 172.20.255.1: icmp_seq=354 ttl=61 time=726 ms
[1733010170.828452] 64 bytes from 172.20.255.1: icmp_seq=355 ttl=61 time=708 ms
[1733010171.832248] 64 bytes from 172.20.255.1: icmp_seq=356 ttl=61 time=712 ms
[1733010172.829873] 64 bytes from 172.20.255.1: icmp_seq=357 ttl=61 time=708 ms
[1733010173.827529] 64 bytes from 172.20.255.1: icmp_seq=358 ttl=61 time=705 ms
[1733010174.831098] 64 bytes from 172.20.255.1: icmp_seq=359 ttl=61 time=707 ms
[1733010175.124916] 64 bytes from 172.20.255.1: icmp_seq=360 ttl=61 time=0.412 ms
[1733010176.133000] 64 bytes from 172.20.255.1: icmp_seq=361 ttl=61 time=0.398 ms
[1733010177.157015] 64 bytes from 172.20.255.1: icmp_seq=362 ttl=61 time=0.456 ms

208.80.154.224:

[1733010169.444875] 64 bytes from 208.80.154.224: icmp_seq=353 ttl=60 time=0.369 ms
[1733010170.472818] 64 bytes from 208.80.154.224: icmp_seq=354 ttl=60 time=0.371 ms
[1733010171.829655] 64 bytes from 208.80.154.224: icmp_seq=355 ttl=60 time=337 ms
[1733010172.496471] 64 bytes from 208.80.154.224: icmp_seq=356 ttl=60 time=0.290 ms
[1733010173.509201] 64 bytes from 208.80.154.224: icmp_seq=357 ttl=60 time=0.456 ms
[1733010174.532998] 64 bytes from 208.80.154.224: icmp_seq=358 ttl=60 time=0.459 ms

Confirmed that the slow ping times happen whether or not there are extra DNS lookups happening. However, the slow pings seem to be cloudvirt-specific; I don't see a correlation between slow pings on tools-login (cloudvirt1049) and util-abogott-bookworm (cloudvirt1038).

This is possibly related to https://phabricator.wikimedia.org/T381078, although the timing doesn't really line up.

This is still happening and it seems hard to get things merged.

I had some hope that work on T381373 would help with this but it doesn't seem to. I'm still seeing ping hesitations today:

[1733241655.908881] 64 bytes from 172.20.255.1: icmp_seq=74 ttl=61 time=0.380 ms
[1733241656.932868] 64 bytes from 172.20.255.1: icmp_seq=75 ttl=61 time=0.340 ms
[1733241658.837908] 64 bytes from 172.20.255.1: icmp_seq=76 ttl=61 time=881 ms
[1733241658.958290] 64 bytes from 172.20.255.1: icmp_seq=77 ttl=61 time=0.412 ms
[1733241659.973185] 64 bytes from 172.20.255.1: icmp_seq=78 ttl=61 time=0.531 ms

FWIW I couldn't replicate this over 1,000 pings from a VPS instance; the worst response time was 5.9ms, average 0.4ms.

cmooney@netops-clab:~$ mtr -b -w -c 1000 172.20.255.1 
Start: 2024-12-03T17:37:56+0000
HOST: netops-clab                                                          Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- cloudinstances2b-gw.svc.eqiad.wmflabs (172.16.0.1)                    0.1%  1000    0.4   0.2   0.2   6.1   0.2
  2.|-- vlan1107.cloudgw1002.eqiad1.wikimediacloud.org (185.15.56.234)        0.1%  1000    0.5   0.3   0.3   6.4   0.2
  3.|-- irb-1120.cloudsw1-d5-eqiad.eqiad1.wikimediacloud.org (185.15.56.243)  0.1%  1000    1.2   7.4   1.0 133.7  12.2
  4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.0%  1000    0.5   0.4   0.3   5.9   0.3

Might be worth running the same test, @Andrew; it could perhaps show us if the issue is a particular node on the path.

andrew@bastion-restricted-eqiad1-3:~$ mtr -b -w -c 1000 172.20.255.1
Start: 2024-12-03T18:00:13+0000
HOST: bastion-restricted-eqiad1-3                                          Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- cloudinstances2b-gw.svc.eqiad.wmflabs (172.16.0.1)                    0.0%  1000    0.2   0.2   0.2   1.8   0.1
  2.|-- vlan1107.cloudgw1002.eqiad1.wikimediacloud.org (185.15.56.234)        0.9%  1000    0.3   0.3   0.2 217.8   6.9
  3.|-- irb-1120.cloudsw1-d5-eqiad.eqiad1.wikimediacloud.org (185.15.56.243)  1.1%  1000    2.7   7.0   0.8 104.8  10.4
  4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.7%  1000    0.4  22.7   0.2 918.4 128.4
andrew@tools-bastion-13:~$ sudo su -
root@tools-bastion-13:~# mtr -b -w -c 1000 172.20.255.1
Start: 2024-12-03T17:59:50+0000
HOST: tools-bastion-13                                                     Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- cloudinstances2b-gw.svc.eqiad.wmflabs (172.16.0.1)                    0.0%  1000    0.2   0.2   0.2   1.7   0.1
  2.|-- vlan1107.cloudgw1002.eqiad1.wikimediacloud.org (185.15.56.234)        0.5%  1000    0.4   0.5   0.2 297.9   9.4
  3.|-- irb-1120.cloudsw1-d5-eqiad.eqiad1.wikimediacloud.org (185.15.56.243)  1.1%  1000    6.7  17.9   0.9 1038.  80.5
  4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.9%  1000    0.4  22.4   0.3 806.2 117.4

I'm thinking this correlates with the hypervisor, so here's that test run on one canary VM on each of our cloudvirts:

P71503

I was hoping the slow lookups would correspond to rack or row, but it seems not.

Rack C8
HOST: canary1031-3 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.0%  1000    0.5   0.3   0.3   8.6   0.4
HOST: canary1032-6 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.0%  1000    0.3   0.3   0.3   9.9   0.4
HOST: canary1033-5 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.6%  1000    0.5  36.0   0.3 1617. 164.4
HOST: canary1034-4 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        1.0%  1000    0.4   0.3   0.3   1.8   0.1
HOST: canary1035-4 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.6%  1000    0.4   0.3   0.3   3.3   0.2

Rack D5
HOST: canary1036-1 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.6%  1000    0.4   0.3   0.2   2.6   0.1
HOST: canary1037-3 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.1%  1000    0.3   0.3   0.2   5.9   0.2
HOST: canary1038-4 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.9%  1000    0.4   0.3   0.2   3.7   0.2
HOST: canary1039-5 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.3%  1000    0.4   0.3   0.2   2.8   0.2
HOST: canary1040-3 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.0%  1000    0.5   0.3   0.3  18.9   0.7
HOST: canary1041-3 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.0%  1000    0.3   0.3   0.2   1.5   0.1
HOST: canary1042-2 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.0%  1000    0.4   0.3   0.3   2.1   0.1
HOST: canary1043-3 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        1.1%  1000    0.4  24.0   0.3 976.2 131.7
HOST: canary1044-2 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        2.9%  1000    0.4  12.9   0.2 1095.  95.8
HOST: canary1045-3 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.3%  1000    0.4  36.6   0.2 1052. 162.9
HOST: canary1046-3 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.6%  1000    0.4   0.5   0.2 337.8  10.7
HOST: canary1047-2 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.0%  1000    0.4   0.3   0.3   3.8   0.2
HOST: canarylocal1001-3 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        4.7%  1000    0.7  10.1   0.3 1202.  80.9

Rack E4
HOST: canary1048-2 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        3.9%  1000    0.4   9.4   0.2 1154.  76.6
HOST: canary1049-6 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.3%  1000    0.3   0.3   0.3  21.5   0.7
HOST: canary1051-2 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.3%  1000    0.3   0.3   0.2   2.8   0.2
HOST: canary1054-4 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.7%  1000    0.4   0.3   0.3   2.7   0.2
HOST: canary1055-4 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        1.0%  1000    0.4   0.3   0.2   2.0   0.1
HOST: canary1056-2 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        4.1%  1000    0.5   9.9   0.3 1119.  84.9
HOST: canary1057-2 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        1.1%  1000    0.5   0.3   0.3   2.2   0.2
HOST: canarylocal1002-3 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        1.3%  1000    0.4  36.1   0.3 824.4 158.2

Rack F4
HOST: canary1050-2 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.0%  1000    0.4   0.3   0.3   3.1   0.2
HOST: canary1052-2 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.9%  1000    0.4   0.3   0.2  16.4   0.5
HOST: canary1053-2 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.5%  1000    0.3   0.3   0.3   6.9   0.3
HOST: canary1058-2 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.4%  1000    0.3   0.3   0.2   5.4   0.2
HOST: canary1059-3 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.0%  1000    0.3   0.3   0.3  13.6   0.4
HOST: canary1060-2 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        3.7%  1000    0.4  13.7   0.3 1130.  98.5
HOST: canary1061-2 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.0%  1000    0.3  47.9   0.3 1613. 190.3
HOST: canary1062-2 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        2.6%  1000    0.6  18.8   0.3 1194. 118.3
HOST: canary1063-1 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.3%  1000    0.7   0.6   0.3   2.5   0.2
HOST: canary1064-2 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.1%  1000    0.4  42.4   0.3 824.0 167.0
HOST: canary1065-3 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.8%  1000    0.5   0.4   0.3   7.5   0.3
HOST: canary1066-2 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.0%  1000    0.5  49.5   0.3 1602. 193.8
HOST: canary1067-2 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.1%  1000    0.3   0.3   0.3   4.3   0.2
HOST: canarylocal1003-3 4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.7%  1000    0.6   0.5   0.3   3.2   0.2

The set of 'slow' hypervisors includes cloudvirt1033, which is currently drained and running only a canary. So that rules out noisy neighbor issues within a cloudvirt.

This comment was removed by cmooney.

Some very interesting results Andrew.

First thing I notice is that none of the traces show any high latency to the first hop (the cloudnet gw; there are actually a few hops in that path, but as they are at layer 2 they don't show, so the cloudnet / neutron router is the first hop we see). P71503#286481

That's where it gets a little murkier: in some of the traces the high RTT is present at the second hop (cloudgw), indicating perhaps some problem between cloudnet and cloudgw, whereas on others the cloudgw hop is fine but the high RTT shows up at the third one, the cloudsw.

The problem seems to be reproducible from the cloudnet box itself, so I think we can rule out the VMs or cloudvirts as playing a role here:

root@cloudnet1005:~# mtr -b -w -c 1000 --address 172.16.0.1 172.20.255.1
Start: 2024-12-03T20:51:46+0000
HOST: cloudnet1005  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 185.15.56.234  0.2%  1000    0.2   0.1   0.1   0.5   0.0
  2.|-- 185.15.56.243  0.0%  1000    6.5   7.3   0.7 102.6  11.3
  3.|-- 172.20.255.1   0.2%  1000    0.2  22.5   0.1 1519. 124.8

HOWEVER, if I make a lot of DNS queries from the same place the response time is always fine; we see no high response times.

What is the underlying reported issue here? As it's DNS we might want to focus on that and forget the pings.

I ran this for a few mins to test:

while true; do dig +nsid -b 172.16.0.1 www.google.ie @172.20.255.1 | egrep "NSID|Query time"; echo ""; done | tee /tmp/query_check.txt
root@cloudnet1005:~# grep -c "Query time" /tmp/query_check.txt 
1427
root@cloudnet1005:~# grep -c "Query time: 0 msec" /tmp/query_check.txt 
1218
root@cloudnet1005:~# grep -c "Query time: 4 msec" /tmp/query_check.txt 
205

There were a handful of slightly longer responses, but I'm not sure a 500ms response would be causing the issues mentioned further up in this task; I think we should focus on those timeouts.

The primary/reported issue is failing dns lookups. I can reproduce that from a VM with e.g.

watch -n 0.1 -e dig @172.20.255.1 gerrit.wikimedia.org

I'm trying that on a cloudnet and so far it hasn't failed.

This is what I got this morning:

hashar@integration-agent-qemu-1003:~$ sudo mtr -r -c 100 -b 172.20.255.1
Start: 2024-12-04T08:40:01+0000
HOST: integration-agent-qemu-1003 Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- cloudinstances2b-gw.svc.e  0.0%   100    0.4   0.3   0.2   1.0   0.1
  2.|-- vlan1107.cloudgw1002.eqia  0.0%   100    0.5   0.3   0.3   0.6   0.1
  3.|-- irb-1120.cloudsw1-d5-eqia  0.0%   100    3.7   7.7   1.0  50.0   8.7
  4.|-- ns-recursor.openstack.eqi  0.0%   100    0.5  19.9   0.3 839.8 120.6

I am not sure how much I would trust ICMP packets; it used to be that on some hardware the ASIC processing of ICMP packets would be delayed to leave room for more important packets. Then again, I haven't done networking for 15+ years now.

I had a trace earlier that showed high packet loss on the cloudsw1. Could that host have an issue? Are the "switches" monitored? They are Juniper hosts and should definitely have all the metrics/monitoring/alarms one can imagine.


That one is quite interesting. Ultimately it's not showing any huge problems: hop 3 looks like it had perhaps one response out of profile, hop 4 perhaps the same (although an insane RTT on the worst). But zero loss, and the vast majority of replies have the expected low RTT.


Yes, you are correct: usually the ASIC has policers to limit how many packets get passed to the router CPU to be processed (to protect it). So if we see loss at the network devices (here hop 3 only; there are other devices in the path, but at layer 2, so they don't appear in the trace) it's more likely to be that than dropped packets in the forwarding plane.

The same is true to some extent for the end hosts. How much CPU is devoted to responding to pings and how they are queued will affect the results, which is likely to be different to how application packets are treated.

Overall I would agree, the ICMP tests are probably not revealing much here. Indeed to go back to the original problem a DNS response after 800ms would not result in a timeout (despite that being criminally long for a response), so I suspect there is something else going on here. Let's focus on trying to replicate the DNS timeouts and see if we can find a pattern.
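A sketch of what a "replicate under normal load" probe could look like: a low-rate loop that only logs failed lookups with timestamps, so they can later be correlated with network events. The dig invocation mirrors the one used elsewhere in this task; the cadence, log path, and variable names are arbitrary choices, not anything agreed here.

```shell
# Low-rate DNS probe: log a timestamped line whenever a lookup fails.
# PROBE, LOG, N (0 = run forever) and INTERVAL are overridable for testing.
PROBE="${PROBE:-dig +time=2 +tries=1 @172.20.255.1 gerrit.wikimedia.org}"
LOG="${LOG:-/tmp/dns_probe.log}"
N="${N:-0}"
i=0
while [ "$N" -eq 0 ] || [ "$i" -lt "$N" ]; do
  $PROBE >/dev/null 2>&1 || echo "$(date -u +%FT%TZ) lookup failed" >> "$LOG"
  i=$((i + 1))
  sleep "${INTERVAL:-1}"
done
```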

For the record, I've just seen a DNS timeout in an unrelated project (tofuinfratest).

Context: this is a VM where we run a script once a day that uses OpenTofu to create a few Cloud VPS resources, and alert if something is not working. We have logs since April and this is the first time this error is recorded.

ssh root@tf-bastion.tofuinfratest.eqiad1.wikimedia.cloud
root@tf-bastion:~# journalctl -t tf-infra-test -f --since today

[...]

Dec 04 12:09:16 tf-bastion tf-infra-test[2945133]: │ Unable to read prefix 21137: Get
Dec 04 12:09:16 tf-bastion tf-infra-test[2945133]: │ "https://puppet-enc.cloudinfra.wmcloud.org/v1/tofuinfratest/prefix/id/21137":
Dec 04 12:09:16 tf-bastion tf-infra-test[2945133]: │ dial tcp: lookup puppet-enc.cloudinfra.wmcloud.org on 172.20.255.1:53: read
Dec 04 12:09:16 tf-bastion tf-infra-test[2945133]: │ udp 172.16.6.36:38794->172.20.255.1:53: i/o timeout

With

sudo cumin --force O{name:"canary*"} 'set -e; for i in {1..10000}; do dig +short @172.20.255.1 gerrit.wikimedia.org; done'

I get failure on every canary after not all that long. So, the DNS issue is not specific to the hypervisor in any way.


Take a tcpdump on the DNS resolver (cloudservices1005 I think?) while doing this from one of the canaries, as well as on the canary side. And we can try to assess if the packets aren't reaching the resolver, or are reaching it but not being responded to, or are being responded to but the responses lost on the path etc.
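A hedged sketch of that two-sided capture; the resolver host and CANARY_IP are assumptions to confirm first, and setting DRY_RUN=1 just prints the commands instead of running them:

```shell
# Capture DNS traffic on both ends so lost queries/responses can be matched up.
# CANARY_IP is a placeholder; DRY_RUN=1 echoes commands rather than executing.
CANARY_IP="${CANARY_IP:-172.16.0.144}"
run() { if [ -n "${DRY_RUN:-}" ]; then echo "$@"; else sudo "$@"; fi; }
# On the DNS resolver host:
run tcpdump -ni any -w /tmp/dns-server.pcap port 53 and host "$CANARY_IP"
# On the canary VM, in parallel:
run tcpdump -ni any -w /tmp/dns-client.pcap port 53 and host 172.20.255.1
```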

Just to check: is it useful to post further examples of failed builds in here? (I’ve got plenty of them, I just haven’t been posting them recently.)

@Lucas_Werkmeister_WMDE I don't think it's necessary to post more examples, we're able to reproduce the issue albeit not with regularity. Most of the SREs are at an offsite and/or out sick this week so progress towards a fix will not be super fast unfortunately.

@bd808 suspects that this is a conntrack table issue, which fits the pattern of the failure. A table overflow should show up in logs, though, and I can't turn up anything with greps e.g.

sudo cumin 'cloud*' "dmesg | grep conntrack"

I remembered we had monitoring somewhere that would check conntrack table size.

This looks like it, from puppet repo:

modules/base/files/firewall/check_conntrack.py:        print("CRITICAL: nf_conntrack is %d %% full" % full)
modules/base/files/firewall/check_conntrack.py:        print("WARNING: nf_conntrack is %d %% full" % full)
modules/base/files/firewall/check_conntrack.py:        print("OK: nf_conntrack is %d %% full" % full)
modules/base/files/firewall/check_conntrack.py:        print("UNKNOWN: error reading nf_conntrack")

Also cat /proc/sys/net/netfilter/nf_conntrack_count should show the current size.
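For reference, the math behind that check is just a percentage of the two standard /proc counters; a small sketch mirroring check_conntrack.py (the function name and thresholds are illustrative, not from the puppet module):

```shell
# Percentage fill of the conntrack table, as check_conntrack.py computes it.
conntrack_pct() {  # args: current count, table limit
  echo $(( 100 * $1 / $2 ))
}
# On a live host (assumes the nf_conntrack module is loaded):
#   conntrack_pct "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
#                 "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
```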

@bd808 suspects that this is a conntrack table issue, which fits the pattern of the failure. A table overflow should show up in logs, though, and I can't turn up anything with greps e.g.

On which system is there a suspected conntrack problem?

These are exposed by the Prometheus node exporter and can be graphed, under these names:

node_nf_conntrack_entries
node_nf_conntrack_entries_limit

We have them on this dashboard, checking for the physical hosts I know are involved here I'm not seeing anything suspicious, but perhaps the limit is getting hit somewhere else:

https://grafana.wikimedia.org/goto/ORHaAOIHg

Ok so I did 100,000 queries from a VM instance I have access to (netops-clab running on cloudvirt1046) and had zero timeouts.

cmooney@wikilap:~$ grep -c "Query time" dig_checks2.txt 
100000
cmooney@wikilap:~$ grep -ci "timeout" dig_checks2.txt 
0

The pcaps I took won't reveal anything here, as every query was answered.

Of the results 99.91% were under 10ms response time, with the average being 1.4ms. 85 in total were higher than 10ms, with one as high as 1.5 seconds. I really don't see any signs of an issue overall, however. Bottom line, the network and DNS server appear to be fine; it looks like a red herring to me.

I'd also question the relevance of being able to produce dns timeouts by swamping the resolver/network using multiple hosts in parallel making queries as quickly as they can. The resolver, like any other system, will have an upper limit to how many queries it can answer, and we can probably engineer a way to hit that limit if we want.

The user-reported problems are occurring when we have normal usage levels, so we should be able to reproduce the issue under normal load conditions.

@hashar do the containers in question map the system /etc/resolv.conf such that if I alter it the change will take effect immediately in the containers? Or do I need to alter the container config somehow?

@hashar do the containers in question map the system /etc/resolv.conf such that if I alter it the change will take effect immediately in the containers? Or do I need to alter the container config somehow?

Presenting resolv.conf into the container should all be taken care of by the Docker runtime, so it should "just work". https://docs.docker.com/engine/network/#dns-services

Change #1105332 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloud-vps dns recursors: increase # of threads x 3

https://gerrit.wikimedia.org/r/1105332

Change #1105332 merged by Andrew Bogott:

[operations/puppet@production] cloud-vps dns recursors: increase # of threads x 3

https://gerrit.wikimedia.org/r/1105332

Using hashar's test

sudo grep -l 'Could not resolve host' /srv/jenkins/builds/*quibble*/*/log|xargs ls --full-time -rt|awk '{ print $6 " " $7 }'

we seem to have now gone 24 hours without a failure. That's not definitive (or unprecedented) but it is encouraging. Two things have changed since the last logged failure:

  • Increased threads for the recursors (see attached patch above)
  • forced failover from a distressed cloudgw host T382220

The last DNS failures I saw were indeed just over 24 hours ago (13:30 UTC yesterday), but slightly after that Puppet change was merged IIUC (but maybe before it rolled out everywhere?).

The last DNS failures I saw were indeed just over 24 hours ago (13:30 UTC yesterday), but slightly after that Puppet change was merged IIUC (but maybe before it rolled out everywhere?).

Rolling out the puppet change caused the dns recursors to restart, I suspect that's what we're seeing with those last failures.

2024-12-19 16:33:54.691933412
2024-12-19 16:41:52.098975077
2024-12-19 16:41:57.195007683

still happening

FWIW I left another test running overnight.

1.65 million queries, two timeouts. Which is like 5-nines reliability.

cmooney@netops-clab:~$ grep -c "timed out" dig_checks_again2.txt 
2
cmooney@netops-clab:~$ grep -c "Query time" dig_checks_again2.txt 
1643908

Time Outs

Of the two that timed out (dig logs here: P71733), the first was query id 0x4edb at 08:20:44 (Dec 20th). The capture shows this packet was created, and indeed it was re-transmitted two more times, 5 seconds later and 10 seconds later. None were answered. The other one that went unanswered was a short time afterwards, query 0xc2a3 at 08:21:00. Same pattern. I wasn't capturing on cloudservices1006 unfortunately to see if these were received. We can perhaps do it again if we deem it worthwhile.

Odd that these come closely together in time, and that we have effectively 3 missed packets in total each time. @Andrew do any logs on the dns server show issues around then? I see a small spike in cpu around that time, but these are common enough and the busiest CPU core still only reported 10% use over the time on average.

What is also interesting, however, is that other DNS queries were sent, and got answers, during the same window (these weren't part of my test, just the system itself or other things running on it doing so). So during the 15-second window in which the 3 unanswered queries were sent, we had other queries sent and answers received. However, I do see one re-transmitted request due to not receiving an answer within 1 second, so it does maybe seem the system was under strain at that point.

Original Problem

My question would still be whether that level of failure (1 in a million queries) could account for the issues we have seen? It seems unlikely to me that that level of failure would cause these noticeable user-facing issues. Broadly my thinking remains:

  • The reliability of the network and the dns service overall is very high, there is not evidence of widespread problems
    • If we send millions of queries over an extended period a handful go unanswered but 99.999% are ok
  • That level of failure seems unlikely to be the root cause of the build problems described
  • It seems likely, rather than a general problem, there is a complicating factor somewhere on specific systems, interfaces, applications etc
    • We should make our focus the specific systems reporting the dns lookup failures
    • We should also analyse the timing of problems closely to see if perhaps there is some pattern in that dimension

I'm willing to entertain that there may be some transient problem very occasionally, in which case let's look for a time pattern. But there is definitely no overall, constant networking problem, so let's home in on the particular issue and systems reported.

Thank you for pursuing this, @cmooney

I don't see anything interesting in the pdns-recursor server logs at 08:20. The noise in this log snippet seems to be from an unrelated badly-behaved client.

Dec 20 08:12:28 cloudservices1006 pdns-recursor[963467]: msg="Sending SERVFAIL during resolve" error="Too much time waiting for civilwarwiki.net|A, timeouts: 5, throttles: 2, queries: 6, 7514msec" subsystem="syncres" level="0" prio="Notice" tid="7" ts="1734682348.575" ecs="" mtid="5873662" proto="udp" qname="civilwarwiki.net" qtype="A" remote="185.15.56.22:65240"
Dec 20 08:16:27 cloudservices1006 pdns-recursor[963467]: msg="Discarding unexpected packet" subsystem="in" level="0" prio="Warning" tid="12" ts="1734682587.653" from="199.167.66.1" proto="udp" qname="www.thehenryfor\\000.\\000.\\000" qtype="TYPE10512" waiters="1"
Dec 20 08:16:28 cloudservices1006 pdns-recursor[963467]: msg="Discarding unexpected packet" subsystem="in" level="0" prio="Warning" tid="12" ts="1734682588.889" from="199.167.66.2" proto="udp" qname="www.thehenryfor\\000.\\000.\\000" qtype="TYPE10512" waiters="2"
Dec 20 08:25:14 cloudservices1006 pdns-recursor[963467]: msg="Periodic statistics report" subsystem="stats" level="0" prio="Info" tid="0" ts="1734683114.176" cache-entries="931543" negcache-entries="4057" questions="385533302" record-cache-acquired="1247674657" record-cache-contended="4334759" record-cache-contended-perc="0.347427" record-cache-hitratio-perc="75"
Dec 20 08:25:14 cloudservices1006 pdns-recursor[963467]: msg="Periodic statistics report" subsystem="stats" level="0" prio="Info" tid="0" ts="1734683114.176" edns-entries="397" failed-host-entries="142" non-resolving-nameserver-entries="3" nsspeed-entries="6160" outqueries-per-query="6" saved-parent-ns-sets-entries="3736" throttle-entries="46"

I also don't see evidence that there was a service failover or unusual network traffic/dns traffic at that time.

A CI job failed with a dns failure at 8:13

Not exactly the same moment as your timeouts.

2024-12-20 08:13:05.540382101

Change #1105945 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloud-vps: increase # of attempts with dns resolving

https://gerrit.wikimedia.org/r/1105945

@Andrew it also occurs to me if anyone has done any analysis to see if there is a pattern in which cloudservices node the hosts in question are hitting?

If you're not aware, the anycast setup here means that devices in C8/D5 will route to the local cloudservices host in their own rack when sending traffic to 172.20.255.1. For hosts in E4/F4 traffic is load balanced based on source IP, so one host will hit 1005 and another 1006.

cmooney@cloudvirt1063:~$ dig +nsid www.google.ie @172.20.255.1 

; <<>> DiG 9.18.28-1~deb12u2-Debian <<>> +nsid www.google.ie @172.20.255.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 8140
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
; NSID: 63 6c 6f 75 64 73 65 72 76 69 63 65 73 31 30 30 35 ("cloudservices1005")
;; QUESTION SECTION:
;www.google.ie.			IN	A

;; ANSWER SECTION:
www.google.ie.		266	IN	A	172.253.115.94

;; Query time: 0 msec
;; SERVER: 172.20.255.1#53(172.20.255.1) (UDP)
;; WHEN: Sat Dec 21 11:36:11 UTC 2024
;; MSG SIZE  rcvd: 79
cmooney@cloudvirt1064:~$ dig +nsid www.google.ie @172.20.255.1

; <<>> DiG 9.18.28-1~deb12u2-Debian <<>> +nsid www.google.ie @172.20.255.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 32789
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
; NSID: 63 6c 6f 75 64 73 65 72 76 69 63 65 73 31 30 30 36 ("cloudservices1006")
;; QUESTION SECTION:
;www.google.ie.			IN	A

;; ANSWER SECTION:
www.google.ie.		300	IN	A	142.251.163.94

;; Query time: 4 msec
;; SERVER: 172.20.255.1#53(172.20.255.1) (UDP)
;; WHEN: Sat Dec 21 11:36:22 UTC 2024
;; MSG SIZE  rcvd: 79

From a VPS instance the traffic routes out via the active cloudgw, so for instance in the tests I did from netops-clab (on cloudvirt1046 in D5) routed via cloudgw1001 in rack C8 and thus hit cloudservices1006 in C8.

It may be an idea in the new year to try and introduce some diversity here. We could announce a second anycast IP from both cloudservices nodes to be used for DNS queries, and update resolv.conf on instances to use both. We can use a few BGP knobs then to make one cloudservice node primary for one IP and vice versa. cc @aborrero

As I said on the patch I'm not sure increasing the number of retries is the way to go. We really shouldn't tolerate failures. But in keeping with DNS principles it does help if any retries that are sent land on another server.

Change #1106324 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] role/profile for a simple vps-hosted dns recursor

https://gerrit.wikimedia.org/r/1106324

Change #1106324 merged by Andrew Bogott:

[operations/puppet@production] role/profile for a simple vps-hosted dns recursor

https://gerrit.wikimedia.org/r/1106324

For science (and to be totally sure that this failure is upstream from the integration project), I have made a project-local dns recursor in the Integration project.

At the moment it will only work for clients within Integration. To use it, add this to hiera for a VM or prefix:

profile::resolving::nameservers:
- integration-dnsrecursor-temp.integration.eqiad1.wikimedia.cloud

This is not meant as a long-term solution, but I'm interested in whether or not it resolves the test failures. If it does then it gives us a temporary fix and also will help convince me that whatever is going on is outside the Integration project and (likely) outside the cloud-vps network entirely (or on the border between the cloudvps network and prod hosts).

@Andrew we have the same DNS issue for queries originating from Beta-Cluster-Infrastructure but it surely does not harm to rule it out. Also if the issue occurs there as well, it might be easier to debug given it would only have queries from a single project. So please do :)

We have had the temporary DNS issues in other projects (not integration) as well in the past. And it doesn't seem to match the super low number (five nines) that Cathal saw in his tests. I am wondering if there is something a certain subset of projects has in common that other projects don't have.

We're now using the project-local recursor for integration-agent-docker-*. This won't really give us an answer but it will at least provide a bit more diagnostic info.

I've also been poring over network and dns dashboards again trying to find a hint of distress to correspond with these failure clusters:

2025-01-02 14:24:13.357124596
2025-01-02 14:24:16.201141502
2025-01-02 14:24:20.817168955
2025-01-02 14:46:31.533503176
2025-01-02 14:46:33.581516283
2025-01-02 14:46:36.309533743
2025-01-02 14:46:47.337604333

but I don't see anything at all interesting going on.

For those following along with extra energy, here are things I'd be interested in:

  • Are failures within a given project evenly distributed over all the VMs, or concentrated on a few?
  • Can anyone provide me with a cli test that reproduces the non-DNS network failures that we seem to be seeing on toolforge?

We created a silly script to count udp packets xd

dcaro@tools-bastion-13:~$ cat count_udp.sh 
last=0
nc -k -u -l 1234 \
| while read next; do
    echo "Got $next"
    last="$(($last + 1))"
    if [[ "$last" != "$next" ]]; then
        echo "Got error: $next"
    fi
    last=$next
done

On the client side:

dcaro@tools-bastion-13:~$ for i in 0 $(seq 100); do echo "$i" | nc -q0 -u 127.0.0.1 1234; sleep 1; done

We can try using it to see if there are any missing/dropped packets between the hosts (DNS recursor/client/etc.)

I ran that script for a very long time yesterday (between cloudcontrol1005 and util-abogott-bookworm) and it didn't record any failures. It is really hard to find evidence that this is a general network failure.

Recursor queries average between 2k and 2.5k per second. When I force that to spike up to 4.3k/second the server handles things fine, with no sign of failure. I also don't see any correspondence between failures and failover between the two servers, or between failures and spikes in traffic.

For better or for worse, I've eliminated my own best test case for this. I assume the issue is still present in non-Integration projects but I don't have a great way to assess. @hashar do you have a good way for me to enumerate dns and/or network failures in deployment-prep?

I could of course switch Integration back off the local recursor but it seems rude to break tests.

I retried the api server test (kubectl logs --timestamps -n kube-system pod/kube-apiserver-tools-k8s-control-7 | grep timeout) and don't see any failures currently. It is somewhat possible that our new cloudgw node (T382356) has fixed this issue. Is anyone seeing it anymore?

We've now replaced both cloudgw nodes.

I've removed the special dns-recursor in the integration project. We will see if these failures return.

This seems to have been fixed with the cloudgw replacement.

Change #1105945 abandoned by Andrew Bogott:

[operations/puppet@production] cloud-vps: increase # of attempts with dns resolving

Reason:

With T374830 resolved this is no longer interesting.

https://gerrit.wikimedia.org/r/1105945

We've now replaced both cloudgw nodes.
I've removed the special dns-recursor in the integration project.

Thank you for the replacement and to have cleaned up the one-off DNS config from integration :)

We will see if these failures return.

Looking at the last 7 days yields nothing:

ssh contint.wikimedia.org find /srv/jenkins/builds -maxdepth 3 -name log -ctime -7 -exec grep "'Could not resolve host'" {} \+

Thank you!!

Looking at the last 7 days yields nothing:

Thanks for double-checking, it's nice to have confirmation the things are now working properly.

Thanks a lot for your work, I can tell this was super difficult to figure out <3