[toolforge] several tools get periods of connection refused (104) when connecting to wikis
Closed, Resolved · Public

Description

We are seeing some tools getting intermittent 104 (connection reset by peer) errors, lasting several minutes and affecting several wikis at a time.

See the child tasks for details.

This is caused by hitting the hard limit of 500 concurrent connections per source IP (i.e. per k8s worker node) to the CDN nodes, which are shared by all the wikis.

Things to do:

Event Timeline

It seems both tools (listeria and chie-bot) are running on the new nfs k8s workers:

root@tools-k8s-control-6:~# kubectl describe -n tool-listeria pods | grep worker
Node:         tools-k8s-worker-nfs-6/172.16.7.138
Node:         tools-k8s-worker-nfs-5/172.16.6.69


root@tools-k8s-control-6:~# kubectl get event  -n tool-chie-bot | grep worker
6m25s       Normal    Scheduled              pod/job-archive-28443480-rvkhk         Successfully assigned tool-chie-bot/job-archive-28443480-rvkhk to tools-k8s-worker-nfs-3
27m         Normal    Scheduled              pod/job-dewikify-28443459-xbp4m        Successfully assigned tool-chie-bot/job-dewikify-28443459-xbp4m to tools-k8s-worker-nfs-4
56m         Normal    Scheduled              pod/job-dykc-only-new-28443430-t2wzm   Successfully assigned tool-chie-bot/job-dykc-only-new-28443430-t2wzm to tools-k8s-worker-nfs-2
51m         Normal    Scheduled              pod/job-dykc-only-new-28443435-4swrx   Successfully assigned tool-chie-bot/job-dykc-only-new-28443435-4swrx to tools-k8s-worker-nfs-6
46m         Normal    Scheduled              pod/job-dykc-only-new-28443440-svh46   Successfully assigned tool-chie-bot/job-dykc-only-new-28443440-svh46 to tools-k8s-worker-nfs-4
41m         Normal    Scheduled              pod/job-dykc-only-new-28443445-2ffxg   Successfully assigned tool-chie-bot/job-dykc-only-new-28443445-2ffxg to tools-k8s-worker-nfs-6
36m         Normal    Scheduled              pod/job-dykc-only-new-28443450-t5h9s   Successfully assigned tool-chie-bot/job-dykc-only-new-28443450-t5h9s to tools-k8s-worker-nfs-4
31m         Normal    Scheduled              pod/job-dykc-only-new-28443455-d7sdb   Successfully assigned tool-chie-bot/job-dykc-only-new-28443455-d7sdb to tools-k8s-worker-nfs-6
26m         Normal    Scheduled              pod/job-dykc-only-new-28443460-xsddm   Successfully assigned tool-chie-bot/job-dykc-only-new-28443460-xsddm to tools-k8s-worker-nfs-2
21m         Normal    Scheduled              pod/job-dykc-only-new-28443465-8s5l8   Successfully assigned tool-chie-bot/job-dykc-only-new-28443465-8s5l8 to tools-k8s-worker-nfs-4
16m         Normal    Scheduled              pod/job-dykc-only-new-28443470-p86qf   Successfully assigned tool-chie-bot/job-dykc-only-new-28443470-p86qf to tools-k8s-worker-nfs-3
11m         Normal    Scheduled              pod/job-dykc-only-new-28443475-h6wlb   Successfully assigned tool-chie-bot/job-dykc-only-new-28443475-h6wlb to tools-k8s-worker-nfs-4
6m42s       Normal    Scheduled              pod/job-dykc-only-new-28443480-bzzsq   Successfully assigned tool-chie-bot/job-dykc-only-new-28443480-bzzsq to tools-k8s-worker-nfs-6
101s        Normal    Scheduled              pod/job-dykc-only-new-28443485-lfzw7   Successfully assigned tool-chie-bot/job-dykc-only-new-28443485-lfzw7 to tools-k8s-worker-nfs-2
16m         Normal    Scheduled              pod/job-fls-28443470-m48qm             Successfully assigned tool-chie-bot/job-fls-28443470-m48qm to tools-k8s-worker-nfs-3
56m         Normal    Scheduled              pod/job-itns-28443430-t6mkc            Successfully assigned tool-chie-bot/job-itns-28443430-t6mkc to tools-k8s-worker-nfs-1
51m         Normal    Scheduled              pod/job-itns-28443435-98twj            Successfully assigned tool-chie-bot/job-itns-28443435-98twj to tools-k8s-worker-nfs-2
46m         Normal    Scheduled              pod/job-itns-28443440-mwcvc            Successfully assigned tool-chie-bot/job-itns-28443440-mwcvc to tools-k8s-worker-nfs-6
41m         Normal    Scheduled              pod/job-itns-28443445-p4lvm            Successfully assigned tool-chie-bot/job-itns-28443445-p4lvm to tools-k8s-worker-nfs-5
36m         Normal    Scheduled              pod/job-itns-28443450-fqqdr            Successfully assigned tool-chie-bot/job-itns-28443450-fqqdr to tools-k8s-worker-nfs-2
31m         Normal    Scheduled              pod/job-itns-28443455-2c9zc            Successfully assigned tool-chie-bot/job-itns-28443455-2c9zc to tools-k8s-worker-nfs-6
26m         Normal    Scheduled              pod/job-itns-28443460-v52wl            Successfully assigned tool-chie-bot/job-itns-28443460-v52wl to tools-k8s-worker-nfs-5
21m         Normal    Scheduled              pod/job-itns-28443465-5pxkc            Successfully assigned tool-chie-bot/job-itns-28443465-5pxkc to tools-k8s-worker-nfs-4
16m         Normal    Scheduled              pod/job-itns-28443470-v246z            Successfully assigned tool-chie-bot/job-itns-28443470-v246z to tools-k8s-worker-nfs-3
11m         Normal    Scheduled              pod/job-itns-28443475-llkxb            Successfully assigned tool-chie-bot/job-itns-28443475-llkxb to tools-k8s-worker-nfs-6
6m43s       Normal    Scheduled              pod/job-itns-28443480-btpsm            Successfully assigned tool-chie-bot/job-itns-28443480-btpsm to tools-k8s-worker-nfs-2
100s        Normal    Scheduled              pod/job-itns-28443485-bxcvj            Successfully assigned tool-chie-bot/job-itns-28443485-bxcvj to tools-k8s-worker-nfs-2
53m         Normal    Scheduled              pod/job-rfd-28443433-9h5hs             Successfully assigned tool-chie-bot/job-rfd-28443433-9h5hs to tools-k8s-worker-nfs-6
43m         Normal    Scheduled              pod/job-rfd-28443443-227vm             Successfully assigned tool-chie-bot/job-rfd-28443443-227vm to tools-k8s-worker-nfs-4
33m         Normal    Scheduled              pod/job-rfd-28443453-2z9gw             Successfully assigned tool-chie-bot/job-rfd-28443453-2z9gw to tools-k8s-worker-nfs-4
23m         Normal    Scheduled              pod/job-rfd-28443463-mbv9v             Successfully assigned tool-chie-bot/job-rfd-28443463-mbv9v to tools-k8s-worker-nfs-6
13m         Normal    Scheduled              pod/job-rfd-28443473-q5dbd             Successfully assigned tool-chie-bot/job-rfd-28443473-q5dbd to tools-k8s-worker-nfs-4
3m41s       Normal    Scheduled              pod/job-rfd-28443483-wfczh             Successfully assigned tool-chie-bot/job-rfd-28443483-wfczh to tools-k8s-worker-nfs-6
9m42s       Normal    Scheduled              pod/job-sas-28443477-p7556             Successfully assigned tool-chie-bot/job-sas-28443477-p7556 to tools-k8s-worker-nfs-6

And there are no reports from other parties on the prod side.
The IPs for chie-bot from the haproxy side: https://w.wiki/8$mx

I'm running a script to try to trigger the issue and capture a pcap:

root@tools-k8s-control-6:~# kubectl run -ti --rm --image docker-registry.tools.wmflabs.org/toolforge-bash:5.1.4 --command bash --overrides='{"apiVersion": "v1", "spec": {"nodeSelector": { "kubernetes.io/hostname": "tools-k8s-worker-nfs-6" }}}'
...
# install tcpdump and curl with apk add <pkg>
...

# cat check_wiki_connections.sh 
#!/usr/bin/env bash

WIKI_URL='https://commons.wikimedia.org/w/api.php?action=query&siprop=general%7Cnamespaces%7Cnamespacealiases%7Clibraries%7Cextensions%7Cstatistics&format=json&meta=siteinfo'


main() {
    local tcpdump_pid
    local return_code
    while true; do
        # start a capture in the background; test.pcap is overwritten on every
        # iteration, so it only survives for the request that fails
        tcpdump -nvvi any -s 2048 -w test.pcap &
        tcpdump_pid="$!"
        sleep .5
        # body goes to ./stdout, curl's verbose output to ./stderr, and only
        # the HTTP status code is captured in $return_code
        return_code="$(curl -v "$WIKI_URL" -o stdout --write-out "%{http_code}" 2>stderr)"
        sleep .5
        if [[ "$return_code" != "200" ]]; then
            echo "Got one!"
            kill "$tcpdump_pid"
            return
        fi
        kill "$tcpdump_pid"
        echo "Nothing new... waiting 60s"
        sleep 60
    done
}


main

Let's see if I get anything.
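
If the script does catch a failure, a quick way to look at the saved capture could be something like this (just a sketch, assuming a standard tcpdump build; test.pcap is the file written by the script above):

# read the capture back and show only packets with RST set, plus any ICMP,
# to confirm whether the 104 arrives as a TCP reset from the CDN side
tcpdump -nr test.pcap '(tcp[tcpflags] & tcp-rst != 0) or icmp'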

The last time this happened for me was 03 Feb 2024 03:15 UTC.

Unfortunately, I was not able to capture that one :/ (the script did not fail); thinking about how to debug this in more detail.

@Leloiandudu do you have the snippet of code that does the request? or even better, the request itself?

dcaro triaged this task as High priority. Feb 7 2024, 10:12 AM

I have found that it also happens on old nodes, for example rustbot (from the listeria tool):

tools.listeria@tools-sgebastion-10:~$ tail -f rustbot.err 
...
Started tailing here at 11:25:00 CEST

Error: error sending request for url (https://commons.wikimedia.org/w/api.php?action=query&meta=siteinfo&format=json&siprop=general%7Cnamespaces%7Cnamespacealiases%7Clibraries%7Cextensions%7Cstatistics): operation timed out
Error: error sending request for url (https://commons.wikimedia.org/w/api.php?action=query&format=json&meta=siteinfo&siprop=general%7Cnamespaces%7Cnamespacealiases%7Clibraries%7Cextensions%7Cstatistics): operation timed out
Error: error sending request for url (https://commons.wikimedia.org/w/api.php?action=query&format=json&siprop=general%7Cnamespaces%7Cnamespacealiases%7Clibraries%7Cextensions%7Cstatistics&meta=siteinfo): connection error: Connection reset by peer (os error 104)
Error: error sending request for url (https://commons.wikimedia.org/w/api.php?action=query&siprop=general%7Cnamespaces%7Cnamespacealiases%7Clibraries%7Cextensions%7Cstatistics&meta=siteinfo&format=json): connection error: Connection reset by peer (os error 104)
Error: error sending request for url (https://commons.wikimedia.org/w/api.php?action=query&format=json&siprop=general%7Cnamespaces%7Cnamespacealiases%7Clibraries%7Cextensions%7Cstatistics&meta=siteinfo): operation timed out
Error: error sending request for url (https://commons.wikimedia.org/w/api.php?siprop=general%7Cnamespaces%7Cnamespacealiases%7Clibraries%7Cextensions%7Cstatistics&format=json&meta=siteinfo&action=query): connection error: Connection reset by peer (os error 104)
Error: error sending request for url (https://commons.wikimedia.org/w/api.php?action=query&siprop=general%7Cnamespaces%7Cnamespacealiases%7Clibraries%7Cextensions%7Cstatistics&meta=siteinfo&format=json): connection error: Connection reset by peer (os error 104)
root@tools-k8s-control-6:~# kubectl get pod -o yaml -n tool-listeria rustbot-6548cb7b94-749wm | grep worker
  nodeName: tools-k8s-worker-43

Looking into it now, as I have the issue happening live.

Hmpf... it was just killed by OOM:

[Wed Feb  7 10:26:56 2024] Memory cgroup out of memory: Kill process 2443 (bot) score 1684 or sacrifice child
[Wed Feb  7 10:26:56 2024] Killed process 2443 (bot) total-vm:7466120kB, anon-rss:5088560kB, file-rss:8344kB, shmem-rss:0kB

I was able to run a pcap on a full failing request:

tools.listeria@rustbot-6548cb7b94-749wm:~$ curl -v 'https://commons.wikimedia.org/w/api.php?action=query&meta=siteinfo&siprop=general%7Cnamespaces%7Cnamespacealiases%7Clibraries%7Cextensions%7Cstatistics&format=json' -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 208.80.154.224:443...
* Connected to commons.wikimedia.org (208.80.154.224) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [122 bytes data]
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
{ [19 bytes data]
* TLSv1.3 (IN), TLS handshake, Certificate (11):
{ [4492 bytes data]
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
{ [79 bytes data]
* TLSv1.3 (IN), TLS handshake, Finished (20):
{ [52 bytes data]
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.3 (OUT), TLS handshake, Finished (20):
} [52 bytes data]
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=*.wikipedia.org
*  start date: Dec 20 05:26:47 2023 GMT
*  expire date: Mar 19 05:26:46 2024 GMT
*  subjectAltName: host "commons.wikimedia.org" matched cert's "*.wikimedia.org"
*  issuer: C=US; O=Let's Encrypt; CN=R3
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
} [5 bytes data]
* Using Stream ID: 1 (easy handle 0x55f9048fe310)
} [5 bytes data]
> GET /w/api.php?action=query&meta=siteinfo&siprop=general%7Cnamespaces%7Cnamespacealiases%7Clibraries%7Cextensions%7Cstatistics&format=json HTTP/2
> Host: commons.wikimedia.org
> user-agent: curl/7.74.0
> accept: */*
> 
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [265 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [265 bytes data]
* old SSL session ID is stale, removing
{ [5 bytes data]
* Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
} [5 bytes data]
  0     0    0     0    0     0      0      0 --:--:--  0:03:32 --:--:--     0* OpenSSL SSL_read: Connection reset by peer, errno 104
* Failed receiving HTTP2 data
} [5 bytes data]
* OpenSSL SSL_write: Broken pipe, errno 32
* Failed sending HTTP2 data
  0     0    0     0    0     0      0      0 --:--:--  0:03:32 --:--:--     0
* Connection #0 to host commons.wikimedia.org left intact
curl: (56) OpenSSL SSL_read: Connection reset by peer, errno 104

That's from within the rustbot container, but it also happens from outside the container, directly from the worker itself (tools-k8s-worker-43).

Trying from a different worker works without issues:

dcaro@tools-k8s-worker-44:~$ curl -v 'https://commons.wikimedia.org/w/api.php?action=query&meta=siteinfo&siprop=general%7Cnamespaces%7Cnamespacealiases%7Clibraries%7Cextensions%7Cstatistics&format=json' -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 208.80.154.224...
* TCP_NODELAY set
* Connected to commons.wikimedia.org (208.80.154.224) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [122 bytes data]
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
{ [19 bytes data]
* TLSv1.3 (IN), TLS handshake, Certificate (11):
{ [4492 bytes data]
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
{ [79 bytes data]
* TLSv1.3 (IN), TLS handshake, Finished (20):
{ [52 bytes data]
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.3 (OUT), TLS handshake, Finished (20):
} [52 bytes data]
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=*.wikipedia.org
*  start date: Dec 20 05:26:47 2023 GMT
*  expire date: Mar 19 05:26:46 2024 GMT
*  subjectAltName: host "commons.wikimedia.org" matched cert's "*.wikimedia.org"
*  issuer: C=US; O=Let's Encrypt; CN=R3
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
} [5 bytes data]
* Using Stream ID: 1 (easy handle 0x5654d4842110)
} [5 bytes data]
> GET /w/api.php?action=query&meta=siteinfo&siprop=general%7Cnamespaces%7Cnamespacealiases%7Clibraries%7Cextensions%7Cstatistics&format=json HTTP/2
> Host: commons.wikimedia.org
> User-Agent: curl/7.64.0
> Accept: */*
> 
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [265 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [265 bytes data]
* old SSL session ID is stale, removing
{ [5 bytes data]
* Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
} [5 bytes data]
< HTTP/2 200 
< date: Wed, 07 Feb 2024 10:38:00 GMT
< server: mw1400.eqiad.wmnet
< x-content-type-options: nosniff
< x-search-id: 7so1mcfzf5wz7h6ut5r0bnzgy
< x-frame-options: DENY
< content-disposition: inline; filename=api-result.json
< vary: Accept-Encoding,Treat-as-Untrusted,X-Forwarded-Proto,Cookie
< cache-control: private, must-revalidate, max-age=0
< content-type: application/json; charset=utf-8
< age: 0
< x-cache: cp1108 miss, cp1108 pass
< x-cache-status: pass
< server-timing: cache;desc="pass", host;desc="cp1108"
< strict-transport-security: max-age=106384710; includeSubDomains; preload
< report-to: { "group": "wm_nel", "max_age": 604800, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
< nel: { "report_to": "wm_nel", "max_age": 604800, "failure_fraction": 0.05, "success_fraction": 0.0}
< set-cookie: WMF-Last-Access=07-Feb-2024;Path=/;HttpOnly;secure;Expires=Sun, 10 Mar 2024 00:00:00 GMT
< x-client-ip: 172.16.1.118
< set-cookie: GeoIP=:::::v4; Path=/; secure; Domain=.wikimedia.org
< set-cookie: NetworkProbeLimit=0.001;Path=/;Secure;Max-Age=3600
< accept-ranges: bytes
< 
{ [13669 bytes data]
100 86007    0 86007    0     0   839k      0 --:--:-- --:--:-- --:--:--  839k
* Connection #0 to host commons.wikimedia.org left intact

The mtr output from an affected worker (43) and from a non-affected one (44) looks the same:

### 43
root@tools-k8s-worker-43:~# mtr -P 443 -T -w 208.80.154.224
Start: 2024-02-07T10:41:54+0000
HOST: tools-k8s-worker-43                   Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- cloudinstances2b-gw.svc.eqiad.wmflabs  0.0%    10    0.7   0.5   0.3   0.7   0.1
  2.|-- 185.15.56.233                          0.0%    10    0.7   0.7   0.5   0.9   0.1
  3.|-- ???                                   100.0    10    0.0   0.0   0.0   0.0   0.0
  4.|-- xe-3-0-4-1102.cr1-eqiad.wikimedia.org  0.0%    10    1.6   1.4   0.6   4.3   1.1
  5.|-- text-lb.eqiad.wikimedia.org            0.0%    10    0.7   1.1   0.5   2.5   0.6

### 44
dcaro@tools-k8s-worker-44:~$ sudo mtr -P 443 -T -w 208.80.154.224
Start: 2024-02-07T10:42:02+0000
HOST: tools-k8s-worker-44                   Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- cloudinstances2b-gw.svc.eqiad.wmflabs  0.0%    10    0.7   0.7   0.4   0.8   0.1
  2.|-- 185.15.56.233                          0.0%    10    0.4   0.8   0.4   1.7   0.3
  3.|-- ???                                   100.0    10    0.0   0.0   0.0   0.0   0.0
  4.|-- xe-3-0-4-1102.cr1-eqiad.wikimedia.org  0.0%    10    0.7   2.1   0.7   6.3   1.8
  5.|-- text-lb.eqiad.wikimedia.org            0.0%    10    0.8   0.7   0.6   0.9   0.1

So it seems that the bots are triggering the silent-drop limits on the haproxies:

<vgutierrez> unles....
12:52:23 <vgutierrez> Feb 07 11:41:23 cp1106 haproxy[1857819]: [debug] silent-drop_for_300s: type=ipv4 <172.16.6.13>
12:52:57 <vgutierrez> the bot is so aggressive that's triggering the silent-drop protection mechanism from HAProxy

The limits have not been changed lately though
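
A quick way to see which worker IPs are tripping that protection (a sketch for whoever has access to the cp hosts; it assumes the haproxy logs are readable via journalctl under a unit named haproxy and keep the same format as the line quoted above):

# count silent-drop events per source IP over the last hour
journalctl -u haproxy --since '1 hour ago' \
    | grep -oP 'silent-drop_for_300s: type=ipv4 <\K[0-9.]+' \
    | sort | uniq -c | sort -rn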

So yes, we have a hard limit on the CDN layer of 500 concurrent connections per IP, and each worker node has only a single IP, so the aggregated traffic from all the tools on a single node has to stay under 500 concurrent connections, otherwise the whole node gets silently dropped for 300s at a time.
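
For reference, a rough way to check how close a worker is to that limit from the client side could be (a sketch; it assumes the text-lb VIP is still 208.80.154.224, as in the curl/mtr output above, and a reasonably recent ss):

# count established TCP connections from this worker to the eqiad text-lb VIP;
# anything approaching 500 risks the whole node being silently dropped
ss -Htn state established dst 208.80.154.224 | wc -l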

Maybe an idea: have a per-tool network quota for concurrent connections. We don't have any semantics in kubernetes/calico for implementing this though.
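
For the record, the closest node-level primitive I can think of is per-source-IP (so per-pod, not per-tool) connection limiting with the iptables connlimit match; purely an illustration of the kind of mechanism this would need, not something we could drop in as-is (it would have to coexist with the rules calico manages, and the 100 is an arbitrary number):

# illustrative only: reject new SYNs once a single pod IP has more than 100
# concurrent TCP connections towards the text-lb VIP
iptables -A FORWARD -d 208.80.154.224 -p tcp --syn \
    -m connlimit --connlimit-above 100 --connlimit-mask 32 --connlimit-saddr \
    -j REJECT --reject-with tcp-reset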

dcaro updated the task description.

Maybe an idea: have a per-tool network quota for concurrent connections. We don't have any semantics in kubernetes/calico for implementing this though.

Can you open a task with more details if you have a clear idea? I'm going to close this one for now, but it would be nice to have something better than us keeping an eye on the limits.

T363296: toolforge: explore options to introduce egress network quotas