
ProbeDown tools http_this_tool_does_not_exist_toolforge_org_ip4: tool fourohfour flapping
Closed, Resolved · Public

Description

Last week it flapped a few times, but this weekend it has been flapping way more often.

From emails:

[Cloud-admin-feed] [FIRING:1] ProbeDown tools (172.16.4.136 ip4 tools-k8s-haproxy-4:30000 probes/custom http_this_tool_does_not_exist_toolforge_org_ip4 warning prometheus wmcs)

[1] Firing
Labels
alertname = ProbeDown
address = 172.16.4.136
family = ip4
instance = tools-k8s-haproxy-4:30000
job = probes/custom
module = http_this_tool_does_not_exist_toolforge_org_ip4
project = tools
severity = warning
source = prometheus
team = wmcs
Annotations
dashboard = https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All
description = tools-k8s-haproxy-4:30000 failed when probed by http_this_tool_does_not_exist_toolforge_org_ip4 from . Availability is 0%.
logs = https://logstash.wikimedia.org/app/dashboards#/view/f3e709c0-a5f8-11ec-bf8e-43f1807d5bc2?_g=(filters:!((query:(match_phrase:(service.name:http_this_tool_does_not_exist_toolforge_org_ip4)))))
runbook = https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000
summary = Service tools-k8s-haproxy-4:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4)

Same also with tools-k8s-haproxy-3.
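
For context, the probe name suggests a blackbox-exporter HTTP module that requests a hostname for a tool that does not exist and expects the fourohfour tool to serve its 404 page. A minimal sketch of what such a module could look like (the contents are an assumption, not the actual WMCS probe config):

modules:
  http_this_tool_does_not_exist_toolforge_org_ip4:
    prober: http
    http:
      preferred_ip_protocol: "ip4"   # matches the _ip4 suffix of the module name
      valid_status_codes: [404]      # the fourohfour tool is expected to answer with a 404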

Event Timeline

dcaro changed the task status from Open to In Progress.
dcaro triaged this task as High priority.
dcaro moved this task from To refine to Doing on the User-dcaro board.

Found a 503 in the nginx proxy logs, but just one:

root@tools-k8s-control-4:~# kubectl logs -n ingress-nginx-gen2  deployment.apps/ingress-nginx-gen2-controller | grep fourohfour | grep '50[0-9]'
Found 3 pods, using pod/ingress-nginx-gen2-controller-5476b745d7-s87z4
192.168.247.64 - - [01/May/2023:11:35:36 +0000] "HEAD / HTTP/1.1" 503 0 "https://www.bing.com" "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.2 Safari/537.36" 378 0.119 [tool-fourohfour-fourohfour-8000] [] 192.168.21.207:8000 0 0.116 503 6c3102f122c1cfe18040611c512209c1

Curling that IP looks ok though

Oh, got some issues when curling:

root@tools-k8s-control-4:~# curl -X HEAD -v http://192.168.21.207:8000 > /dev/null
Warning: Setting custom HTTP method to HEAD with -X/--request may not work the
Warning: way you want. Consider using -I/--head instead.
* Expire in 0 ms for 6 (transfer 0x5565788b7110)
*   Trying 192.168.21.207...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x5565788b7110)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Connected to 192.168.21.207 (192.168.21.207) port 8000 (#0)
> HEAD / HTTP/1.1
> Host: 192.168.21.207:8000
> User-Agent: curl/7.64.0
> Accept: */*
>
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0< HTTP/1.1 404 NOT FOUND
< Content-Type: text/html; charset=utf-8
< Content-Length: 4255
<
  0  4255    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0{ [0 bytes data]
* transfer closed with 4255 bytes remaining to read
  0  4255    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
* Closing connection 0
curl: (18) transfer closed with 4255 bytes remaining to read

The connection was closed, and it took a while to reply
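
As the warning in the output says, -I/--head is the cleaner way to send a HEAD request; with -X HEAD curl still waits for a Content-Length-sized body, which is also why it reports the transfer as closed with bytes remaining. Something like:

curl -I -v http://192.168.21.207:8000/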

Oh... this is why:

tools.fourohfour@tools-sgebastion-10:~$ tail  -n 100 uwsgi.log
...
Mon May  1 13:18:47 2023 - *** uWSGI listen queue of socket ":8000" (fd: 4) full !!! (100/100) ***
Mon May  1 13:18:48 2023 - *** uWSGI listen queue of socket ":8000" (fd: 4) full !!! (101/100) ***
Mon May  1 13:18:49 2023 - *** uWSGI listen queue of socket ":8000" (fd: 4) full !!! (100/100) ***
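
For context, the (100/100) in those lines matches uWSGI's default listen backlog of 100. It could be raised with the listen option (capped by the kernel's net.core.somaxconn), though the mitigation tried here was adding replicas instead. An illustrative snippet, not the tool's actual config:

; uwsgi.ini (illustrative)
[uwsgi]
listen = 1024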

Mentioned in SAL (#wikimedia-cloud) [2023-05-01T13:27:37Z] <dcaro> scaled up the deployment to 4 replicas (T335680)
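
The scale-up was presumably something along these lines (namespace and deployment name are inferred from the tool-fourohfour-fourohfour-8000 backend in the ingress log, so treat them as assumptions):

kubectl -n tool-fourohfour scale deployment fourohfour --replicas=4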

That seemed to help; let's wait a bit and see if it stabilizes.

Mentioned in SAL (#wikimedia-cloud) [2023-05-01T15:29:59Z] <dcaro> scaled up to 6 replicas (T335680)

While running strace on one of the processes on a tools-k8s-worker, I noticed that it was creating many LDAP connections, each taking a bit of time, so I sent a patch to avoid that.
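
The strace invocation was roughly of this shape (PID and filters are illustrative), the giveaway being repeated connect() calls to the LDAP servers:

# on the tools-k8s-worker running one of the fourohfour pods (illustrative)
strace -f -tt -e trace=network -p <uwsgi-worker-pid>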

There's definitely something happening every day at ~17:00 UTC; let's see if today is any better:

Screenshot from 2023-05-02 11-26-44.png (attached image, 72 KB)

So no alerts got triggered, and CPU usage has become way lower (~20% of the original):

image.png (attached graph, 114 KB)

And memory usage has decreased too (~70% of the original):

image.png (attached graph, 153 KB)