
toolforge: ingress errors 2025-03-05
Closed, ResolvedPublic

Description

The Toolforge HTTP ingress setup had some problems today.

Alerts were flapping for haproxy backends going down, for example:

[screenshot: flapping haproxy backend alerts]

The trouble started today at ~1:00 am UTC:

https://tools-prometheus.wmflabs.org/tools/graph?g0.expr=haproxy_server_up%20%3C%201&g0.tab=0&g0.stacked=0&g0.show_exemplars=0&g0.range_input=12h

[screenshot: Prometheus graph of haproxy_server_up < 1 over the last 12h]

I was able to reproduce the issue by just refreshing any tool page, or doing a manual curl like:

root@tools-k8s-haproxy-5:~# time curl -v http://tools-k8s-ingress-9.tools.eqiad1.wikimedia.cloud:30002

The request sometimes timed out (but responded quickly at other times).
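The intermittent behaviour can be quantified with a small probe loop instead of manual curls. This is a sketch (the hostname and NodePort are taken from the curl above); it runs here against a local listener so it is self-contained:

```python
import socket

def probe(host: str, port: int, attempts: int = 10, timeout: float = 2.0):
    """Attempt TCP connections; return (ok, failed) counts."""
    ok = failed = 0
    for _ in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                ok += 1
        except OSError:  # covers timeouts as well as refusals
            failed += 1
    return ok, failed

# Self-contained demo against a local listener; during the incident the
# target would be tools-k8s-ingress-9.tools.eqiad1.wikimedia.cloud:30002.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(32)
ok, failed = probe("127.0.0.1", listener.getsockname()[1])
print(f"{ok} ok, {failed} failed")
listener.close()
```

A healthy backend should show all attempts succeeding; the intermittent failures seen here would show up as a nonzero `failed` count.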

There did not seem to be any issues with memory or CPU limits.

I restarted the pods from the ingress-nginx-gen2 deployment and things improved: I can no longer reproduce the issue manually, and haproxy retries and errors went down:

from https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1&from=now-24h&to=now&var-host=tools-k8s-haproxy-6&var-backend=All&var-frontend=All&var-server=All&var-code=All&var-interval=30s&refresh=5m
https://usercontent.irccloud-cdn.com/file/5pNj6dqz/image.png
https://usercontent.irccloud-cdn.com/file/5QJjU7QI/image.png

But alerts are still flapping.

Event Timeline

dcaro updated the task description.

None of the ingress nodes is running on cloudvirtt1039 (the one currently having conntrack issues), so those two things seem unrelated.

The backend retries and errors reported by haproxy have gone to 0:

[screenshot: haproxy backend retries and errors at 0]

But the backends are still flapping (being taken in and out of rotation); I was unable to make a connection fail manually:

dcaro@tools-k8s-haproxy-6:~$ count=0; while sleep 0.1; do nc -z tools-k8s-ingress-9.tools.eqiad1.wikimedia.cloud 30002 || break; count=$((count+1)); echo "$count passed"; done; echo "Passed $count times before failing";

We are going to try two things at the same time: reducing the logging (the journal is rotating every ~5-15 minutes due to haproxy being verbose) and restarting the service.
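One way to cut haproxy's log volume is to stop logging fully successful connections; a sketch of such a tweak (the actual change applied is not recorded in this task):

```
defaults
    # skip log lines for fully successful connections/requests;
    # errors, timeouts and retries are still logged
    option dontlog-normal
```

This keeps the error signal in the journal while dropping the bulk of the per-request noise.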

I see a bunch of TCP SYN packets from haproxy to the ingress worker that get no reply:

10:50:00.220094 ens3  In  IP tools-k8s-haproxy-5.tools.eqiad1.wikimedia.cloud.42240 > tools-k8s-ingress-8.tools.eqiad1.wikimedia.cloud.30003: Flags [S], seq 1363260621, win 42340, options [mss 1460,sackOK,TS val 1580764268 ecr 0,nop,wscale 9], length 0
10:50:01.225352 ens3  In  IP tools-k8s-haproxy-5.tools.eqiad1.wikimedia.cloud.42240 > tools-k8s-ingress-8.tools.eqiad1.wikimedia.cloud.30003: Flags [S], seq 1363260621, win 42340, options [mss 1460,sackOK,TS val 1580765274 ecr 0,nop,wscale 9], length 0
10:50:03.241313 ens3  In  IP tools-k8s-haproxy-5.tools.eqiad1.wikimedia.cloud.42240 > tools-k8s-ingress-8.tools.eqiad1.wikimedia.cloud.30003: Flags [S], seq 1363260621, win 42340, options [mss 1460,sackOK,TS val 1580767290 ecr 0,nop,wscale 9], length 0
10:50:06.222719 ens3  In  IP tools-k8s-haproxy-6.tools.eqiad1.wikimedia.cloud.36380 > tools-k8s-ingress-8.tools.eqiad1.wikimedia.cloud.30003: Flags [S], seq 1118636180, win 42340, options [mss 1460,sackOK,TS val 2539480610 ecr 0,nop,wscale 9], length 0
10:50:07.239059 ens3  In  IP tools-k8s-haproxy-6.tools.eqiad1.wikimedia.cloud.36380 > tools-k8s-ingress-8.tools.eqiad1.wikimedia.cloud.30003: Flags [S], seq 1118636180, win 42340, options [mss 1460,sackOK,TS val 2539481627 ecr 0,nop,wscale 9], length 0
10:50:07.369385 ens3  In  IP tools-k8s-haproxy-5.tools.eqiad1.wikimedia.cloud.42240 > tools-k8s-ingress-8.tools.eqiad1.wikimedia.cloud.30003: Flags [S], seq 1363260621, win 42340, options [mss 1460,sackOK,TS val 1580771418 ecr 0,nop,wscale 9], length 0
10:50:08.215759 ens3  In  IP tools-k8s-haproxy-5.tools.eqiad1.wikimedia.cloud.42254 > tools-k8s-ingress-8.tools.eqiad1.wikimedia.cloud.30003: Flags [S], seq 3791284794, win 42340, options [mss 1460,sackOK,TS val 1580772264 ecr 0,nop,wscale 9], length 0
10:50:09.225321 ens3  In  IP tools-k8s-haproxy-5.tools.eqiad1.wikimedia.cloud.42254 > tools-k8s-ingress-8.tools.eqiad1.wikimedia.cloud.30003: Flags [S], seq 3791284794, win 42340, options [mss 1460,sackOK,TS val 1580773274 ecr 0,nop,wscale 9], length 0
10:50:09.255138 ens3  In  IP tools-k8s-haproxy-6.tools.eqiad1.wikimedia.cloud.36380 > tools-k8s-ingress-8.tools.eqiad1.wikimedia.cloud.30003: Flags [S], seq 1118636180, win 42340, options [mss 1460,sackOK,TS val 2539483643 ecr 0,nop,wscale 9], length 0
10:50:11.241445 ens3  In  IP tools-k8s-haproxy-5.tools.eqiad1.wikimedia.cloud.42254 > tools-k8s-ingress-8.tools.eqiad1.wikimedia.cloud.30003: Flags [S], seq 3791284794, win 42340, options [mss 1460,sackOK,TS val 1580775290 ecr 0,nop,wscale 9], length 0
10:50:13.216949 ens3  In  IP tools-k8s-haproxy-5.tools.eqiad1.wikimedia.cloud.55942 > tools-k8s-ingress-8.tools.eqiad1.wikimedia.cloud.30003: Flags [S], seq 4264134500, win 42340, options [mss 1460,sackOK,TS val 1580777265 ecr 0,nop,wscale 9], length 0
10:50:13.415139 ens3  In  IP tools-k8s-haproxy-6.tools.eqiad1.wikimedia.cloud.36380 > tools-k8s-ingress-8.tools.eqiad1.wikimedia.cloud.30003: Flags [S], seq 1118636180, win 42340, options [mss 1460,sackOK,TS val 2539487803 ecr 0,nop,wscale 9], length 0
10:50:14.217435 ens3  In  IP tools-k8s-haproxy-5.tools.eqiad1.wikimedia.cloud.55942 > tools-k8s-ingress-8.tools.eqiad1.wikimedia.cloud.30003: Flags [S], seq 4264134500, win 42340, options [mss 1460,sackOK,TS val 1580778266 ecr 0,nop,wscale 9], length 0
10:50:15.221783 ens3  In  IP tools-k8s-haproxy-5.tools.eqiad1.wikimedia.cloud.55958 > tools-k8s-ingress-8.tools.eqiad1.wikimedia.cloud.30003: Flags [S], seq 4290929465, win 42340, options [mss 1460,sackOK,TS val 1580779270 ecr 0,nop,wscale 9], length 0

We restarted a few api-gateway pods and things started to work again. The current theory is that restarting some pods flushed some rules or state, fixing whatever the underlying problem was.

Change #1124756 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] toolforge: haproxy: check ingress workers with the /healthz endpoint

https://gerrit.wikimedia.org/r/1124756

aborrero changed the task status from Open to In Progress. Mar 5 2025, 3:03 PM
aborrero triaged this task as Medium priority.

Change #1124756 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] toolforge: haproxy: check ingress workers with the /healthz endpoint

https://gerrit.wikimedia.org/r/1124756

Change #1124829 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] toolforge: haproxy: don't use TLS on the HTTP check for k8s-ingress

https://gerrit.wikimedia.org/r/1124829

Change #1124829 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] toolforge: haproxy: don't use TLS on the HTTP check for k8s-ingress

https://gerrit.wikimedia.org/r/1124829
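Together, the two patches amount to an haproxy health check along these lines (the backend and server names here are illustrative, not taken from the actual puppet code):

```
backend k8s-ingress
    # probe the ingress-nginx /healthz endpoint over plain HTTP,
    # rather than a bare TCP check or an HTTPS check
    option httpchk GET /healthz
    http-check expect status 200
    server tools-k8s-ingress-8 tools-k8s-ingress-8.tools.eqiad1.wikimedia.cloud:30003 check
```

An application-level check like this should mark an ingress worker down when nginx itself stops answering, instead of only when the TCP port stops accepting connections.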

aborrero claimed this task.