
toolforge-legacy-redirector: constant failed probes by prometheus
Closed, Resolved · Public

Description

I suspect the toolforge-legacy-redirector apache server is struggling. The server gets a generous amount of req/s and only has 1 CPU, so we may need to scale and tune the setup a bit.

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2025-02-07T20:51:07Z] <arturo> resize tools-legacy-redirector to have 2 vCPU T385908

One option to reduce load on this box would be to add the hostnames it serves to the HSTS preload lists and then stop responding to anything on port 80. That should cut out a sizeable number of old, abandoned clients requesting map things, while still ensuring that we respond to browsers when someone follows an old link from somewhere.
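
A minimal sketch of the apache side of that, assuming mod_headers is enabled (values here are illustrative; getting onto the preload list also means submitting the hostnames at hstspreload.org, which requires a max-age of at least a year plus includeSubDomains and preload):

<VirtualHost *:443>
    ServerName tools.wmflabs.org
    # advertise HSTS so browsers stop trying port 80 at all
    Header always set Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"
</VirtualHost>

# once the hostnames are preloaded, the *:80 vhost (or the Listen 80 line)
# can be dropped entirely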

Change #1123797 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] toolforge_redirector: increase monitoring timeout

https://gerrit.wikimedia.org/r/1123797

It seems to be mostly https traffic that gets delayed (not sure yet why). @aborrero suggested skipping the local http->https redirect and redirecting http requests directly to https://*.toolforge.org; that would cover a lot of the traffic and relieve some of the load (though CPU-wise it seems OK right now, so it's not clear what resource is the bottleneck).
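
A rough sketch of that idea, assuming mod_rewrite and the usual tools.wmflabs.org/$tool/$path -> https://$tool.toolforge.org/$path mapping (the rules here are illustrative, not the actual redirector config):

<VirtualHost *:80>
    ServerName tools.wmflabs.org
    RewriteEngine On
    # send plain-http clients straight to the per-tool toolforge.org URL,
    # skipping the extra hop through the local https vhost
    RewriteRule ^/([^/]+)/(.*)$ https://$1.toolforge.org/$2 [R=302,L,NE]
    RewriteRule ^/([^/]+)$ https://$1.toolforge.org/ [R=302,L,NE]
</VirtualHost>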

> It seems to be mostly https traffic that gets delayed (not sure yet why)

Some of the tests:

# Quick'n'dirty stats on how many http requests arrive vs total traffic
root@tools-legacy-redirector-2:~# wc /var/log/apache2/tools.wmflabs.org-access.log
   3421818  101196701 1119031168 /var/log/apache2/tools.wmflabs.org-access.log
root@tools-legacy-redirector-2:~# grep -e '302\s.*GET\s*http:' /var/log/apache2/tools.wmflabs.org-access.log | wc
1423299 40169641 404624627
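# -> roughly 1.42M of the 3.42M logged requests (~42%) are plain-http GETs
#    that were answered with a 302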

Timing with curl:

# directly on the redirector host, via loopback
root@tools-legacy-redirector-2:~# time curl --silent  https://127.0.0.1/ -H "Host: tools.wmflabs.org" > /dev/null

real    0m7.569s
user    0m0.069s
sys     0m0.021s


root@tools-legacy-redirector-2:~# time curl --silent  http://127.0.0.1/ -H "Host: tools.wmflabs.org" > /dev/null

real    0m0.058s
user    0m0.008s
sys     0m0.014s
# same going through the external ip
dcaro@tools-proxy-8:~$ time curl --silent  https://tools.wmflabs.org/ > /dev/null

real	0m7.368s
user	0m0.071s
sys	0m0.022s
dcaro@tools-proxy-8:~$ time curl --silent  http://tools.wmflabs.org/ > /dev/null

real	0m0.028s
user	0m0.016s
sys	0m0.009s
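# -> plain http answers in well under 100ms both on loopback and through the
#    proxy, while https takes ~7.4s either way, so the delay is specific to the
#    https path rather than the network in between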

The probe has been failing more frequently in the past week:

Screenshot 2025-03-04 at 14.58.45.png

I rebooted the VM tools-legacy-redirector-2; let's see if it helps.

After the reboot, it started flapping at an almost regular interval:

Screenshot 2025-03-04 at 18.20.04.png

Grafana link (requires login): https://grafana.wmcloud.org/goto/-OuSu-tHz?orgId=1

I added a silence in Alertmanager until we figure out how to improve the situation.

Manually made some changes:

And it has not triggered since then; will leave it for a bit and persist the changes in puppet if things stay consistent through tomorrow.

Thanks @dcaro! I deleted my 7-day silence in alertmanager, so we'll get an email if the alert triggers again.

It failed overnight :/

Just re-tweaked the mpm_event config to start a lot more threads (they are cheap memory-wise) and keep them spare:

StartServers            4
MinSpareThreads         250
MaxSpareThreads         250
ThreadLimit             250
ThreadsPerChild         250
MaxRequestWorkers       1000
MaxConnectionsPerChild  0
AsyncRequestWorkerFactor 8
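# 4 StartServers x 250 ThreadsPerChild = 1000 MaxRequestWorkers; pinning
# Min/MaxSpareThreads at 250 keeps a large pool of idle threads warm between
# bursts, MaxConnectionsPerChild 0 never recycles children, and
# AsyncRequestWorkerFactor 8 lets each child juggle extra idle/keep-alive
# connections on top of its worker threads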

It started ~500 threads on restart (from the ~100 it was starting before).

Will keep an eye on it.

Change #1126511 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] tools-legacy-redirector: use a custom mpm_event config

https://gerrit.wikimedia.org/r/1126511

Change #1126511 merged by David Caro:

[operations/puppet@production] tools-legacy-redirector: use a custom mpm_event config

https://gerrit.wikimedia.org/r/1126511

Change #1123797 merged by Andrew Bogott:

[operations/puppet@production] toolforge_redirector: increase monitoring timeout

https://gerrit.wikimedia.org/r/1123797

It's been a while since I've seen any of these, closing.