I suspect the toolforge-legacy-redirector apache server is struggling. The server receives a fair amount of req/s but only has 1 CPU, so we may need to scale and tune the setup a bit.
Event Timeline
Mentioned in SAL (#wikimedia-cloud) [2025-02-07T20:51:07Z] <arturo> resize tools-legacy-redirector to have 2 vCPU T385908
One option to reduce load on this box would be to add the hostnames it serves to HSTS preload lists and then stop responding to anything on port 80. That should cut off a sizeable number of old abandoned clients requesting map things, while still ensuring that we respond to browsers when someone follows an old link from somewhere.
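A minimal sketch of what the HSTS side of that proposal could look like, assuming mod_headers is enabled (tools.wmflabs.org stands in for the hostnames this box serves; this is illustrative, not deployed config):

```apache
# Hedged sketch: emit an HSTS header meeting the preload-list
# requirements (max-age >= 1 year, includeSubDomains, preload).
# Assumes mod_headers is enabled.
<VirtualHost *:443>
    ServerName tools.wmflabs.org
    Header always set Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"
</VirtualHost>
```

One caveat: hstspreload.org only accepts domains that, at submission time, still answer on port 80 with a redirect to https, so port 80 could only be dropped after the preload entries have propagated.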
Change #1123797 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):
[operations/puppet@production] toolforge_redirector: increase monitoring timeout
It seems to be mostly https traffic that gets delayed (not sure yet why). @aborrero suggested skipping the local http->https redirect and instead redirecting http requests directly to https://*.toolforge.org; that would help a lot of the traffic and relieve some of the load (though CPU-wise it seems ok right now, so I'm not sure what resource is the bottleneck).
Some of the tests:
```
# Quick'n'dirty stats on how many http requests arrive vs total traffic
root@tools-legacy-redirector-2:~# wc /var/log/apache2/tools.wmflabs.org-access.log
  3421818 101196701 1119031168 /var/log/apache2/tools.wmflabs.org-access.log
root@tools-legacy-redirector-2:~# grep -e '302\s.*GET\s*http:' /var/log/apache2/tools.wmflabs.org-access.log | wc
  1423299 40169641 404624627
```
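From those counts one can estimate the share of plain-http requests; a quick sketch with the numbers above hardcoded (only an estimate, since the grep pattern may not catch every redirect):

```shell
# Rough share of plain-http GETs among all logged requests, using the
# line counts from the wc/grep above (values hardcoded from this task).
total=3421818        # total log lines
plain_http=1423299   # lines matching the http->https 302 redirect
awk -v t="$total" -v h="$plain_http" 'BEGIN { printf "%.1f%%\n", 100 * h / t }'
```

So roughly 40% of the logged traffic arrives over plain http, which is consistent with the idea that trimming the local http->https hop would relieve a good chunk of load.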
Time of curl:
```
# same going through the external ip
root@tools-legacy-redirector-2:~# time curl --silent https://127.0.0.1/ -H "Host: tools.wmflabs.org" > /dev/null

real    0m7.569s
user    0m0.069s
sys     0m0.021s

root@tools-legacy-redirector-2:~# time curl --silent http://127.0.0.1/ -H "Host: tools.wmflabs.org" > /dev/null

real    0m0.058s
user    0m0.008s
sys     0m0.014s
```
```
dcaro@tools-proxy-8:~$ time curl --silent https://tools.wmflabs.org/ > /dev/null

real    0m7.368s
user    0m0.071s
sys     0m0.022s

dcaro@tools-proxy-8:~$ time curl --silent http://tools.wmflabs.org/ > /dev/null

real    0m0.028s
user    0m0.016s
sys     0m0.009s
```
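To make such one-off timings comparable before and after a config tweak, one could record curl's `%{time_total}` over several runs and average them. A sketch of the aggregation step (the three sample values are made up for illustration, not real measurements):

```shell
# Averaging step for repeated timing samples, e.g. collected with:
#   curl --silent --output /dev/null --write-out '%{time_total}\n' https://tools.wmflabs.org/
# The three values below are illustrative placeholders.
printf '7.368\n7.421\n7.102\n' |
  awk '{ sum += $1; n++ } END { printf "avg=%.3fs over %d runs\n", sum / n, n }'
```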
The probe has been failing more frequently in the past week:
I rebooted the VM tools-legacy-redirector-2, let's see if it helps.
After the reboot, it started flapping at an almost regular interval:
Grafana link (requires login): https://grafana.wmcloud.org/goto/-OuSu-tHz?orgId=1
I added a silence in Alertmanager until we figure out how to improve the situation.
Manually did some changes:
- Copied the list of tool redirects from the https section to the http section, to avoid the extra redirect chain http -> ourselves (https) -> toolforge.org
- Set AsyncRequestWorkerFactor to 4 (https://httpd.apache.org/docs/current/mod/event.html#asyncrequestworkerfactor) in /etc/apache2/mods-enabled/mpm_event.conf
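Per the mod event docs linked above, the event MPM accepts roughly (AsyncRequestWorkerFactor + 1) * MaxRequestWorkers concurrent connections. A sketch of that arithmetic, assuming the Debian default MaxRequestWorkers of 150 (the actual value on this host is not stated here):

```shell
# Connection ceiling per the mpm_event formula:
#   max_connections = (AsyncRequestWorkerFactor + 1) * MaxRequestWorkers
factor=4      # AsyncRequestWorkerFactor, as set in this change
workers=150   # assumed MaxRequestWorkers (Debian default; actual value unknown)
awk -v f="$factor" -v w="$workers" \
  'BEGIN { printf "max concurrent connections = %d\n", (f + 1) * w }'
```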
And it has not triggered again since then; I'll leave it for a bit and persist the changes in puppet if things are still stable tomorrow.
Thanks @dcaro! I deleted my 7-day silence in alertmanager, so we'll get an email if the alert triggers again.
It failed overnight :/
Just re-tweaked the mpm_event config to start a lot more threads (which are cheap memory-wise) and keep them spare:
```
StartServers              4
MinSpareThreads           250
MaxSpareThreads           250
ThreadLimit               250
ThreadsPerChild           250
MaxRequestWorkers         1000
MaxConnectionsPerChild    0
AsyncRequestWorkerFactor  8
```
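A quick sanity check of the thread math in that config (a sketch; the values are copied from the block above):

```shell
# With the event MPM, each child process runs ThreadsPerChild threads,
# so reaching MaxRequestWorkers needs MaxRequestWorkers / ThreadsPerChild
# child processes; the async factor then raises the connection ceiling.
tpc=250       # ThreadsPerChild
mrw=1000      # MaxRequestWorkers
factor=8      # AsyncRequestWorkerFactor
awk -v t="$tpc" -v m="$mrw" -v f="$factor" 'BEGIN {
  printf "children needed at peak: %d\n", m / t
  printf "connection ceiling: %d\n", (f + 1) * m
}'
```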
It started ~500 threads on restart (from the ~100 it was starting before).
Will keep an eye on it.
Change #1126511 had a related patch set uploaded (by David Caro; author: David Caro):
[operations/puppet@production] tools-legacy-redirector: use a custom mpm_event config
Change #1126511 merged by David Caro:
[operations/puppet@production] tools-legacy-redirector: use a custom mpm_event config
Change #1123797 merged by Andrew Bogott:
[operations/puppet@production] toolforge_redirector: increase monitoring timeout