
HAProxy frontend session limit hit (repeat outage)
Closed, Resolved · Public · BUG REPORT

Description

It looks like the frontend HAProxy session limit was hit again today.

Screenshot 2025-11-18 at 22.36.19.png (612×1 px, 135 KB)

This caused alerts to be sent to the maintainers of multiple tools:

Screenshot 2025-11-18 at 22.36.54.png (822×1 px, 158 KB)

time=2025-11-18T12:42:16.241Z level=DEBUG source=notify.go:878 msg="Notify success" component=dispatcher receiver=email_contacts integration=email[0] aggrGroup="{}:{alertname=\"ReportInterfaceDown\"}" attempts=1 duration=432.39702ms alerts=[ReportInterfaceDown[c39c10a][active]]
time=2025-11-18T12:42:16.362Z level=DEBUG source=notify.go:878 msg="Notify success" component=dispatcher receiver=email_contacts integration=email[1] aggrGroup="{}:{alertname=\"ReportInterfaceDown\"}" attempts=1 duration=553.116317ms alerts=[ReportInterfaceDown[c39c10a][active]]
time=2025-11-18T12:42:25.984Z level=DEBUG source=notify.go:878 msg="Notify success" component=dispatcher receiver=email_contacts integration=email[0] aggrGroup="{}:{alertname=\"TrainingDataDown\"}" attempts=1 duration=235.314882ms alerts=[TrainingDataDown[4d38484][active]]
time=2025-11-18T12:42:26.100Z level=DEBUG source=notify.go:878 msg="Notify success" component=dispatcher receiver=email_contacts integration=email[1] aggrGroup="{}:{alertname=\"TrainingDataDown\"}" attempts=1 duration=351.014641ms alerts=[TrainingDataDown[4d38484][active]]
time=2025-11-18T12:42:57.761Z level=DEBUG source=notify.go:878 msg="Notify success" component=dispatcher receiver=email_contacts integration=email[1] aggrGroup="{}:{alertname=\"ReviewInterfaceDown\"}" attempts=1 duration=480.65231ms alerts=[ReviewInterfaceDown[53cd55d][active]]
time=2025-11-18T12:42:57.763Z level=DEBUG source=notify.go:878 msg="Notify success" component=dispatcher receiver=email_contacts integration=email[0] aggrGroup="{}:{alertname=\"ReviewInterfaceDown\"}" attempts=1 duration=482.581124ms alerts=[ReviewInterfaceDown[53cd55d][active]]
time=2025-11-18T15:34:16.228Z level=DEBUG source=notify.go:878 msg="Notify success" component=dispatcher receiver=email_contacts integration=email[0] aggrGroup="{}:{alertname=\"ReportInterfaceDown\"}" attempts=1 duration=420.785413ms alerts=[ReportInterfaceDown[c39c10a][active]]
time=2025-11-18T15:34:17.559Z level=DEBUG source=notify.go:878 msg="Notify success" component=dispatcher receiver=email_contacts integration=email[1] aggrGroup="{}:{alertname=\"ReportInterfaceDown\"}" attempts=1 duration=1.751825757s alerts=[ReportInterfaceDown[c39c10a][active]]
time=2025-11-18T15:34:26.151Z level=DEBUG source=notify.go:878 msg="Notify success" component=dispatcher receiver=email_contacts integration=email[1] aggrGroup="{}:{alertname=\"TrainingDataDown\"}" attempts=1 duration=402.61168ms alerts=[TrainingDataDown[4d38484][active]]
time=2025-11-18T15:34:26.163Z level=DEBUG source=notify.go:878 msg="Notify success" component=dispatcher receiver=email_contacts integration=email[0] aggrGroup="{}:{alertname=\"TrainingDataDown\"}" attempts=1 duration=414.449609ms alerts=[TrainingDataDown[4d38484][active]]
time=2025-11-18T15:34:57.549Z level=DEBUG source=notify.go:878 msg="Notify success" component=dispatcher receiver=email_contacts integration=email[0] aggrGroup="{}:{alertname=\"ReviewInterfaceDown\"}" attempts=1 duration=265.518012ms alerts=[ReviewInterfaceDown[53cd55d][active]]
time=2025-11-18T15:34:57.555Z level=DEBUG source=notify.go:878 msg="Notify success" component=dispatcher receiver=email_contacts integration=email[1] aggrGroup="{}:{alertname=\"ReviewInterfaceDown\"}" attempts=1 duration=271.438908ms alerts=[ReviewInterfaceDown[53cd55d][active]]
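The Alertmanager debug lines above can be summarised per alert with a small script. A minimal sketch, assuming only the log format shown here (the `LOG_LINES` sample is abbreviated from the excerpt, not a complete capture):

```python
import re
from collections import Counter

# Abbreviated sample lines in the same format as the excerpt above.
LOG_LINES = [
    'time=2025-11-18T12:42:16.241Z level=DEBUG msg="Notify success" aggrGroup="{}:{alertname=\\"ReportInterfaceDown\\"}"',
    'time=2025-11-18T12:42:25.984Z level=DEBUG msg="Notify success" aggrGroup="{}:{alertname=\\"TrainingDataDown\\"}"',
    'time=2025-11-18T15:34:16.228Z level=DEBUG msg="Notify success" aggrGroup="{}:{alertname=\\"ReportInterfaceDown\\"}"',
]

# Matches the escaped alertname inside the aggrGroup field.
ALERT_RE = re.compile(r'alertname=\\"(?P<name>[^\\"]+)\\"')

def count_notifications(lines):
    """Count successful notification dispatches per alert name."""
    counts = Counter()
    for line in lines:
        if 'msg="Notify success"' not in line:
            continue
        m = ALERT_RE.search(line)
        if m:
            counts[m.group("name")] += 1
    return counts
```

Running this over the full excerpt would show each alert firing in two separate waves (around 12:42 and 15:34 UTC), matching the repeat-outage pattern described.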

As this infrastructure is shared by all tools, further hardening should be applied so that a single tool cannot impact the others.
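One hardening option at this layer is per-client connection tracking on the HAProxy frontend, so that a single misbehaving client or tool cannot exhaust the shared session pool. A minimal sketch using standard HAProxy directives (the frontend name, bind address, and all thresholds are hypothetical, not taken from the actual Toolforge configuration):

```haproxy
frontend tools_ingress
    bind :443
    # Cap total concurrent sessions for this frontend (hypothetical value).
    maxconn 5000

    # Track concurrent connections and connection rate per source IP.
    stick-table type ip size 100k expire 30s store conn_cur,conn_rate(10s)
    tcp-request connection track-sc0 src

    # Reject clients holding too many connections or connecting too fast,
    # so one client cannot starve the shared frontend session pool.
    tcp-request connection reject if { sc0_conn_cur gt 100 }
    tcp-request connection reject if { sc0_conn_rate gt 200 }
```

Per-source limits only help when the load comes from distinct clients; limits keyed on the backend (per tool) would be needed to contain a tool whose own traffic pattern saturates the frontend.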

I have also observed intermittent 500 (branded) error pages being returned for ingress services that are healthy. Most tools have been changed to use internal service names to work around the unreliability of the ingress infrastructure; however, this sets them up to break down the line if the cluster is split or expanded.
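Intermittent 500s against a healthy backend can be quantified by probing the ingress endpoint repeatedly and computing the 5xx fraction. A minimal sketch of the bookkeeping only (the HTTP probing itself is omitted, and the sample values are illustrative, not real measurements):

```python
def error_rate(status_codes):
    """Fraction of responses that were server errors (5xx)."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)

# Hypothetical statuses collected from repeated probes of a
# healthy-looking ingress service.
samples = [200, 200, 500, 200, 200, 200, 500, 200, 200, 200]
```

A sustained nonzero rate against a backend whose direct internal service name answers cleanly would point at the ingress layer rather than the tool itself.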

Event Timeline

Restricted Application added a subscriber: Aklapper.
taavi subscribed.

This outage is over and the particular issue was fixed with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1206902. Follow-ups are tracked in non-public T410433.

I got another alert around 00:28 CET with the same symptoms as this; I will investigate when I have a moment.

I can't subscribe to T410433, so I am reporting issues as observed until there is a long-term resolution.