It looks like the frontend HAProxy session limit was hit again today.
This caused alerts to be sent to the maintainers of multiple tools:
time=2025-11-18T12:42:16.241Z level=DEBUG source=notify.go:878 msg="Notify success" component=dispatcher receiver=email_contacts integration=email[0] aggrGroup="{}:{alertname=\"ReportInterfaceDown\"}" attempts=1 duration=432.39702ms alerts=[ReportInterfaceDown[c39c10a][active]]
time=2025-11-18T12:42:16.362Z level=DEBUG source=notify.go:878 msg="Notify success" component=dispatcher receiver=email_contacts integration=email[1] aggrGroup="{}:{alertname=\"ReportInterfaceDown\"}" attempts=1 duration=553.116317ms alerts=[ReportInterfaceDown[c39c10a][active]]
time=2025-11-18T12:42:25.984Z level=DEBUG source=notify.go:878 msg="Notify success" component=dispatcher receiver=email_contacts integration=email[0] aggrGroup="{}:{alertname=\"TrainingDataDown\"}" attempts=1 duration=235.314882ms alerts=[TrainingDataDown[4d38484][active]]
time=2025-11-18T12:42:26.100Z level=DEBUG source=notify.go:878 msg="Notify success" component=dispatcher receiver=email_contacts integration=email[1] aggrGroup="{}:{alertname=\"TrainingDataDown\"}" attempts=1 duration=351.014641ms alerts=[TrainingDataDown[4d38484][active]]
time=2025-11-18T12:42:57.761Z level=DEBUG source=notify.go:878 msg="Notify success" component=dispatcher receiver=email_contacts integration=email[1] aggrGroup="{}:{alertname=\"ReviewInterfaceDown\"}" attempts=1 duration=480.65231ms alerts=[ReviewInterfaceDown[53cd55d][active]]
time=2025-11-18T12:42:57.763Z level=DEBUG source=notify.go:878 msg="Notify success" component=dispatcher receiver=email_contacts integration=email[0] aggrGroup="{}:{alertname=\"ReviewInterfaceDown\"}" attempts=1 duration=482.581124ms alerts=[ReviewInterfaceDown[53cd55d][active]]
time=2025-11-18T15:34:16.228Z level=DEBUG source=notify.go:878 msg="Notify success" component=dispatcher receiver=email_contacts integration=email[0] aggrGroup="{}:{alertname=\"ReportInterfaceDown\"}" attempts=1 duration=420.785413ms alerts=[ReportInterfaceDown[c39c10a][active]]
time=2025-11-18T15:34:17.559Z level=DEBUG source=notify.go:878 msg="Notify success" component=dispatcher receiver=email_contacts integration=email[1] aggrGroup="{}:{alertname=\"ReportInterfaceDown\"}" attempts=1 duration=1.751825757s alerts=[ReportInterfaceDown[c39c10a][active]]
time=2025-11-18T15:34:26.151Z level=DEBUG source=notify.go:878 msg="Notify success" component=dispatcher receiver=email_contacts integration=email[1] aggrGroup="{}:{alertname=\"TrainingDataDown\"}" attempts=1 duration=402.61168ms alerts=[TrainingDataDown[4d38484][active]]
time=2025-11-18T15:34:26.163Z level=DEBUG source=notify.go:878 msg="Notify success" component=dispatcher receiver=email_contacts integration=email[0] aggrGroup="{}:{alertname=\"TrainingDataDown\"}" attempts=1 duration=414.449609ms alerts=[TrainingDataDown[4d38484][active]]
time=2025-11-18T15:34:57.549Z level=DEBUG source=notify.go:878 msg="Notify success" component=dispatcher receiver=email_contacts integration=email[0] aggrGroup="{}:{alertname=\"ReviewInterfaceDown\"}" attempts=1 duration=265.518012ms alerts=[ReviewInterfaceDown[53cd55d][active]]
time=2025-11-18T15:34:57.555Z level=DEBUG source=notify.go:878 msg="Notify success" component=dispatcher receiver=email_contacts integration=email[1] aggrGroup="{}:{alertname=\"ReviewInterfaceDown\"}" attempts=1 duration=271.438908ms alerts=[ReviewInterfaceDown[53cd55d][active]]

As this infrastructure is shared by all tools, further hardening should be applied so that one tool cannot impact the others.
I have also observed intermittent (branded) 500 error pages being returned for ingress services that are healthy. Most tools have been changed to use internal service names to work around the unreliability of the ingress infrastructure; however, this sets them up to break later if the cluster is split or expanded.

