
[k8s,infra,o11y] Add paging alert when many tools are unreachable
Closed, Resolved · Public

Description

During T399281: 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures many Toolforge tools were unavailable (tools-proxy-9 was returning an error page).

Yet no paging alert was sent, and no Toolforge-related alert fired either.

You can check the full list of alerts that fired at https://wm-bot.wmcloud.org/logs/%23wikimedia-cloud-feed/20250711.txt

Some useful ones that could be configured to send a page:

  • FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin
  • FIRING: CephSlowOps: Ceph cluster in eqiad has 1451 slow ops
  • FIRING: [3x] InstanceDown: Project tools instance tools-elastic-6 is down
  • FIRING: WidespreadInstanceDown: Widespread instances down in project cloudinfra
  • FIRING: ToolsNFSDown: No tools nfs services running found
  • FIRING: CloudinfraMariaDBWritableState: There should be exactly one writable MariaDB instance instead of 0
  • FIRING: WidespreadInstanceDown: Widespread instances down in project tools

The last one (Widespread instances down in project tools) would be my favourite, but it only started firing a few hours into the outage, after we had already received multiple reports from users.

Maybe paging on both Widespread instances down in project tools and Widespread instances down in project cloudinfra could be a decent solution.

Ideally, we would have probes tracking a number of tools, and we could page when the percentage of unresponsive tools is higher than a threshold (50% or something like that).
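A rule along these lines could be sketched as a Prometheus alert. This is only an illustration, not an existing rule: it assumes the probes export a blackbox-exporter-style `probe_success` metric with a `project="tools"` label, one series per probed tool; the alert and label names are hypothetical.

```yaml
groups:
  - name: toolforge-availability
    rules:
      # Page when more than 50% of probed tools are unresponsive.
      # Assumes one probe_success series per tool (hypothetical labels).
      - alert: ToolforgeWidespreadToolsDown
        expr: |
          count(probe_success{project="tools"} == 0)
            / count(probe_success{project="tools"}) > 0.5
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "More than half of the probed Toolforge tools are unreachable"
```

The `for: 5m` guard is a trade-off between paging quickly and not paging on a transient blip; the threshold and duration would need tuning.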

Related Objects

Event Timeline

Mentioned in SAL (#wikimedia-cloud-feed) [2025-07-22T07:44:28Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.ceph.osd.depool_and_destroy (T399870)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-07-22T07:45:19Z] <dcaro@cloudcumin1001> END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) (T399870)

Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1003 for host cloudcephosd1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1003 for host cloudcephosd1006.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1006 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1006.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.
dcaro renamed this task from Add paging alert when many tools are unreachable to [k8s,infra,o11y] Add paging alert when many tools are unreachable. Jul 22 2025, 12:51 PM
dcaro triaged this task as High priority.
dcaro moved this task from Backlog to Ready to be worked on on the Toolforge board.

> Ideally, we would have probes tracking a number of tools, and we could page when the percentage of unresponsive tools is higher than a threshold (50% or something like that).

Instead of probes, what about measuring the percentage or rate of 5xx errors returned for real user traffic?

> Instead of probes, what about measuring the percentage or rate of 5xx errors returned for real user traffic?

+1, this seems like a viable approach. Do we already have metrics tracking this?
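If such a metric existed, the error ratio could be expressed in PromQL roughly as follows. This is only a sketch: the metric name `toolforge_proxy_requests_total` and its labels are hypothetical placeholders for whatever the chosen exporter would actually emit.

```promql
# Fraction of real user requests answered with a 5xx over the last 5 minutes
# (hypothetical metric name; would come from mtail or an HAProxy exporter).
sum(rate(toolforge_proxy_requests_total{status=~"5.."}[5m]))
  / sum(rate(toolforge_proxy_requests_total[5m]))
```

One caveat with ratio-based alerting on real traffic: during very low-traffic periods a handful of errors can push the ratio over any threshold, so a minimum request-rate condition is usually added alongside it.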

Not at the moment, I think. I see two options for collecting that: using mtail to collect it at the front proxy nginx level[0], or running HAProxy in HTTP mode (instead of the TCP mode we currently use) and collecting metrics at that level. I prefer the latter option.

[0]: nginx-exporter doesn't have metrics for this in the free software version of nginx
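For the HAProxy option, a minimal sketch of what HTTP mode plus the built-in Prometheus exporter could look like. The frontend/backend names and addresses are hypothetical, and this assumes HAProxy >= 2.0 (which ships the `prometheus-exporter` service); in HTTP mode HAProxy exposes per-status-class counters such as `haproxy_backend_http_responses_total{code="5xx"}`.

```
# Sketch only: hypothetical names/addresses, HAProxy >= 2.0 assumed.
frontend tools-web
    mode http                      # HTTP mode, so per-status metrics exist
    bind :443 ssl crt /etc/ssl/toolforge.pem
    default_backend tools-proxy

backend tools-proxy
    mode http
    server proxy-1 tools-proxy-9:80 check

frontend metrics
    mode http
    bind :8405
    # Serve Prometheus metrics from the built-in exporter
    http-request use-service prometheus-exporter if { path /metrics }
```

Switching from TCP to HTTP mode also means HAProxy terminates and re-emits the HTTP requests, which has implications beyond metrics (headers, keep-alive, TLS handling) that would need checking.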