During T399281 (2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures), many Toolforge tools were unavailable (tools-proxy-9 was returning an error page).
Yet no paging alert was sent, and there was no Toolforge-related alert either.
The full list of alerts that fired is available at https://wm-bot.wmcloud.org/logs/%23wikimedia-cloud-feed/20250711.txt
Some useful ones that could be configured to send a page:
- FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin
- FIRING: CephSlowOps: Ceph cluster in eqiad has 1451 slow ops
- FIRING: [3x] InstanceDown: Project tools instance tools-elastic-6 is down
- FIRING: WidespreadInstanceDown: Widespread instances down in project cloudinfra
- FIRING: ToolsNFSDown: No tools nfs services running found
- FIRING: CloudinfraMariaDBWritableState: There should be exactly one writable MariaDB instance instead of 0
- FIRING: WidespreadInstanceDown: Widespread instances down in project tools
The last one (Widespread instances down in project tools) would be my favourite, but it only started firing a few hours into the outage, after we had already got multiple reports from users.
Maybe paging on both Widespread instances down in project tools and Widespread instances down in project cloudinfra could be a decent solution.
Ideally, we would have probes tracking a number of tools, and we could page when the percentage of unresponsive tools is higher than a threshold (for example, 50%).
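
A rough sketch of what such a probe could compute, in Python. Everything here is an assumption for illustration: the function names, the 5 second timeout, treating any HTTP status below 500 as "responsive", and the placeholder tool URLs. It is not an existing Toolforge check, and a real setup would more likely feed these results into the existing alerting stack rather than run as a standalone script.

```python
from typing import Callable, Iterable
import urllib.error
import urllib.request


def is_responsive(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with an HTTP status below 500.

    Assumption: a 4xx still means the proxy and the tool answered,
    while a 5xx, a timeout or a connection error counts as down.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx and 5xx responses.
        return err.code < 500
    except OSError:
        # Covers URLError, timeouts and other connection failures.
        return False


def unresponsive_ratio(
    urls: Iterable[str],
    probe: Callable[[str], bool] = is_responsive,
) -> float:
    """Fraction of probed URLs that did not respond (0.0 for an empty list)."""
    url_list = list(urls)
    if not url_list:
        return 0.0
    down = sum(1 for url in url_list if not probe(url))
    return down / len(url_list)


def should_page(
    urls: Iterable[str],
    threshold: float = 0.5,
    probe: Callable[[str], bool] = is_responsive,
) -> bool:
    """Page only when the unresponsive fraction is strictly above the threshold."""
    return unresponsive_ratio(urls, probe) > threshold


if __name__ == "__main__":
    # Hypothetical sample of tools to track; replace with a real list.
    sample = [
        "https://example-tool-a.toolforge.org/",
        "https://example-tool-b.toolforge.org/",
    ]
    print("page" if should_page(sample) else "ok")
```

The probe function is passed in as a parameter so the threshold logic can be exercised without network access. The set of tools to sample and the exact threshold would still need to be decided.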