During {T399281} many Toolforge tools were unavailable (tools-proxy-9 was returning an error page).
Yet no paging alert was sent, nor any Toolforge-related alert.
You can check the full list of alerts that fired at https://wm-bot.wmcloud.org/logs/%23wikimedia-cloud-feed/20250711.txt
Some useful ones that could be configured to send a page:
* FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin
* FIRING: CephSlowOps: Ceph cluster in eqiad has 1451 slow ops
* FIRING: [3x] InstanceDown: Project tools instance tools-elastic-6 is down
* FIRING: WidespreadInstanceDown: Widespread instances down in project cloudinfra
* FIRING: ToolsNFSDown: No tools nfs services running found
* FIRING: CloudinfraMariaDBWritableState: There should be exactly one writable MariaDB instance instead of 0
* FIRING: WidespreadInstanceDown: Widespread instances down in project tools
The last one (//Widespread instances down in project tools//) would be my favourite, but it only started firing a few hours into the outage, after we had already got multiple reports from users.
Maybe paging on both //Widespread instances down in project tools// and //Widespread instances down in project cloudinfra// could be a decent solution.
Ideally, we would have probes tracking a number of tools, and we could page when the percentage of unresponsive tools is higher than a threshold (50% or something like that).
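
A rough sketch of what that check could look like, just to make the idea concrete. Everything here is an assumption for illustration: the function names (`http_probe`, `unresponsive_ratio`, `should_page`), the 5 second timeout, and the choice to treat any 5xx response (such as the proxy error page) as "unresponsive" are not taken from any existing setup.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a non-5xx status within the timeout.

    Assumption: a 5xx (e.g. the proxy error page) means the tool is down,
    while a 4xx still proves that the proxy and the tool are answering.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        return err.code < 500
    except OSError:
        # Covers URLError, connection errors and timeouts.
        return False


def unresponsive_ratio(
    urls: Iterable[str], probe: Callable[[str], bool] = http_probe
) -> float:
    """Fraction of probed URLs that did not respond (0.0 for an empty list)."""
    urls = list(urls)
    if not urls:
        return 0.0
    failed = sum(1 for url in urls if not probe(url))
    return failed / len(urls)


def should_page(
    urls: Iterable[str],
    threshold: float = 0.5,
    probe: Callable[[str], bool] = http_probe,
) -> bool:
    """Page only when the unresponsive fraction is strictly above the threshold."""
    return unresponsive_ratio(urls, probe) > threshold
```

The `probe` argument is injectable so the threshold logic can be exercised without touching the network, for example with `should_page(urls, probe=lambda url: False)` to simulate a full outage. In practice the same rule would probably live in the existing alerting stack rather than in a standalone script; the sketch only illustrates the "more than 50% of sampled tools are unresponsive" condition.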