T119736 showed that we sometimes fail to recognize the severity of bugs that have a substantial impact on users. To make sure nothing slips through the cracks, we should have monitoring and alerting on a small set of key "business" metrics. Namely:
- Logins, sign-ups, account creation - T146090
- Edits
- Exceptions / fatals
- MediaWiki load time - T146125
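
As a rough illustration of what alerting on these metrics could look like, here is a minimal sketch that compares the latest value of each metric against the median of recent samples. The metric names, ratios, and the `check_metric` helper are all hypothetical and not taken from any existing monitoring setup. The assumption is that activity metrics (logins, account creations, edits) should alert on a sudden drop, while exceptions and load time should alert on a spike.

```python
from statistics import median

# Hypothetical rules: metric name -> (direction to alert on, ratio vs. baseline).
# The names and ratios are illustrative assumptions, not real configuration.
RULES = {
    "logins": ("drop", 0.5),
    "account_creations": ("drop", 0.5),
    "edits": ("drop", 0.5),
    "exceptions": ("spike", 2.0),
    "load_time_ms": ("spike", 1.5),
}


def check_metric(name, current, history):
    """Return an alert message, or None if the metric looks normal.

    `history` is a list of recent samples for the same metric; its median
    is used as the baseline.
    """
    direction, ratio = RULES[name]
    if not history:
        return None  # no baseline yet, so nothing to compare against
    baseline = median(history)
    threshold = baseline * ratio
    if direction == "drop" and current < threshold:
        return f"{name} dropped to {current} (baseline {baseline})"
    if direction == "spike" and current > threshold:
        return f"{name} spiked to {current} (baseline {baseline})"
    return None


if __name__ == "__main__":
    # Made-up sample values, only to exercise the helper.
    print(check_metric("edits", 40, [100, 110, 90]))
    print(check_metric("exceptions", 30, [10, 12, 8]))
```

A fixed ratio against a median is only one possible choice. Real traffic has daily and weekly cycles, so a production alert would probably need to compare against the same time window on previous days.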
See also:
Related incident: https://wikitech.wikimedia.org/wiki/Incident_documentation/20160915-MediaWiki