Pushing wmf.19 last week broke all account creation for almost 18 hours. The error rates are already being collected, as shown on T145839#2643508:
https://grafana.wikimedia.org/dashboard/db/authentication-metrics?panelId=14&fullscreen
A single monitoring probe against that metric would have caught the breakage much earlier. We only found it the day after, when a daily browser test failed.
The easy first task is to add an Icinga probe that checks the above metric and raises an alert. I have no idea, though, where the notification should be sent and to whom.
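To make the idea concrete, here is a rough sketch of what such a probe could look like as a Nagios/Icinga-style plugin. It assumes the dashboard panel is backed by a Graphite metric reachable through the Graphite render API. The host `graphite.example.org`, the metric path, the 15 minute window and the warning/critical thresholds are all placeholders I made up; the real values would have to come from the panel definition. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard plugin convention.

```python
import json
import urllib.parse
import urllib.request

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

# Hypothetical values: replace with the real Graphite host and the metric
# used by the account creation failure panel.
GRAPHITE_URL = "https://graphite.example.org/render"
METRIC = "example.authmanager.accountcreation.failure.rate"


def evaluate(datapoints, warn, crit):
    """Map Graphite datapoints ([value, timestamp] pairs) to a plugin status."""
    # Graphite reports missing samples as null, which json decodes to None.
    values = [value for value, _timestamp in datapoints if value is not None]
    if not values:
        return UNKNOWN, "no data points returned"
    average = sum(values) / len(values)
    if average >= crit:
        return CRITICAL, "average failure rate %.2f >= %.2f" % (average, crit)
    if average >= warn:
        return WARNING, "average failure rate %.2f >= %.2f" % (average, warn)
    return OK, "average failure rate %.2f" % average


def fetch_datapoints(url, metric, window="-15min"):
    """Fetch recent datapoints for one metric from the Graphite render API."""
    query = urllib.parse.urlencode(
        {"target": metric, "from": window, "format": "json"}
    )
    with urllib.request.urlopen(url + "?" + query, timeout=10) as response:
        series = json.load(response)
    return series[0]["datapoints"] if series else []


def main(warn=5.0, crit=20.0):
    """Run the check once; thresholds are placeholders, not tuned values."""
    try:
        points = fetch_datapoints(GRAPHITE_URL, METRIC)
    except Exception as exc:  # network or parsing failure
        print("UNKNOWN: %s" % exc)
        return UNKNOWN
    status, message = evaluate(points, warn, crit)
    print("%s: %s" % (LABELS[status], message))
    return status
```

Icinga would run this through a small wrapper that calls `sys.exit(main())`, so the return value becomes the exit code of the check. Averaging over a short window rather than alerting on a single sample is only one possible choice; it trades a few minutes of delay for fewer false alarms.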
Related incidents: