Pushing wmf.19 last week broke all account creation for almost 18 hours. The error rates are definitely collected, as shown on T145839#2643508.
A single monitoring probe against that metric would definitely have prevented the fiasco. We only found the problem the day after, when a daily browser test failed.
The easy first task is to add an Icinga probe that checks the above metric and raises an alarm. I have no idea, though, where the notification should be sent and to whom.
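
For illustration only, here is a rough sketch of what such a probe could look like as a Nagios-style plugin, which Icinga can run. It assumes the metric is readable through a Graphite-style render endpoint returning JSON. That is an assumption on my part, and the host name, metric path, and thresholds below are all hypothetical placeholders, not the real values from T145839.

```python
#!/usr/bin/env python3
"""Nagios/Icinga-style check: alert when an error-rate metric gets too high."""
import json
import sys
import urllib.parse
import urllib.request

# Standard Nagios plugin exit codes, which Icinga also understands.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical placeholders: the real host, metric path and thresholds
# are not stated in this task and would need to be filled in.
GRAPHITE_URL = "https://graphite.example.org"
METRIC = "hypothetical.accountcreation.error.rate"
WARN_THRESHOLD = 5
CRIT_THRESHOLD = 20


def evaluate(datapoints, warn, crit):
    """Map [value, timestamp] pairs to an (exit code, message) tuple."""
    # Graphite reports gaps as null values; ignore them.
    values = [value for value, _ts in datapoints if value is not None]
    if not values:
        return UNKNOWN, "UNKNOWN: no datapoints for metric"
    worst = max(values)
    if worst >= crit:
        return CRITICAL, f"CRITICAL: error rate {worst} >= {crit}"
    if worst >= warn:
        return WARNING, f"WARNING: error rate {worst} >= {warn}"
    return OK, f"OK: error rate {worst}"


def fetch_datapoints(base_url, metric, window="-15min"):
    """Fetch recent datapoints for one metric from a Graphite-style render endpoint."""
    query = urllib.parse.urlencode(
        {"target": metric, "from": window, "format": "json"}
    )
    with urllib.request.urlopen(f"{base_url}/render?{query}", timeout=10) as resp:
        series = json.load(resp)
    return series[0]["datapoints"] if series else []


def main():
    """Run the check once and return the plugin exit code."""
    try:
        points = fetch_datapoints(GRAPHITE_URL, METRIC)
    except Exception as exc:  # a broken probe should report UNKNOWN, not crash
        print(f"UNKNOWN: could not query metric: {exc}")
        return UNKNOWN
    code, message = evaluate(points, WARN_THRESHOLD, CRIT_THRESHOLD)
    print(message)
    return code
```

To use it as an actual plugin, the script would end with `sys.exit(main())` so Icinga sees the exit code. The threshold logic lives in `evaluate` so it can be tested without network access. Who gets notified is a matter of Icinga contact configuration and is deliberately left out here.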