We now log client-side errors on Wikimedia wikis, averaging approximately 6,000 errors per 24-hour period once errors originating from code we do not maintain, older browsers, and browser extensions are filtered out:
https://logstash.wikimedia.org/app/dashboards#/view/AXDBY8Qhh3Uj6x1zCF56?_g=h@2b7c814&_a=h@a9bf6d5
At the time of writing, each distinct client-side error occurs with a frequency of less than 500.
On a few occasions I've manually checked this board after a group 1 deploy and created UBN tasks for any new error with a frequency greater than 1,000 in a 12-hour period, or greater than 100 in the last hour impacting more than one IP address (or anonymized session identifier).
We're currently in a good state of code health, so going forward we should maintain this bar by blocking or rolling back any newly introduced code that creates problems.
I've previously discussed this with @thcipriani @brennen and @Jrbranaa, and I'm keen to "codify" it into our deploy process. Practically speaking, this would mean halting the train and creating a new UBN task blocking the train whenever a bug with a frequency greater than 100 per hour is encountered. This could be done by manually glancing at the dashboard or via some kind of alerting system (TBD).
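To make the proposed rule concrete, here's a minimal sketch of the decision logic as a pure function. The `ErrorStats` shape, field names, and `should_block_train` helper are all hypothetical (how counts would actually be pulled from Logstash/Elasticsearch is out of scope here); the thresholds are the ones described above.

```python
from dataclasses import dataclass

@dataclass
class ErrorStats:
    """Aggregated counts for one normalized error message.
    Hypothetical shape -- in practice these would come from a
    Logstash/Elasticsearch aggregation query."""
    message: str
    count_last_hour: int
    count_last_12h: int
    distinct_sessions: int  # distinct IPs or anonymized session identifiers

def should_block_train(stats: ErrorStats) -> bool:
    """Apply the thresholds proposed above: block the train on
    > 1,000 errors in a 12-hour period, or > 100 errors in the
    last hour affecting more than one session."""
    if stats.count_last_12h > 1000:
        return True
    if stats.count_last_hour > 100 and stats.distinct_sessions > 1:
        return True
    return False
```

For example, an error seen 150 times in the last hour across 3 sessions would block the train, while a single-session error seen 50 times would not.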
In terms of when to roll back: this would depend on the response to the UBN task, as rolling back changes can sometimes make things worse given our approach to caching. I'm looking into generating alerts to indicate when an error is high-volume enough that we might want to roll back the train; that threshold will be considerably higher and based on total errors.
Let me know your thoughts!