
Deploy process should be updated so JavaScript errors block the train
Closed, Resolved · Public

Description

We now log client-side errors on Wikimedia wikis, averaging approximately 6,000 errors per 24hr period once errors originating from code we do not maintain, older browsers, and browser extensions are filtered out:
https://logstash.wikimedia.org/app/dashboards#/view/AXDBY8Qhh3Uj6x1zCF56?_g=h@2b7c814&_a=h@a9bf6d5
At the time of writing, every distinct client-side error occurs with a frequency of less than 500.
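
For reference, counts like these can be pulled programmatically from the Logstash backend (Elasticsearch). Below is a minimal sketch; the endpoint, index pattern, and the `type` and `normalized_message` field names are assumptions and would need to match the real client-error index.

```python
# Sketch: per-error counts for the last 24 hours via an Elasticsearch
# terms aggregation. Endpoint, index pattern, and field names are
# placeholders, not the production values.
import requests

LOGSTASH_URL = "https://logstash.example.org:9200"  # placeholder endpoint
INDEX = "logstash-*"                                # assumed index pattern

query = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"term": {"type": "clienterror"}},             # assumed field
                {"range": {"@timestamp": {"gte": "now-24h"}}},
            ]
        }
    },
    "aggs": {
        # Bucket by a normalized message so each distinct error gets a count.
        "by_error": {"terms": {"field": "normalized_message.keyword", "size": 50}}
    },
}

resp = requests.post(f"{LOGSTASH_URL}/{INDEX}/_search", json=query, timeout=30)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["by_error"]["buckets"]:
    print(f'{bucket["doc_count"]:>6}  {bucket["key"]}')
```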

On a few occasions I've manually checked this dashboard after a group 1 deploy and created UBN tasks for any new error with a frequency greater than 1,000 in a 12hr period, or greater than 100 in the last 1hr impacting more than 1 IP address (or anonymized session identifier).

We're currently in a good state of code health. Going forward, we should maintain this bar by blocking or rolling back any newly introduced code that creates problems.

I've previously discussed this with @thcipriani, @brennen, and @Jrbranaa, and I'm keen to "codify" it in our deploy process. Practically speaking, this would mean halting the train and creating a new UBN task that blocks it whenever a bug with frequency > 100 per hour is encountered. This could be done by manually glancing at the dashboard or via some kind of alerting system (TBD).
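
To make the proposal concrete, here is a sketch of the blocking rule as stated above. The thresholds are the ones from this task; the `ErrorStats` structure and its field names are hypothetical.

```python
# Sketch of the proposed train-blocking rule, not an official implementation.
from dataclasses import dataclass

@dataclass
class ErrorStats:
    message: str
    count_12h: int             # occurrences in the last 12 hours
    count_1h: int              # occurrences in the last hour
    distinct_sessions_1h: int  # distinct IPs or anonymized session identifiers
    is_new: bool               # first seen after the current deploy

def should_block_train(e: ErrorStats) -> bool:
    """File a UBN task and halt the train when a new error crosses a threshold."""
    if not e.is_new:
        return False
    if e.count_12h > 1000:
        return True
    return e.count_1h > 100 and e.distinct_sessions_1h > 1
```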

In terms of when to roll back: this would depend on the response to a UBN, as rolling back changes can sometimes make things worse given our approach to caching. I'm looking into generating alerts to indicate when error volume is high enough that we might want to roll back the train; that threshold will be considerably higher and based on total errors.
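
The rollback threshold is explicitly TBD, but the shape of that check would be something like the sketch below, where `ROLLBACK_THRESHOLD_PER_HOUR` is a placeholder value rather than an agreed number:

```python
ROLLBACK_THRESHOLD_PER_HOUR = 5000  # placeholder; the real value is TBD

def should_consider_rollback(total_errors_last_hour: int) -> bool:
    """Flag when total error volume is high enough to discuss rolling back.

    Rolling back remains a human decision, since caching can make a
    rollback worse than a forward fix.
    """
    return total_errors_last_hour > ROLLBACK_THRESHOLD_PER_HOUR
```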

Let me know your thoughts!

Event Timeline

After discussion last week (or thereabouts), we updated the "Holding the train" docs that constitute official policy to reflect the "> 1000 in 12 hours" requirement, but deployment docs in general are kind of scattered, redundant, and in need of work (see T273802 for a placeholder on that one).

In general, I think this just requires:

  • More clarification in docs.
  • Making sure deployers are aware of this new practice and integrating it into the log triage process.
  • Publicizing the policy change to the technical community as a whole.

Also, we'd just been going off the "1k in 12 hours" part of the rule, so I guess things should be amended to include "> 100 in 1 hour".

> This could be done by manually glancing at the dashboard or via some kind of alerting system (TBD)

Also, yeah, the clearer a guideline we can get here, the more easily we can automate it. There are probably some conversations to be had with folks in Observability about this. Defining the "new error" part in such a way that it can be automated is probably the tricky bit.
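
One possible starting point for the "new error" definition: snapshot the set of normalized error messages seen before the deploy and treat anything outside that set as new. The sketch below is an assumption about how normalization might work (stripping URLs and numbers), not an agreed scheme:

```python
import re

def normalize(message: str) -> str:
    """Collapse variable parts so one underlying error maps to one key."""
    message = re.sub(r"https?://\S+", "<url>", message)  # strip URLs first
    message = re.sub(r"\d+", "<n>", message)             # then line/column numbers etc.
    return message.strip()

def new_errors(pre_deploy: list[str], post_deploy: list[str]) -> set[str]:
    """Messages seen after the deploy that never appeared before it."""
    baseline = {normalize(m) for m in pre_deploy}
    return {normalize(m) for m in post_deploy} - baseline
```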

I think codifying things like this makes sense. I'd also like to explore other potential attributes that could help establish severity. I've given the frequency attribute a name - Error Arrival Rate (stolen from Defect Arrival Rate) :-) These other attributes may prove a bit more difficult to identify and analyze, so we'll need to take that into consideration. In any case, we'll need to figure out how to perform said analysis with minimal impact on the train conductor.
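
For what it's worth, a minimal reading of Error Arrival Rate as "occurrences per hour since the error was first seen" might look like this (timestamps assumed to be UNIX epoch seconds):

```python
def error_arrival_rate(timestamps: list[float]) -> float:
    """Errors per hour, measured from first to last occurrence."""
    if len(timestamps) < 2:
        return 0.0
    span_hours = (max(timestamps) - min(timestamps)) / 3600
    return len(timestamps) / span_hours if span_hours else float("inf")
```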

After the train rolled out today, I got a notification of an error spike (https://grafana-rw.wikimedia.org/d/000000566/overview?tab=alert&editPanel=16&orgId=1); however, when I checked with @brennen, he hadn't seen it. Should this notification also go to releng@lists.wikimedia.org? (@colewhite can likely help if so.)

Is Release Engineering getting these alerts and using them for the 1hr following a train rollout? If so, can we resolve this? If not, what do we need to do, and is there anything I can do to help?

thcipriani claimed this task.

@thcipriani I am going to assume that resolving this means Release Engineering is using the alerts in https://grafana.wikimedia.org/d/000000566/overview?viewPanel=16&orgId=1.

Please let me know if that's not the case and you need help getting those into your team's workflow. I would be happy to help!

Noting here for completeness that @Jdlrobson and I set this up this week.