[Spike] Write reports about why Ext:ORES is helping cause server 500s and write tasks to fix
Closed, Resolved (Public)

Description

Points to cover:

  • What happened two weeks ago, when Ext:ORES exacerbated T179156.
  • Incident report for T181006: https://wikitech.wikimedia.org/wiki/Incident_documentation/20151216-ores
  • Emergency protocols for keeping critical pages such as Special:RecentChanges up even when Ext:ORES fails.
  • Ext:ORES and the data flow that makes it fragile: T181831
  • Ext:ORES and how to fail gracefully to the user while still blowing up the logs to get attention when *appropriate*: T181191
  • How to maintain latest production rollback SHA-1 even when using tin to deploy to multiple clusters.
  • Document protocol for watching both "client" (MW/Ext:ORES) and server-side errors during deployment.
  • Improvements to ORES beta testing, so we could have reproduced or foreseen this bug. T181187, T181168
  • Thoughts about how we might be able to canary all the Special pages on each language when making ORES changes that might affect all wikis: T181830
  • How to speed up deployment and rollback--currently takes 43 min to push a new version, and NN min to rollback: T181067, T181071

Deployment documentation can be found in https://wikitech.wikimedia.org/wiki/ORES/Deployment
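
The "fail gracefully to the user while still blowing up the logs" point (T181191) could look roughly like the sketch below. This is an illustration only, not the extension's actual code: the extension is a MediaWiki extension, Python is used here purely for brevity, and the names `scores_or_empty`, `render_recent_changes`, and `fetch_scores` are hypothetical.

```python
import logging
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("ores_client_sketch")


def scores_or_empty(fetch_scores: Callable[[], Dict[int, float]]) -> Dict[int, float]:
    """Return the scores from fetch_scores(), or {} if the scoring backend fails."""
    try:
        return fetch_scores()
    except Exception:
        # Log at error level with the traceback so the failure gets attention.
        logger.exception("Score lookup failed; rendering page without scores")
        # Degrade to "no scores" so the page itself still renders.
        return {}


def render_recent_changes(
    rev_ids: Iterable[int],
    fetch_scores: Callable[[], Dict[int, float]],
) -> List[Tuple[int, Optional[float]]]:
    """Pair each revision with its score, or None when no score is available."""
    scores = scores_or_empty(fetch_scores)
    return [(rev_id, scores.get(rev_id)) for rev_id in rev_ids]
```

The design choice is that the critical page treats scores as optional decoration: a scoring outage removes the scores from the page but never turns into a 500, while the error log still records every failure.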

Event Timeline

awight updated the task description.

RecentChanges is one of the core features that basically can't be unavailable. For features like this we need to have a revert mentality first (I fully admit to struggling with this too when it's one of my features), and then try to debug on one of the mwdebug servers, or figure out something else. The code seems to have changed a lot since my initial version :) I'll review it a bit too.

In T181168, I captured the error message we wanted to see in Beta. To get there, we needed to fix the configuration for ruwiki's beta install so that it would use the "goodfaith" model at all.

@Halfak: I might be wrong, but it looks like not all of the bullet points in the description here are addressed in the incident report. This task isn't just for the incident report; it takes a somewhat larger view.

@greg Good point—I'd like to keep this parent task active at least until the bullet points are represented by their own tasks.

awight renamed this task from "Write reports about why Ext:ORES is helping cause server 500s and alternatives to fix" to "[Spike] Write reports about why Ext:ORES is helping cause server 500s and write tasks to fix". Dec 1 2017, 3:53 PM
awight updated the task description.

Deployment documentation is updated; I struck through the corresponding task points to reflect this.

awight triaged this task as High priority.
awight updated the task description.

Closing this task as all open work exists as subtasks.