CX **was [[ https://gerrit.wikimedia.org/r/#/c/348951/1 | disabled ]], but is now back online**. It was disabled because it caused an outage on one of the database servers during the datacenter switch, affecting CX and other products.
Summary so far:
* This issue looks similar to the previous incident https://wikitech.wikimedia.org/wiki/Incident_documentation/20160713-ContentTranslation (after which we fixed bugs in the auto-save, added a ping-limiter, and Aaron improved the queries and locking)
* There were hundreds of blocked queries, which eventually brought the database down by exceeding the connection limit
* The load on the database wasn't very high; most of the queries were in the wait state
* The Language team changed the front-end to be much more conservative in the number of retries, the delay between them, and saving in general, to mitigate the symptoms in the future: https://gerrit.wikimedia.org/r/349214
* CX has been re-enabled.
* A likely root cause has been found: a bug in the frontend code that, for certain articles, made the save-draft request extra large by including unrelated content, combined with suboptimal autosave retry logic. Both have been fixed.
* **Incident report** (In progress): https://wikitech.wikimedia.org/wiki/Incident_documentation/20170419-ContentTranslation
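
For illustration only, the "more conservative retries" idea from the summary can be sketched as capped exponential backoff with a retry limit. This is not taken from the actual change in Gerrit; the names (`RetryPolicy`, `nextRetryDelay`) and all numbers below are hypothetical.

```typescript
// Hypothetical sketch: names and values are illustrative, not from the CX code base.
interface RetryPolicy {
  baseDelayMs: number; // delay before the first retry
  maxDelayMs: number;  // upper bound on any single delay
  maxRetries: number;  // stop retrying after this many failed attempts
}

// Returns the delay before retry number `attempt` (1-based),
// or null when the retry budget is exhausted.
function nextRetryDelay(attempt: number, policy: RetryPolicy): number | null {
  if (attempt > policy.maxRetries) {
    return null;
  }
  // Double the delay on every failed attempt, but never exceed the cap.
  const delay = policy.baseDelayMs * Math.pow(2, attempt - 1);
  return Math.min(delay, policy.maxDelayMs);
}

// Example policy (assumed values): 5 s base delay, 60 s cap, at most 4 retries.
const policy: RetryPolicy = { baseDelayMs: 5000, maxDelayMs: 60000, maxRetries: 4 };
const delays = [1, 2, 3, 4, 5].map((n) => nextRetryDelay(n, policy));
```

With a policy like this, a client whose saves keep failing backs off instead of adding more blocked queries, and eventually stops retrying until the next user-triggered save.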