CX **was [[ https://gerrit.wikimedia.org/r/#/c/348951/1 | disabled ]], but is now back online**. It was disabled because it caused an outage on one of the database servers, affecting CX and other products. The outage was triggered by the datacenter switch, but the root cause is not yet known.
Summary so far:
* This issue looks similar to the previous incident https://wikitech.wikimedia.org/wiki/Incident_documentation/20160713-ContentTranslation (after which we fixed bugs in the auto-save, added a ping-limiter, and Aaron improved the queries and locking)
* There were hundreds of blocked queries, which eventually brought the database down by exceeding the connection limit
* The load on the database was not very high; most of the queries were in the wait state
* To mitigate the symptoms in the future, the Language team changed the front-end to be much more conservative about saving in general, and about the number of retries and the delay between them: https://gerrit.wikimedia.org/r/349214
* The issue has been narrowed down to the blocked queries, but we do not yet fully understand the root cause and cannot reproduce the issue. The Language team is not very familiar with the nuances of database locks, so we are asking for help.
* CX has been re-enabled, but we are monitoring closely and ready to revert.
* **Incident report** (In progress): https://wikitech.wikimedia.org/wiki/Incident_documentation/20170419-ContentTranslation
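
For readers unfamiliar with the mitigation mentioned above (fewer retries, longer delays between them), here is a rough sketch of the general idea. It is not the code from the Gerrit change: the function names, parameters, and default values are all hypothetical, and it is written in Python for illustration rather than in the front-end's own language. It only shows one common way to limit retries, using capped exponential backoff.

```python
import random


def backoff_delays(max_retries=3, base_delay=5.0, max_delay=60.0,
                   jitter=0.0, rng=random.random):
    """Yield the wait in seconds before each retry.

    The delay doubles on every attempt and is capped at max_delay.
    All names and defaults here are illustrative assumptions.
    """
    for attempt in range(max_retries):
        delay = min(max_delay, base_delay * 2 ** attempt)
        # Optional jitter spreads clients out so they do not retry in lockstep.
        yield delay + delay * jitter * rng()


def save_with_retries(save, sleep, **backoff_options):
    """Try save() once, then retry a limited number of times.

    `save` is a hypothetical callable returning True on success.
    `sleep` is injected so the caller decides how waiting is done.
    Returns False once the retries are exhausted, instead of retrying forever.
    """
    if save():
        return True
    for delay in backoff_delays(**backoff_options):
        sleep(delay)
        if save():
            return True
    return False
```

The point of the cap on retries is that a client gives up instead of piling more requests onto a database that is already full of waiting queries.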