CX is [[ https://gerrit.wikimedia.org/r/#/c/348951/1 | currently disabled ]] because it caused an outage on one of the database servers, affecting CX and other products. The outage was triggered by the datacenter switch, but the root cause is not yet known.
Summary so far:
* This issue looks similar to the previous incident https://wikitech.wikimedia.org/wiki/Incident_documentation/20160713-ContentTranslation (after which we fixed bugs in the auto-save, added a ping-limiter, and Aaron improved the queries and locking)
* Hundreds of blocked queries accumulated and eventually brought the database down by exceeding the connection limit
* Load on the database was not very high; most of the queries were in the wait state
* The Language team changed the front-end to be much more conservative about saving in general, with fewer retries and longer delays between them, to mitigate the symptoms in the future
* The issue has been narrowed down to the blocked queries, but we do not yet fully understand the root cause and cannot reproduce the issue. The Language team is not very familiar with the nuances of database locks
* We are currently trying to determine the root cause and negotiating whether CX can be re-enabled before it is fully understood (i.e. whether the risk of this happening again is low enough given the front-end changes)
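
For illustration only, the kind of conservative retry behaviour described in the front-end bullet above could look roughly like the sketch below. This is not the actual CX code: it is written in Python rather than the front-end's JavaScript, and the names (`retry_delays`, `save_with_retries`) and all numbers (3 retries, 5 second base delay, 60 second cap) are hypothetical. The idea is simply that each client makes a small, bounded number of save attempts with growing pauses, so that a slow database is not hit with ever more requests.

```python
import time


def retry_delays(max_retries=3, base_delay=5.0, max_delay=60.0):
    """Return the pause (in seconds) before each retry.

    The pause doubles with every retry and is capped at max_delay.
    All default values are hypothetical, chosen only for illustration.
    """
    return [min(max_delay, base_delay * 2 ** attempt)
            for attempt in range(max_retries)]


def save_with_retries(save, max_retries=3, base_delay=5.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call save() once, then retry at most max_retries times.

    `save` is any callable that returns True on success. Returns True as
    soon as one attempt succeeds, or False once all attempts are used up,
    so a client never keeps hammering a struggling server indefinitely.
    """
    if save():
        return True
    for delay in retry_delays(max_retries, base_delay, max_delay):
        sleep(delay)  # back off before trying again
        if save():
            return True
    return False
```

With these assumed defaults, a save that keeps failing is attempted four times in total (once, then after 5, 10 and 20 seconds) and then given up, instead of being retried immediately and without limit.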