On datacenter switchover, existing latent problems surfaced, overloading the enwiki databases (T163351) (and, to a lesser extent, those of other wikis). This is the nth outage related to badly performing queries in MediaWiki core- the outages are so frequent that we no longer report each individual incident: ( https://wikitech.wikimedia.org/wiki/Incident_documentation/20170320-Special_AllPages ) The reasons were several:
- codfw, being depooled, is easier to keep up to date with the latest MediaWiki schema, so some schema changes on HEAD were already merged on codfw but not yet on eqiad (they are being merged there now). The current MediaWiki schema is failing *a lot* for enwiki and other large wikis. Yes, newer versions of HEAD are less stable than older ones, database-wise.
- eqiad had some servers with SSDs and very large amounts of memory, which mitigated the issue but did not completely solve it
- codfw is cold (buffers were mostly empty), something that cannot be guaranteed at any time on eqiad either- servers crash, there are emergencies, hardware changes, increases in requests, etc. It is ok for queries to be slower than usual in this situation, but not so slow that they never finish and overload all available hardware.
- After lots of problems showed up on eqiad, undocumented hacks and workarounds such as query rewrites and special analytics handling were put in place- those should not be needed, and are absent on codfw (and now, on eqiad, too)
That makes current codfw performance unbearable- the Performance team reported a 50% increase in average latency, and it has caused complaints mostly from API users whose queries fail to execute (but many different queries are affected; it just happens that API users are the most vocal).
Link: https://logstash.wikimedia.org/goto/7f4de766d19711bea1afc6eec6e0cf7c (the spike of translate-related queries can be ignored; it is out of scope here)
- Limit execution time for web requests in an effective way, so that the application is aware of queries that will never finish, kills them, and reacts appropriately under DoS or extreme load. For starters, I am going to work on T160984 for a stricter db-side query killer, but that is a (bad) hack; something should be considered at the HHVM or MediaWiki layer for the longer term
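A minimal sketch of one possible application-side approach, using MySQL 5.7's MAX_EXECUTION_TIME optimizer hint so the server itself aborts runaway SELECTs. The helper name and the 10-second default are illustrative, not part of any existing MediaWiki API:

```python
import re

# Hypothetical helper: inject MySQL 5.7+'s MAX_EXECUTION_TIME optimizer hint
# into web-request SELECTs, so the server kills them after `ms` milliseconds
# instead of relying solely on an external db-side query killer.
def with_query_timeout(sql: str, ms: int = 10000) -> str:
    if not sql.lstrip().upper().startswith("SELECT"):
        return sql  # the hint only applies to SELECT statements
    # Place the hint comment right after the SELECT keyword.
    return re.sub(r"(?i)^\s*SELECT",
                  f"SELECT /*+ MAX_EXECUTION_TIME({ms}) */",
                  sql, count=1)

print(with_query_timeout("SELECT rev_id FROM revision WHERE rev_page = 1"))
```

This only limits individual statements; reacting to sustained overload (shedding load, failing fast) would still have to live at the HHVM or MediaWiki layer.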
- Go over the list of top slow queries on codfw P5295 (again, ignore the translation and wikidata ones) and work around the ones currently on top so they can execute- even if a longer-term plan comes later. Assign them to the respective teams and work on them so the same issue doesn't happen again when we fail back to eqiad (where the latest revision structure changes have already almost finished applying)
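To rank the offenders, structurally identical queries need to be grouped regardless of their literal values, pt-query-digest style. A minimal sketch, assuming the queries have already been extracted from the slow log as plain strings:

```python
import re
from collections import Counter

# Normalise literals out of a query so that structurally identical
# queries share one fingerprint (a much simplified pt-query-digest).
def fingerprint(sql: str) -> str:
    sql = re.sub(r"'[^']*'", "?", sql)      # string literals -> ?
    sql = re.sub(r"\b\d+\b", "?", sql)      # numeric literals -> ?
    sql = re.sub(r"\s+", " ", sql).strip()  # collapse whitespace
    return sql.lower()

def top_offenders(queries, n=5):
    """Return the n most frequent query fingerprints with their counts."""
    return Counter(fingerprint(q) for q in queries).most_common(n)

sample = [
    "SELECT * FROM revision WHERE rev_page = 123",
    "SELECT * FROM revision WHERE rev_page = 456",
    "SELECT * FROM page WHERE page_title = 'Foo'",
]
print(top_offenders(sample))
```

Counting by frequency alone understates queries that are rare but very slow; weighting each fingerprint by total execution time would rank closer to actual database impact.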
- Review the revision table structure and indexes, and try to reduce them or whatever else is necessary- having 6 indexes makes it very hard for the MySQL query planner to choose the right one. This was not properly tested in an enwiki-sized environment (400GB revision table), and clearly doesn't scale to it.
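Until the index set is actually reduced, a stopgap for a query where the planner keeps picking badly is to pin the known-good index with MySQL's FORCE INDEX hint. A sketch; the helper and the index name below are illustrative, not the actual revision table schema:

```python
# Hypothetical stopgap: pin a known-good index on a table reference with
# FORCE INDEX, for cases where the MySQL planner picks the wrong one of
# the table's many candidate indexes. Works on the SQL string level only.
def force_index(sql: str, table: str, index: str) -> str:
    return sql.replace(f"FROM {table}",
                       f"FROM {table} FORCE INDEX ({index})", 1)

print(force_index(
    "SELECT rev_id FROM revision WHERE rev_page = 123 ORDER BY rev_timestamp",
    "revision", "page_timestamp"))
```

This is exactly the kind of undocumented query rewrite criticized above, so each use should be tracked and removed once the index cleanup lands.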
This should be done before the failback (not the long-term stuff, but quick measures to resolve the ongoing issues), or another outage will most likely happen.