On datacenter switchover, existing latent problems surfaced, overloading the enwiki databases (T163351) (and, to a lesser extent, those of other wikis). This is the nth outage related to badly performing queries in MediaWiki core- the outages are so frequent that we no longer report each individual incident: ( https://wikitech.wikimedia.org/wiki/Incident_documentation/20170320-Special_AllPages ) The reasons were several:
- codfw, being depooled, is easier to keep up to date with the latest MediaWiki schema, so some schema changes on HEAD were already merged on codfw but not yet on eqiad (they are being merged there now). The current MediaWiki schema is failing *a lot* for enwiki and other large wikis. Yes, newer versions of HEAD are less stable than older ones, database-wise.
- eqiad had some servers with SSDs and very large amounts of memory, which mitigated the issue but did not completely solve it
- codfw is cold (buffers were mostly empty), something that cannot be guaranteed at any time on eqiad either- servers crash, there are emergencies, hardware changes, increases in requests, etc. It is ok for queries to be slower than usual in this situation, but not so slow that they never finish and overload all available hardware.
- After lots of problems showed up on eqiad, undocumented hacks and workarounds such as query rewrites and special analytics handling were put in place- those should not be needed, and are absent on codfw (and now, on eqiad, too)
That makes current codfw performance unbearable- the Performance team reported a 50% increase in average latency, and it has caused complaints mostly from API users whose queries fail to execute (but many different queries are affected; it just happens that API users are the most vocal).
Link: https://logstash.wikimedia.org/goto/7f4de766d19711bea1afc6eec6e0cf7c (the spike of translate-related queries can be ignored; it is out of scope here)
- Limit execution time for web requests in an effective way, so that the application is aware of queries that will never finish, kills them, and reacts appropriately under DoS or extreme load. For starters, I am going to work on T160984 for a stricter db-side query killer, but that is a (bad) hack; something should be considered at the HHVM or MediaWiki layer for the longer term
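A minimal sketch of one possible application-side approach, using MySQL 5.7's MAX_EXECUTION_TIME optimizer hint so the server itself aborts runaway SELECTs. The helper name and the 10-second default are illustrative, not part of any existing MediaWiki API:

```python
import re

# Hypothetical helper: inject MySQL 5.7+'s MAX_EXECUTION_TIME optimizer hint
# into web-request SELECTs, so the server kills them after `ms` milliseconds
# instead of relying solely on an external db-side query killer.
def with_query_timeout(sql: str, ms: int = 10000) -> str:
    if not sql.lstrip().upper().startswith("SELECT"):
        return sql  # the hint only applies to SELECT statements
    # Place the hint comment right after the SELECT keyword.
    return re.sub(r"(?i)^\s*SELECT",
                  f"SELECT /*+ MAX_EXECUTION_TIME({ms}) */",
                  sql, count=1)

print(with_query_timeout("SELECT rev_id FROM revision WHERE rev_page = 1"))
```

This only limits individual statements; reacting to sustained overload (shedding load, failing fast) would still have to live at the HHVM or MediaWiki layer.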
- Go over the list of top slow queries on codfw P5295 (again, ignore the translation and wikidata ones) and work around the ones currently on top so they can execute- even if a longer-term plan comes later. Assign them to the respective teams and work on them so the same issue doesn't happen again when we fail back to eqiad (where the latest revision structure changes have already almost finished applying)
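To rank the offenders, structurally identical queries need to be grouped regardless of their literal values, pt-query-digest style. A minimal sketch, assuming the queries have already been extracted from the slow log as plain strings:

```python
import re
from collections import Counter

# Normalise literals out of a query so that structurally identical
# queries share one fingerprint (a much simplified pt-query-digest).
def fingerprint(sql: str) -> str:
    sql = re.sub(r"'[^']*'", "?", sql)      # string literals -> ?
    sql = re.sub(r"\b\d+\b", "?", sql)      # numeric literals -> ?
    sql = re.sub(r"\s+", " ", sql).strip()  # collapse whitespace
    return sql.lower()

def top_offenders(queries, n=5):
    """Return the n most frequent query fingerprints with their counts."""
    return Counter(fingerprint(q) for q in queries).most_common(n)

sample = [
    "SELECT * FROM revision WHERE rev_page = 123",
    "SELECT * FROM revision WHERE rev_page = 456",
    "SELECT * FROM page WHERE page_title = 'Foo'",
]
print(top_offenders(sample))
```

Counting by frequency alone understates queries that are rare but very slow; weighting each fingerprint by total execution time would rank closer to actual database impact.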
- Review the revision table structure and indexes, and try to reduce them or whatever else is necessary- having 6 indexes makes it very hard for the MySQL query planner to choose the right one. This was not properly tested in an enwiki-sized environment (400GB revision table), and clearly doesn't scale to it.
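Until the index set is actually reduced, a stopgap for a query where the planner keeps picking badly is to pin the known-good index with MySQL's FORCE INDEX hint. A sketch; the helper and the index name below are illustrative, not the actual revision table schema:

```python
# Hypothetical stopgap: pin a known-good index on a table reference with
# FORCE INDEX, for cases where the MySQL planner picks the wrong one of
# the table's many candidate indexes. Works on the SQL string level only.
def force_index(sql: str, table: str, index: str) -> str:
    return sql.replace(f"FROM {table}",
                       f"FROM {table} FORCE INDEX ({index})", 1)

print(force_index(
    "SELECT rev_id FROM revision WHERE rev_page = 123 ORDER BY rev_timestamp",
    "revision", "page_timestamp"))
```

This is exactly the kind of undocumented query rewrite criticized above, so each use should be tracked and removed once the index cleanup lands.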
This should be done before the failback (not the long-term stuff, but quick measures to resolve the ongoing issues), or another outage will most likely happen.