
Mediawiki revision-related queries are failing with high rate for enwiki on codfw
Closed, Resolved (Public)

Description

During the datacenter switchover, existing latent problems surfaced and caused an overload (T163351) of the enwiki databases (and, to a lesser extent, other wikis). This is the nth outage related to badly performing queries that are part of mediawiki core; the outages are so frequent that we no longer report each individual incident ( https://wikitech.wikimedia.org/wiki/Incident_documentation/20170320-Special_AllPages ). The reasons for these were several:

  • codfw, being depooled, is easier to keep up to date with the latest mediawiki structure, so some schema changes on HEAD were already merged on codfw but not yet on eqiad (they are being merged there now). The current mediawiki schema is failing *a lot* for enwiki and other large wikis. Yes, newer versions of HEAD are less stable than older ones, database-wise.
  • eqiad had some servers with SSDs and very large amounts of memory, which mitigate, but do not completely solve, the issue
  • codfw is cold (buffers were mostly empty), something that cannot be guaranteed at any time on eqiad either, because servers crash, there are emergencies, hardware changes, increases in requests, etc. It is ok for queries to be slower than usual in this situation, but not so slow that they cannot run and overload all available hardware.
  • After lots of problems showed up on eqiad, undocumented hacks and workarounds (like query rewrites and special analytics) were put in place; those should not be needed, and are absent on codfw (and now on eqiad, too)

That currently makes codfw performance unbearable: the Performance team reported roughly a 50% increase in average latency, and it has caused complaints mostly from API users whose queries fail to execute (many different queries are affected; it just happens that API users are the most vocal).

Link: https://logstash.wikimedia.org/goto/7f4de766d19711bea1afc6eec6e0cf7c (the spike of translate-related queries can be ignored; it is not in scope)

failover.png (attached image, 14 KB)

Proposals:

  • Limit execution time for web requests in an effective way, so that the application is aware of queries that will never finish, kills them, and reacts appropriately to DoS or extreme load. For starters, I am going to work on T160984 for a stricter db-side query killer, but that is a (bad) hack; something should be considered at the HHVM or mediawiki layer for the longer term (see the sketch after this list)
  • Go over the list of the top slow queries on codfw P5295 (again, ignore the translation and wikidata ones) and work around the ones at the top now, so they can be executed, even if there is a more long-term plan later. Assign them to the respective teams and work on them so the same issue doesn't happen again when things fail back to eqiad (where the latest revision structure changes have already almost finished applying)
  • Review the revision table structure and indexes, and try to reduce them where necessary: having 6 indexes makes it very hard for the mysql query planner to choose the right one. This was not properly tested on an enwiki-sized environment (a 400GB revision table), and clearly doesn't scale for it.
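
As an illustration of the db-side limit meant in the first proposal, here is a minimal sketch, assuming a MariaDB version (10.1+) that supports per-statement timeouts; the hosts in this incident run 10.0 (see the version dump further down), so this only shows the mechanism and is not deployable as-is, and the SELECT is just a placeholder:

-- Hypothetical cap of 60 seconds on statements run from a web-request session
-- (MariaDB 10.1+); statements exceeding the limit are aborted with an error
-- instead of piling up and overloading the server.
SET SESSION max_statement_time = 60;

-- Alternatively, a single query can be capped without changing the session:
SET STATEMENT max_statement_time = 60 FOR
  SELECT rev_id FROM revision WHERE rev_page = 138161
  ORDER BY rev_timestamp DESC LIMIT 26;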

This should be done before the failback (not the long-term work, but quick measures to solve the ongoing issues), or another outage will most likely happen.

Event Timeline

SELECT /* ApiQueryRevisions::run */ rev_id, rev_page, rev_text_id, rev_timestamp, rev_comment, rev_user_text, rev_user, rev_minor_edit, rev_deleted, rev_len, rev_parent_id, rev_sha1, rev_content_format, rev_content_model, page_namespace, page_title, page_id, page_latest, page_is_redirect, page_len, old_id, old_text, old_flags FROM `revision` INNER JOIN `page` ON ((page_id = rev_page)) INNER JOIN `text` ON ((rev_text_id=old_id)) WHERE (rev_id<='35367939') AND rev_page = '138161' ORDER BY rev_timestamp DESC, rev_id DESC LIMIT 26 /* 699b8c7a81e8de411afbc1ddf6042bba db2059 dewiki 18s */

This is a bug: it pages on rev_id but sorts by rev_timestamp. It uses the rev_page_id index, which I think shouldn't even exist.
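
A minimal sketch of what a consistent version could look like, assuming the continue parameter carried a (rev_timestamp, rev_id) pair instead of a bare rev_id, so the paging condition matches the ORDER BY and the page_timestamp index can serve both (the literal timestamp is a placeholder and the column list is trimmed):

SELECT /* ApiQueryRevisions::run (sketch) */ rev_id, rev_timestamp
FROM revision
WHERE rev_page = 138161
  AND (rev_timestamp < '20060101000000'
       OR (rev_timestamp = '20060101000000' AND rev_id <= 35367939))
ORDER BY rev_timestamp DESC, rev_id DESC
LIMIT 26;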

SELECT /* ApiQueryContributors::execute */ rev_page AS `page`, COUNT(DISTINCT rev_user_text) AS `anons` FROM `revision` WHERE rev_page = '100796' AND (rev_user = 0) AND ((rev_deleted & 4) = 0) GROUP BY rev_page /* 0d63de509f37f53cc2d9a07e800a21cb db2062 enwiki 13s */

This was slow when I tested it the first time (6s) and fast after that (0.02s), suggesting the necessary data was not in memory. It scans 3384 rows, and this is done unconditionally for every request, while logged-in user names are queried with limit/continue. So using the continue parameter to page through all results in batches of 50 will scan some 230k rows (roughly 68 continuation requests at 3384 rows each).

Options for rectification:

  • Maintain per-page revision statistics, update anon count on save
  • Remove the separate anon count query, remove rev_user != 0 condition in the main query, have PHP count anons in the result set. Require clients to add together paged anon counts.
  • Scale back the feature. This module is presumably for OCG, which presents text like "Contributors: The Rambling Man, Yoninah, [...] and Anonymous: 21". Instead it could say "and anonymous contributors", then we could limit the query to a single row. Redefine anoncontributors to return 0 for no anons or 1 if there were anons, instead of the actual count.
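
A minimal sketch of what the query for the last option could look like, reusing the column names and values from the slow query above; since only "were there any anons?" is needed, an existence check lets the server stop at the first matching row:

-- Returns 1 if the page has at least one anonymous, non-deleted revision,
-- 0 otherwise; at most one index range needs to be touched per page.
SELECT EXISTS (
  SELECT 1 FROM revision
  WHERE rev_page = 100796
    AND rev_user = 0
    AND (rev_deleted & 4) = 0
) AS anons;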

The code notes:

/** We don't want to process too many pages at once (it hits cold
 * database pages too heavily), so only do the first MAX_PAGES input pages
 * in each API call (leaving the rest for continuation).
 */
const MAX_PAGES = 100;

MAX_PAGES is probably too high if a single page takes 13s.

Hey, thanks. Can you and @Anomie help me do a sprint and commit as many fixes as possible before the failback? For example, for

SELECT rev_user_text FROM revision WHERE rev_page = '482256' ORDER BY rev_timestamp DESC LIMIT 100

it gets confused between (rev_page, rev_user, rev_timestamp) and (rev_page, rev_timestamp) because of how many rows there are and how similar the indexes are. I can try running ANALYZE (but it takes 3 days to run per server on that gigantic table), so FORCE INDEX may be needed.
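
For reference, this is the kind of FORCE INDEX variant being considered here (page_timestamp is the (rev_page, rev_timestamp) index from the schema dumps below); whether forcing it is right for every page would still need to be checked with EXPLAIN:

SELECT rev_user_text
FROM revision FORCE INDEX (page_timestamp)
WHERE rev_page = '482256'
ORDER BY rev_timestamp DESC
LIMIT 100;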

Moving this comment to the subticket.

and this is done unconditionally for every request

That's easy enough to fix.

  • Maintain per-page revision statistics, update anon count on save

That could be useful. Look at the stuff in the web UI's action=info and the API's prop=info for more things that could be stored in there.

  • Remove the separate anon count query, remove rev_user != 0 condition in the main query, have PHP count anons in the result set. Require clients to add together paged anon counts.

Lists of every IP address that has ever edited the page would probably not be hugely useful. It'd also take some rethinking of the query used, since it currently tries to use a prefix of the page_user_timestamp index.

  • Scale back the feature. This module is presumably for OCG, which presents text like "Contributors: The Rambling Man, Yoninah, [...] and Anonymous: 21". Instead it could say "and anonymous contributors", then we could limit the query to a single row. Redefine anoncontributors to return 0 for no anons or 1 if there were anons, instead of the actual count.

Yes, it was created to support OCG. A boolean would be better in that case. It would need someone to change OCG; I have no idea if it rides a train or what.

MAX_PAGES is probably too high if a single page takes 13s.

Suggestions?

Change 349479 had a related patch set uploaded (by Anomie):
[mediawiki/core@master] ApiQueryContributors: Improve query behavior

https://gerrit.wikimedia.org/r/349479

@jcrespo Any next steps we can help with? https://logstash.wikimedia.org/goto/f06abf4c683a9ebe13ca347bfed2c432 still shows a lot, but not sure if there are any left we suspect are related to codfw in particular (and/or the schema differences from eqiad).

ApiQueryRevisions::run should be the next target. It fails at a higher rate than on eqiad, but I am not sure that is in scope for this particular ticket, because it already failed a lot on eqiad (just spikier, and at roughly double the rate, on codfw).

I would ask you or Anomie to take a quick look at https://logstash.wikimedia.org/goto/c90cfb80e16b8ce55e678d4a1ab368d8 (ApiQueryRevisions::run errors doubled after the failover) to try to see a pattern (is it just fewer resources?). Identifying the root issue is the priority, not fixing it, as we will have less information after Wednesday (when we are back on eqiad).

In summary:

  • Identify why ApiQueryRevisions::run is failing more frequently on codfw (if possible)
  • Then close this ticket, as with this and the subtickets the initial goal will have been fulfilled
  • Continue fixing API queries on separate tickets, with less priority

For the record, LocalRepo::FindFiles failed heavily around the failover times, but not anymore ( https://logstash.wikimedia.org/goto/51e78fc3a2cff810e92f5edaf0f48a26 ). However, based on the timestamps, it was unrelated to the switchover and probably release-related.

I would ask you or Anomie to take a quick look at https://logstash.wikimedia.org/goto/c90cfb80e16b8ce55e678d4a1ab368d8 (ApiQueryRevisions::run errors doubled after the failover) to try to see a pattern (is it just fewer resources?). Identifying the root issue is the priority, not fixing it, as we will have less information after Wednesday (when we are back on eqiad).

Spot-checking some of the failing queries with EXPLAIN on db2071, they seem to be making the usual poor index choices, mostly rev_timestamp where page_timestamp would make more sense.

Many are the same on db1072, but I found some that are different. For example,

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len FROM `revision` INNER JOIN `page` ON ((page_id = rev_page)) WHERE rev_page = '438960' ORDER BY rev_timestamp DESC,rev_id DESC LIMIT 2 ;

is currently picking the rev_timestamp index in EXPLAIN on db2071, versus page_timestamp on db1072.
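
For anyone reproducing the comparison, it amounts to running the same EXPLAIN on both hosts and looking at the "key" column (the SELECT list is trimmed here for brevity):

-- Run on both db1072.eqiad.wmnet and db2071.codfw.wmnet; "key" shows which
-- index the optimizer picked (page_timestamp vs rev_timestamp in this case).
EXPLAIN SELECT rev_id, rev_timestamp
FROM revision INNER JOIN page ON (page_id = rev_page)
WHERE rev_page = '438960'
ORDER BY rev_timestamp DESC, rev_id DESC
LIMIT 2;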

I am not making a point here, just dumping the information I just got, to try to see a pattern:

root@neodymium:~$ mysql -h db1072.eqiad.wmnet -e "SELECT version()"
+-----------------+
| version()       |
+-----------------+
| 10.0.29-MariaDB |
+-----------------+
root@neodymium:~$ mysql -h db2071.codfw.wmnet -e "SELECT version()"           
+-----------------+
| version()       |
+-----------------+
| 10.0.30-MariaDB |
+-----------------+
root@neodymium:~$ mysql -h db1072.eqiad.wmnet enwiki -e "SHOW CREATE TABLE revision\G SHOW CREATE TABLE page\G"
*************************** 1. row ***************************
       Table: revision
Create Table: CREATE TABLE `revision` (
  `rev_id` int(8) unsigned NOT NULL AUTO_INCREMENT,
  `rev_page` int(8) unsigned NOT NULL DEFAULT '0',
  `rev_text_id` int(8) unsigned NOT NULL DEFAULT '0',
  `rev_comment` varbinary(255) DEFAULT NULL,
  `rev_user` int(5) unsigned NOT NULL DEFAULT '0',
  `rev_user_text` varbinary(255) NOT NULL DEFAULT '',
  `rev_timestamp` varbinary(14) NOT NULL DEFAULT '',
  `rev_minor_edit` tinyint(1) unsigned NOT NULL DEFAULT '0',
  `rev_deleted` tinyint(1) unsigned NOT NULL DEFAULT '0',
  `rev_len` int(8) unsigned DEFAULT NULL,
  `rev_parent_id` int(8) unsigned DEFAULT NULL,
  `rev_sha1` varbinary(32) NOT NULL DEFAULT '',
  `rev_content_model` varbinary(32) DEFAULT NULL,
  `rev_content_format` varbinary(64) DEFAULT NULL,
  PRIMARY KEY (`rev_id`),
  KEY `rev_timestamp` (`rev_timestamp`),
  KEY `page_timestamp` (`rev_page`,`rev_timestamp`),
  KEY `user_timestamp` (`rev_user`,`rev_timestamp`),
  KEY `usertext_timestamp` (`rev_user_text`,`rev_timestamp`),
  KEY `page_user_timestamp` (`rev_page`,`rev_user`,`rev_timestamp`),
  KEY `rev_page_id` (`rev_page`,`rev_id`)
) ENGINE=InnoDB AUTO_INCREMENT=778193443 DEFAULT CHARSET=binary
*************************** 1. row ***************************
       Table: page
Create Table: CREATE TABLE `page` (
  `page_id` int(8) unsigned NOT NULL AUTO_INCREMENT,
  `page_namespace` int(11) NOT NULL DEFAULT '0',
  `page_title` varbinary(255) NOT NULL DEFAULT '',
  `page_restrictions` tinyblob NOT NULL,
  `page_counter` bigint(20) unsigned NOT NULL DEFAULT '0',
  `page_is_redirect` tinyint(1) unsigned NOT NULL DEFAULT '0',
  `page_is_new` tinyint(1) unsigned NOT NULL DEFAULT '0',
  `page_random` double unsigned NOT NULL DEFAULT '0',
  `page_touched` varbinary(14) NOT NULL DEFAULT '',
  `page_links_updated` varbinary(14) DEFAULT NULL,
  `page_latest` int(8) unsigned NOT NULL DEFAULT '0',
  `page_len` int(8) unsigned NOT NULL DEFAULT '0',
  `page_content_model` varbinary(32) DEFAULT NULL,
  `page_lang` varbinary(35) DEFAULT NULL,
  PRIMARY KEY (`page_id`),
  UNIQUE KEY `name_title` (`page_namespace`,`page_title`),
  KEY `page_random` (`page_random`),
  KEY `page_len` (`page_len`),
  KEY `page_redirect_namespace_len` (`page_is_redirect`,`page_namespace`,`page_len`)
) ENGINE=InnoDB AUTO_INCREMENT=53931678 DEFAULT CHARSET=binary
root@neodymium:~$ mysql -h db2071.codfw.wmnet enwiki -e "SHOW CREATE TABLE revision\G SHOW CREATE TABLE page\G"
*************************** 1. row ***************************
       Table: revision
Create Table: CREATE TABLE `revision` (
  `rev_id` int(8) unsigned NOT NULL AUTO_INCREMENT,
  `rev_page` int(8) unsigned NOT NULL DEFAULT '0',
  `rev_text_id` int(8) unsigned NOT NULL DEFAULT '0',
  `rev_comment` varbinary(255) DEFAULT NULL,
  `rev_user` int(5) unsigned NOT NULL DEFAULT '0',
  `rev_user_text` varbinary(255) NOT NULL DEFAULT '',
  `rev_timestamp` varbinary(14) NOT NULL DEFAULT '',
  `rev_minor_edit` tinyint(1) unsigned NOT NULL DEFAULT '0',
  `rev_deleted` tinyint(1) unsigned NOT NULL DEFAULT '0',
  `rev_len` int(8) unsigned DEFAULT NULL,
  `rev_parent_id` int(8) unsigned DEFAULT NULL,
  `rev_sha1` varbinary(32) NOT NULL DEFAULT '',
  `rev_content_model` varbinary(32) DEFAULT NULL,
  `rev_content_format` varbinary(64) DEFAULT NULL,
  PRIMARY KEY (`rev_id`),
  KEY `rev_timestamp` (`rev_timestamp`),
  KEY `page_timestamp` (`rev_page`,`rev_timestamp`),
  KEY `user_timestamp` (`rev_user`,`rev_timestamp`),
  KEY `usertext_timestamp` (`rev_user_text`,`rev_timestamp`),
  KEY `page_user_timestamp` (`rev_page`,`rev_user`,`rev_timestamp`),
  KEY `rev_page_id` (`rev_page`,`rev_id`)
) ENGINE=InnoDB AUTO_INCREMENT=778193525 DEFAULT CHARSET=binary
*************************** 1. row ***************************
       Table: page
Create Table: CREATE TABLE `page` (
  `page_id` int(8) unsigned NOT NULL AUTO_INCREMENT,
  `page_namespace` int(11) NOT NULL DEFAULT '0',
  `page_title` varbinary(255) NOT NULL DEFAULT '',
  `page_restrictions` tinyblob NOT NULL,
  `page_counter` bigint(20) unsigned NOT NULL DEFAULT '0',
  `page_is_redirect` tinyint(1) unsigned NOT NULL DEFAULT '0',
  `page_is_new` tinyint(1) unsigned NOT NULL DEFAULT '0',
  `page_random` double unsigned NOT NULL DEFAULT '0',
  `page_touched` varbinary(14) NOT NULL DEFAULT '',
  `page_links_updated` varbinary(14) DEFAULT NULL,
  `page_latest` int(8) unsigned NOT NULL DEFAULT '0',
  `page_len` int(8) unsigned NOT NULL DEFAULT '0',
  `page_content_model` varbinary(32) DEFAULT NULL,
  `page_lang` varbinary(35) DEFAULT NULL,
  PRIMARY KEY (`page_id`),
  UNIQUE KEY `name_title` (`page_namespace`,`page_title`),
  KEY `page_random` (`page_random`),
  KEY `page_len` (`page_len`),
  KEY `page_redirect_namespace_len` (`page_is_redirect`,`page_namespace`,`page_len`)
) ENGINE=InnoDB AUTO_INCREMENT=53931682 DEFAULT CHARSET=binary
root@neodymium:~$ mysql -h db1072.eqiad.wmnet -e "SHOW GLOBAL STATUS like 'uptime'"
+---------------+---------+
| Variable_name | Value   |
+---------------+---------+
| Uptime        | 7258742 |
+---------------+---------+
root@neodymium:~$ mysql -h db2071.codfw.wmnet -e "SHOW GLOBAL STATUS like 'uptime'"
+---------------+--------+
| Variable_name | Value  |
+---------------+--------+
| Uptime        | 978724 |
+---------------+--------+

I would like to create a separate ticket to continue investigating Anomie's findings and close this as resolved.

I would like to create a separate ticket to continue investigating Anomie's findings and close this as resolved.

@Anomie: did that ever happen? This task has sat in Wikimedia-Incident's "active situation" column for a long time :)

Not that I know of. @jcrespo was the one who said he wanted to do things; he'd know better whether he did so.

I wouldn't know what more to investigate, anyway. It seems to need deep database administration things that he knows and I don't.

Let's close this. By the time another failover happens, things will be so different that it will not be worth spending time on this, while it will be worth spending time on something else.

Things have been done since then: API servers were renewed, the data structure was unified, and watchdogs were enforced, so while issues are likely to reappear, they will not be the same ones again.

ack, thanks both

(btw, I pinged you @Anomie because of the request from DBAs in the last SoS to try to shield them from non-emergency things while they try to put out all the other fires first :))

@Anomie: your patch is still unmerged (https://gerrit.wikimedia.org/r/#/c/349479/); should that be merged before resolving?

I don't know. If you decide it should, you should try to find someone willing to review it.

Closing this based on Jaime's last comments.

Change 349479 abandoned by Umherirrender:

[mediawiki/core@master] ApiQueryContributors: Improve query behavior

Reason:

Old outdated patch set

https://gerrit.wikimedia.org/r/349479