Page MenuHomePhabricator

SpecialRecentChangesLinked::doMainQuery blocking database infrastructure
Closed, DuplicatePublicPRODUCTION ERROR

Description

Msnbot hit yesterday night between 10:07 UTC and performed >500 requests like:

(SELECT /* SpecialRecentChangesLinked::doMainQuery <IP> */ rc_id, rc_timestamp, rc_user, rc_user_text, rc_namespace, rc_title, rc_comment, rc_minor, rc_bot, rc_new, rc_cur_id, rc_this_oldid, rc_last_oldid, rc_type, rc_source, rc_patrolled, rc_ip, rc_old_len, rc_new_len, rc_deleted, rc_logid, rc_log_type, rc_log_action, rc_params, (SELECT GROUP_CONCAT(ct_tag SEPARATOR ', ') FROM `change_tag` WHERE ct_rc_id=rc_id ) AS `ts_tags` FROM `recentchanges` INNER JOIN `pagelinks` ON ((rc_namespace = pl_namespace) AND (rc_title = pl_title)) WHERE rc_bot = '0' AND (rc_type != '6') AND (rc_timestamp >= '20160503000000') AND (rc_source != 'wb') AND pl_from = '1279706' ORDER BY rc_timestamp DESC LIMIT 50 ) UNION (SELECT rc_id, rc_timestamp, rc_user, rc_user_text, rc_namespace, rc_title, rc_comment, rc_minor, rc_bot, rc_new, rc_cur_id, rc_this_oldid, rc_last_oldid, rc_type, rc_source, rc_patrolled, rc_ip, rc_old_len, rc_new_len, rc_deleted, rc_logid, rc_log_type, rc_log_action, rc_params, (SELECT GROUP_CONCAT(ct_tag SEPARATOR ', ') FROM `change_tag` WHERE ct_rc_id=rc_id ) AS `ts_tags` FROM `recentchanges` INNER JOIN `templatelinks` ON ((rc_namespace = tl_namespace) AND (rc_title = tl_title)) WHERE rc_bot = '0' AND (rc_type != '6') AND (rc_timestamp >= '20160503000000') AND (rc_source != 'wb') AND tl_from = '1279706' ORDER BY rc_timestamp DESC LIMIT 50 ) UNION (SELECT rc_id, rc_timestamp, rc_user, rc_user_text, rc_namespace, rc_title, rc_comment, rc_minor, rc_bot, rc_new, rc_cur_id, rc_this_oldid, rc_last_oldid, rc_type, rc_source, rc_patrolled, rc_ip, rc_old_len, rc_new_len, rc_deleted, rc_logid, rc_log_type, rc_log_action, rc_params, (SELECT GROUP_CONCAT(ct_tag SEPARATOR ', ') FROM `change_tag` WHERE ct_rc_id=rc_id ) AS `ts_tags` FROM `recentchanges` INNER JOIN `imagelinks` ON ((rc_title = il_to)) WHERE rc_bot = '0' AND (rc_type != '6') AND (rc_timestamp >= '20160503000000') AND (rc_source != 'wb') AND il_from = '1279706' AND rc_namespace = '6' ORDER BY rc_timestamp DESC LIMIT 50 ) ORDER BY rc_timestamp DESC LIMIT 50

that took between 20 and 40 seconds, exhausting the number of connections available, creating lag and 5000-10000 failed requests:

Screenshot from 2016-05-11 10:09:06.png (760×2 px, 39 KB)

https://logstash.wikimedia.org/#dashboard/temp/AVSfBVO70z-7ykXOX4eb

I am only 99% sure this is the cause, there were other issues happening at the same time, but I think much less impacting: ApiQueryContributions::execute, which I believe it is already a known, but minor issue.

I am pinging releng because I believe this could be related to recent code deploy and potentially cause worse issues if more widely deployed, so I am asking to:

a) pause the train to analyze if new code could be responsible of that (I want to remember that RecentChangesLinked is a new functionality, with some issues in the past)
b) know who in mediawiki is responsible for it and assign the appropriate tags (this is of course not an issue for you to directly manage)
c) Check if robots.txt is correct, and bounce back to operations if needed

All queries have hit db1034.eqiad.wmnet 10.64.16.23 (example wikis: eswiki kowiki hewiki fawiki ).

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

I have made this task a blocker of T134249: MW-1.28.0-wmf.1 deployment blockers

We had 1.27.0-wmf.23 rollbacked last week and it has been put back in place on Monday May 9th at 21:56 UTC

2016-05-09 21:56	<demon@tin>	rebuilt wikiversions.php and synchronized wikiversions files: pedias back to wmf.23

From logstash an example URL is https://es.wikipedia.org/wiki/Especial:CambiosEnEnlazadas/Monedas_de_euro_de_Polonia

Some more events a bit earlier https://logstash.wikimedia.org/#dashboard/temp/AVSfHoS10z-7ykXOYPFH

On cawiki ApiQueryContributions::execute the connection is lost at 22:05:30.

From 22:06:20 there is a burst of Error connecting to 10.64.16.23: Can't connect to MySQL server on '10.64.16.23' which are triggered by the various async jobs we run. Last is at 22:08:01.

This task is old (relatively, given it was marked as blocking 1.28.0-wmf1), untriaged, and open.

What are the next steps here?

Sounds like a job for PoolCounter to me.

Special:RecentChangesLinked has always been slow in the worst case.

Sounds like a job for PoolCounter to me.

Special:RecentChangesLinked has always been slow in the worst case.

@Bawolff : Ideas on who to own this? It sounds performance-y, but so do a lot of things.

Sounds like a job for PoolCounter to me.

Special:RecentChangesLinked has always been slow in the worst case.

@Bawolff : Ideas on who to own this? It sounds performance-y, but so do a lot of things.

Any others?

Gerrit-bot doesn't add comment when I make a patch (T161525)

Change 348378 merged by jenkins-bot:
[mediawiki/extensions/ORES@master] Disable ORES in Recentchangeslinked

https://gerrit.wikimedia.org/r/348378

Ladsgroup removed a project: User-Ladsgroup.

My patch that got merged helps but the underlying issue is still there and fixing it is way beyond my knowledge or time. Maybe the MediaWiki-Platform-Team can take a look. Once this is resolved, probably we can enable ores on this special page again.

(Feel free to remove the project, it's just a suggestion)

Change 349271 merged by jenkins-bot:
[mediawiki/extensions/ORES@wmf/1.29.0-wmf.20] Disable ORES in Recentchangeslinked

https://gerrit.wikimedia.org/r/349271

Btw, it seems the root cause was two things:

  • straight_join preventing efficient queries in case the associated page has a few links (e.g. not the giant category use case)
  • the query killer not recognizing union queries.

The straight_join was recently removed from the watchlist case, and i believe the query killer was fixed. If straight_join gets killed here to, the page will probably be just as efficient as the pre-ORES case.

edit: i thought i was commenting on a different bug. Leaving my comment as it half makes sense for this bug.

Still seen at https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-errors and https://logstash.wikimedia.org/app/kibana#/dashboard/hhvm under +SpecialRecentChangesLinked.

Sample from mediawik-errors
channel: DBPerformance
wiki: enwiki
url: /wiki/Special:RecentChangesLinked/Category:Wikipedia_articles_with_LCCN_identifiers

message:
Expectation (readQueryTime <= 5) by MediaWiki::main not met (actual: 5.36559009552):
query: SELECT rc_id,rc_timestamp,rc_namespace,rc_title,rc_minor,rc_bot,rc_new,rc_cur_id,rc_this_oldid,rc_last_oldid,rc_type,rc_source,rc_patrolled,rc_ip,rc_old_len,rc_new_len,rc_deleted,rc_logid,rc_log_type,rc_log_action,rc_params,COALESCE( comment_rc_comment.co [TRX#4f86f2]
#0 /srv/mediawiki/php-1.32.0-wmf.14/includes/libs/rdbms/TransactionProfiler.php(223): Wikimedia\Rdbms\TransactionProfiler->reportExpectationViolated()
#1 /srv/mediawiki/php-1.32.0-wmf.14/includes/libs/rdbms/database/Database.php(1243): Wikimedia\Rdbms\TransactionProfiler->recordQueryCompletion()
#2 /srv/mediawiki/php-1.32.0-wmf.14/includes/libs/rdbms/database/Database.php(1143): Wikimedia\Rdbms\Database->doProfiledQuery()
#3 /srv/mediawiki/php-1.32.0-wmf.14/includes/specials/SpecialRecentchangeslinked.php(231): Wikimedia\Rdbms\Database->query()
#4 /srv/mediawiki/php-1.32.0-wmf.14/includes/specialpage/ChangesListSpecialPage.php(944): SpecialRecentChangesLinked->doMainQuery()
#5 /srv/mediawiki/php-1.32.0-wmf.14/includes/specialpage/ChangesListSpecialPage.php(633): ChangesListSpecialPage->getRows()
#6 /srv/mediawiki/php-1.32.0-wmf.14/includes/specials/SpecialRecentchanges.php(167): ChangesListSpecialPage->execute()
#7 /srv/mediawiki/php-1.32.0-wmf.14/includes/specialpage/SpecialPage.php(566): SpecialRecentChanges->execute()
#8 /srv/mediawiki/php-1.32.0-wmf.14/includes/specialpage/SpecialPageFactory.php(569): SpecialPage->run()
#9 /srv/mediawiki/php-1.32.0-wmf.14/includes/MediaWiki.php(288): SpecialPageFactory::executePath()
#10 /srv/mediawiki/php-1.32.0-wmf.14/includes/MediaWiki.php(867): MediaWiki->performRequest()
#11 /srv/mediawiki/php-1.32.0-wmf.14/includes/MediaWiki.php(524): MediaWiki->main()
#12 /srv/mediawiki/php-1.32.0-wmf.14/index.php(42): MediaWiki->run()
#13 /srv/mediawiki/w/index.php(3): include()
#14 {main}
Sample from hhvm
SlowTimer [10088ms] at runtime/ext_mysql: slow query: SELECT /* SpecialRecentChangesLinked::doMainQuery <CUT:IP> */  rc_id,rc_timestamp,rc_namespace,rc_title,rc_minor,rc_bot,rc_new,rc_cur_id,rc_this_oldid,rc_last_oldid,rc_type,rc_source,rc_patrolled,rc_ip,rc_old_len,rc_new_len,rc_deleted,rc_logid,rc_log_type,rc_log_action,rc_params,COALESCE( comment_rc_comment.comment_text, rc_comment ) AS `rc_comment_text`,comment_rc_comment.comment_data AS `rc_comment_data`,comment_rc_comment.comment_id AS `rc_comment_cid`,rc_user,rc_user_text,NULL AS `rc_actor`,page_latest,(SELECT  GROUP_CONCAT(ct_tag SEPARATOR ',')  FROM `change_tag`    WHERE ct_rc_id=rc_id  ) AS `ts_tags`,fp_stable,fp_pending_since,ores_damaging_cls.oresc_probability AS `ores_damaging_score`,ores_goodfaith_cls.oresc_probability AS `ores_goodfaith_score`  FROM `recentchanges` LEFT JOIN `comment` `comment_rc_comment` ON ((comment_rc_comment.comment_id = rc_comment_id)) LEFT JOIN `page` ON ((rc_cur_id=page_id)) LEFT JOIN `flaggedpages` <CUT>   ORDER BY rc_timestamp DESC  LIMIT 50

@Halfak @CCicalese_WMF Sorry to ask, but it seems both your teams moved the task to a "Radar"-like column, which suggests both assumed the other to take action.

The feature in question (ORES on Recentchangeslinked) was disabled, but appears to have been re-enabled now (unsure since when). That seems to be in-scope for Scoring-team. On the other hand, @Ladsgroup expressed it might require expertise from Platform to move forward. Thoughts?

The initial report showed a query which didn't even use ORES, so it seems unfair to assign it to them.

I want to remember that RecentChangesLinked is a new functionality, with some issues in the past

Actually it has existed since the first version of MediaWiki, although the performance has gotten steadily worse over time, in line with Wirth's law.

Special:RecentChangesLinked causes performance problems because it has the potential to generate queries which are a massive intersection between recentchanges and a links table. It's similar to the watchlist, except that the pages that generate the massive intersections are public and linkable instead of being confined to the few users with large watchlists. There are templates that are used by >10% of all pages. MySQL would have the choice of scanning 8M rows in the recentchanges table looking for templatelinks matches, versus say 4M rows in the templatelinks table looking for recentchanges matches. Neither option looks all that great, especially if you then add an unindexed filter which matches no rows, forcing MySQL to scan the intersection to completion. Either ORES or change_tags could play the role of unindexed filter — EXPLAIN SELECT suggests there is no prospect of them being used as the first table in a query plan even if there is an index which could theoretically prove that they match no rows.

Since 9dc2b29fb56a075c92d24e0084b7c43bd867f90b, RCL has the charming ability to follow links in either direction.

What are the options?

  • Scale up/out. enwiki.recentchanges is only 3GB, you can provide a copy on as many servers as you like. The links tables are much larger, and could be sharded. Query every shard in parallel, joining with a local copy of recentchanges.
  • Have a links table which is the denormalized cartesian product of rc_timestamp and the existing links tables. Essentially you would precompute every possible RCL page by inserting every link together with its rev_timestamp into a table indexed by link identity and timestamp.
  • Gracefully degrade.
    • Cache aggressively. We could assume that query results don't change except by addition of new revisions. So if we have a cached result set from yesterday and serve the same query today, we only need to scan 1 day of recentchanges and append it to the old result set, we don't need to scan the whole table again. The results will be wrong if the links change, but maybe that's OK.
    • Keep statistics about the links tables and refuse to filter a big intersection.
    • Use the query killer.

I guess I am fine with using the query killer for now as long as it is not causing actual problems.

@tstarling Thanks for checking. In processing the backlog I mixed up this issue with T163063, which the conversation on this task went on about for a bit as well. That one was marked resolved, so I guess the underlying core issue remains. There reason the sql-join to ORES-tables shows up in most queries is presumably just because ORES and RCFilters are enabled by default now.

cicalese moved this task from Watching to Backlog on the Core-Platform-Team-Old board.

@cicalese -- the Growth team considered looking into this ticket, but it sounds like your team is doing it. Is that right?

@MMiller_WMF Sorry for the delay. We are not actively working it at this point, so if you get to it first, that would be great.

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:11 PM

Change 348378 merged by jenkins-bot:
[mediawiki/extensions/ORES@master] Disable ORES in Recentchangeslinked

https://gerrit.wikimedia.org/r/348378

Looking at slow queries today (Tendril and Logstash) I see that Ores now part of the Recentchangeslinked queries again. Whether that is what is making them slow is unknown, but they were slow before when it was enabled, and they are slow today with it enabled.

Related:

For example:

Tendril slow query
tmax: 28 seconds
wiki: enwiki

SELECT /* SpecialRecentChangesLinked::doMainQuery ##### */ rc_id, rc_timestamp, rc_namespace, rc_title, rc_minor, rc_bot, rc_new, rc_cur_id, rc_this_oldid, rc_last_oldid, rc_type, rc_source, rc_patrolled, rc_ip, rc_old_len, rc_new_len, rc_deleted, rc_logid, rc_log_type, rc_log_action, rc_params, comment_rc_comment.comment_text AS `rc_comment_text`, comment_rc_comment.comment_data AS `rc_comment_data`, comment_rc_comment.comment_id AS `rc_comment_cid`, actor_rc_user.actor_user AS `rc_user`, actor_rc_user.actor_name AS `rc_user_text`, rc_actor, page_latest, (SELECT GROUP_CONCAT(ctd_name SEPARATOR ', ') FROM `change_tag` JOIN `change_tag_def` ON ((ct_tag_id=ctd_id)) WHERE ct_rc_id=rc_id ) AS `ts_tags`, fp_stable, fp_pending_since, ores_damaging_cls.oresc_probability AS `ores_damaging_score`, ores_goodfaith_cls.oresc_probability AS `ores_goodfaith_score` FROM `recentchanges` JOIN `comment` `comment_rc_comment` ON ((comment_rc_comment.comment_id = rc_comment_id)) JOIN `actor` `actor_rc_user` ON ((actor_rc_user.actor_id = rc_actor)) LEFT JOIN `page` ON ((rc_cur_id=page_id)) LEFT JOIN `flaggedpages` ON ((fp_page_id = rc_cur_id)) LEFT JOIN `ores_classification` `ores_damaging_cls` ON (ores_damaging_cls.oresc_model = 37 AND (ores_damaging_cls.oresc_rev=rc_this_oldid) AND ores_damaging_cls.oresc_class = 1) LEFT JOIN `ores_classification` `ores_goodfaith_cls` ON (ores_goodfaith_cls.oresc_model = 38 AND (ores_goodfaith_cls.oresc_rev=rc_this_oldid) AND ores_goodfaith_cls.oresc_class = 1) JOIN `categorylinks` ON ((rc_cur_id = cl_from)) WHERE rc_bot = 0 AND (rc_type != 6) AND (rc_source != 'wb') AND (rc_timestamp >= '20200616160908') AND cl_to = 'Webarchive_template_wayback_links' ORDER BY rc_timestamp DESC LIMIT 50 /* ##### db1099:3311 enwiki 13s */