Cache invalidations coming from the JobQueue are causing lag on several wikis
Closed, ResolvedPublic

Description

Edit: This used to be an isolated, occasional issue; now it is causing generalized edit slowdowns and making recentchanges/watchlists unreadable on several wikis.

The most impacted shards are s2 and s7, but it looks generalized? https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?orgId=1&from=1493556218361&to=1493577818361&var-dc=codfw%20prometheus%2Fops

It started at 16:57 on 2017-04-30 and multiplied DB network usage tenfold: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1&from=1493569412579&to=1493572762650&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s1&var-shard=s2&var-shard=s3&var-shard=s4&var-shard=s5&var-shard=s6&var-shard=s7&var-role=master

It could be:
https://id.wikipedia.org/w/index.php?title=Istimewa:Kontribusi_pengguna/Wagino_Bot&offset=20170430172107&limit=500&target=Wagino+Bot

But the edit rate doesn't seem too crazy.

s2 master at the time: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=db2017&from=1493564488494&to=1493578728690

I do not expect this to create continued issues, but I wanted to report it and see if there is something to improve/tune about async processing speed.

Edit: Varnish doesn't look happy at that time: https://grafana.wikimedia.org/dashboard/db/varnish-traffic?refresh=5m&panelId=10&fullscreen&orgId=1&from=1493486799140&to=1493638662298

Edit 2: Apparently Varnish is OK with the load.

Event Timeline

For the long term: how useful is this field, and could it be separated from the rest of the table if it happens to be hotter than the rest of the columns?

The query does not come from HTMLCacheUpdateJob (which calls HTMLCacheUpdateJob::invalidateTitles) or seemingly anything else in the core /jobqueue/jobs directory.

If something causes lag, it's not surprising the queues would delay somewhat (due to lag checks).

It would be nice to have stack traces. I do see that ChangeNotificationJob seems to work on a variable number of pages, using for-loops with no obvious waitForReplication() checks. That could be related to the several DBPerformance entries around the time, such as:

/rpc/RunJobs.php?wiki=commonswiki&type=ChangeNotification&maxtime=60&maxmem=300M
Expectation (writeQueryTime <= 5) by JobRunner::run not met (actual: 45.238716363907):
[transaction 0de1b8 writes to 10.64.48.23 (commonswiki)]
#0 /srv/mediawiki/php-1.30.0-wmf.1/includes/libs/rdbms/TransactionProfiler.php(284): Wikimedia\Rdbms\TransactionProfiler->reportExpectationViolated()
#1 /srv/mediawiki/php-1.30.0-wmf.1/includes/libs/rdbms/database/Database.php(2805): Wikimedia\Rdbms\TransactionProfiler->transactionWritingOut()
#2 /srv/mediawiki/php-1.30.0-wmf.1/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1230): Wikimedia\Rdbms\Database->commit()
#3 (): Closure$Wikimedia\Rdbms\LoadBalancer::commitMasterChanges()
#4 /srv/mediawiki/php-1.30.0-wmf.1/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1498): call_user_func_array()
#5 /srv/mediawiki/php-1.30.0-wmf.1/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1242): Wikimedia\Rdbms\LoadBalancer->forEachOpenMasterConnection()
#6 (): Wikimedia\Rdbms\LoadBalancer->commitMasterChanges()
#7 /srv/mediawiki/php-1.30.0-wmf.1/includes/libs/rdbms/lbfactory/LBFactory.php(183): call_user_func_array()
#8 (): Closure$Wikimedia\Rdbms\LBFactory::forEachLBCallMethod()
#9 /srv/mediawiki/php-1.30.0-wmf.1/includes/libs/rdbms/lbfactory/LBFactoryMulti.php(417): call_user_func_array()
#10 /srv/mediawiki/php-1.30.0-wmf.1/includes/libs/rdbms/lbfactory/LBFactory.php(186): Wikimedia\Rdbms\LBFactoryMulti->forEachLB()
#11 /srv/mediawiki/php-1.30.0-wmf.1/includes/libs/rdbms/lbfactory/LBFactory.php(227): Wikimedia\Rdbms\LBFactory->forEachLBCallMethod()
#12 /srv/mediawiki/php-1.30.0-wmf.1/includes/jobqueue/JobRunner.php(566): Wikimedia\Rdbms\LBFactory->commitMasterChanges()
#13 /srv/mediawiki/php-1.30.0-wmf.1/includes/jobqueue/JobRunner.php(291): JobRunner->commitMasterChanges()
#14 /srv/mediawiki/php-1.30.0-wmf.1/includes/jobqueue/JobRunner.php(189): JobRunner->executeJob()
#15 /srv/mediawiki/rpc/RunJobs.php(47): JobRunner->run()
#16 {main}

Not sure how much of that is write size vs. contention; e.g., evidence of the latter:

	Sub-optimal transaction on DB(s) [10.64.48.23 (commonswiki) (TRX#137c8a)]: 
0	0.000463	query-m: INSERT INTO `recentchanges` (rc_user,rc_user_text,rc_comment,rc_type,rc_minor,rc_bot,rc_patrolled,rc_params,rc_timestamp,rc_log_type,rc_log_action,rc_source,rc_deleted,rc_new,rc_namespace,rc_title,rc_old_len,rc_new_len,rc_this_oldid,rc_last_oldid,rc_cur_i [TRX#137c8a]
1	1.036443	query: SELECT eu_page_id,eu_entity_id,eu_aspect FROM `wbc_entity_usage` WHERE eu_entity_id = 'X')  [TRX#da9550]
2	29.182062	...delay...
3	0.407057	query: SELECT page_namespace,page_title,page_id,page_len,page_is_redirect,page_latest,page_content_model,page_lang FROM `page` WHERE page_id = 'X' LIMIT N  [TRX#da9550]
4	261.907087	...delay...
5	0.404889	query: SELECT page_namespace,page_title,page_id,page_len,page_is_redirect,page_latest,page_content_model,page_lang FROM `page` WHERE page_id = 'X' LIMIT N  [TRX#da9550]
6	155.734977	...delay...
7	0.405216	query: SELECT page_namespace,page_title,page_id,page_len,page_is_redirect,page_latest,page_content_model,page_lang FROM `page` WHERE page_id = 'X' LIMIT N  [TRX#da9550]
8	22.719406	...delay...
9	0.404474	query: SELECT page_namespace,page_title,page_id,page_len,page_is_redirect,page_latest,page_content_model,page_lang FROM `page` WHERE page_id = 'X' LIMIT N  [TRX#da9550]
10	33.385787	...delay...
11	0.410447	query: SELECT page_namespace,page_title,page_id,page_len,page_is_redirect,page_latest,page_content_model,page_lang FROM `page` WHERE page_id = 'X' LIMIT N  [TRX#da9550]
12	4.256901	...delay...
13	0.406822	query: SELECT page_namespace,page_title,page_id,page_len,page_is_redirect,page_latest,page_content_model,page_lang FROM `page` WHERE page_id = 'X' LIMIT N  [TRX#da9550]
14	25.997826	...delay...
15	0.410249	query: SELECT page_namespace,page_title,page_id,page_len,page_is_redirect,page_latest,page_content_model,page_lang FROM `page` WHERE page_id = 'X' LIMIT N  [TRX#da9550]
16	98.733947	...delay...

...which is a long time to hold recentchanges row locks. OTOH, I see no indication of that contention having to do with the page table writes.

While contention is bad in general, it is the opposite of lag; more contention would actually create less lag. Of course, both could have a common source: large updates causing contention on the master, and then lag because the transaction size is large.

I wonder if, instead of fixing each individual query, we should wrap database writes with some kind of monitoring layer that would measure the DB state (with heavy caching, of course) and throttle things automatically. I could even increase the pt-heartbeat rate for more accurate lag measurement, down to sub-second resolution. All of this is probably off-topic here.
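For what it's worth, a minimal sketch of such a layer (LagAwareWriter is a hypothetical name; MediaWikiServices::getDBLoadBalancerFactory() and LBFactory::waitForReplication() are real rdbms APIs):

<?php
// Hypothetical sketch of a lag-aware write wrapper; not existing MediaWiki code.
use MediaWiki\MediaWikiServices;

class LagAwareWriter {
    /** @var int Seconds to wait, at most, for replicas between batches */
    private $timeout;

    public function __construct( $timeout = 10 ) {
        $this->timeout = $timeout;
    }

    /**
     * Run a callable that performs master writes, then block until the
     * replicas have caught up, so a sustained job cannot outrun replication.
     */
    public function writeThrottled( callable $writeBatch ) {
        $lbFactory = MediaWikiServices::getInstance()->getDBLoadBalancerFactory();
        $writeBatch();
        // Wait (up to the timeout) for replicas to reach the master position.
        $lbFactory->waitForReplication( [ 'timeout' => $this->timeout ] );
    }
}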

aaron triaged this task as Medium priority. May 26 2017, 5:06 PM

@daniel, can you look into the number of purges happening in ChangeNotification jobs? I don't see any throttling or lag checks in the job code.

/rpc/RunJobs.php?wiki=commonswiki&type=ChangeNotification&maxtime=60&maxmem=300M

Expectation (maxAffected <= 500) by JobRunner::run not met (actual: 89936):
[transaction f7dd6f writes to 10.64.48.23 (commonswiki)]
#0 /srv/mediawiki/php-1.30.0-wmf.7/includes/libs/rdbms/TransactionProfiler.php(294): Wikimedia\Rdbms\TransactionProfiler->reportExpectationViolated()
#1 /srv/mediawiki/php-1.30.0-wmf.7/includes/libs/rdbms/database/Database.php(2888): Wikimedia\Rdbms\TransactionProfiler->transactionWritingOut()
#2 /srv/mediawiki/php-1.30.0-wmf.7/includes/libs/rdbms/database/Database.php(756): Wikimedia\Rdbms\Database->commit()
#3 /srv/mediawiki/php-1.30.0-wmf.7/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1084): Wikimedia\Rdbms\Database->close()
#4 (): Closure$Wikimedia\Rdbms\LoadBalancer::closeAll()
#5 /srv/mediawiki/php-1.30.0-wmf.7/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1485): call_user_func_array()
#6 /srv/mediawiki/php-1.30.0-wmf.7/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1085): Wikimedia\Rdbms\LoadBalancer->forEachOpenConnection()
#7 /srv/mediawiki/php-1.30.0-wmf.7/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1076): Wikimedia\Rdbms\LoadBalancer->closeAll()
#8 /srv/mediawiki/php-1.30.0-wmf.7/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1665): Wikimedia\Rdbms\LoadBalancer->disable()
#9 (): Wikimedia\Rdbms\LoadBalancer->__destruct()
#10 {main}

If those are page_touched updates, that would be a pretty large number of purges to make in one job at once.

aaron removed aaron as the assignee of this task. Jul 6 2017, 10:25 PM
aaron subscribed.

ChangeNotificationJob https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/6cfd514ee9/client/includes/ChangeNotificationJob.php#L99

  • ChangeHandler::handleChanges() -> handleChange() -> applyUpdateAction()
  • ChangeHandler::applyUpdateAction(): One of: purgeParserCache(), purgeWebCache(), scheduleRefreshLinks(), or injectRCRecords().

Both the recentchanges INSERT and the page.page_touched UPDATE queries come from this job.
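To make the dispatch concrete, a paraphrased sketch based on the call chain above (the switch shape and action keys are invented for the sketch, not copied from Wikibase source):

<?php
// Illustrative paraphrase of ChangeHandler::applyUpdateAction(); the action
// keys here are hypothetical, but the four updater methods are the real ones.
function applyUpdateAction( $action, Wikibase\Client\Changes\WikiPageUpdater $updater, array $titles ) {
    switch ( $action ) {
        case 'parser-purge':
            $updater->purgeParserCache( $titles );     // page.page_touched UPDATEs
            break;
        case 'web-purge':
            $updater->purgeWebCache( $titles );        // CDN purges, no DB writes
            break;
        case 'links-update':
            $updater->scheduleRefreshLinks( $titles ); // queues RefreshLinksJobs
            break;
        case 'rc-inject':
            $updater->injectRCRecords( $titles );      // recentchanges INSERTs
            break;
    }
}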

Krinkle raised the priority of this task from Medium to High. Jul 6 2017, 10:42 PM

I also wonder why some of those log warnings come from close() while others have the proper commitMasterChanges() frame in the stack trace. Normally, there should be nothing left to commit by the time close() runs; its commit is just a sanity measure.

We were theorizing the other day on IRC that the code was trying to close a connection that was already closed. Maybe the query killer kills long-running queries and the connection handler doesn't like that?

Change 364094 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[mediawiki/extensions/Wikibase@master] Wait for replication after making changes in WikiPageUpdater

https://gerrit.wikimedia.org/r/364094

  • PageUpdater::purgeParserCache() -> foreach titles: Title::invalidateCache()

My patch adds waitForReplication() to this.

  • PageUpdater::purgeWebCache() -> foreach titles: Title::purgeSquid()

I'm not sure about adding waitForReplication() here; I checked the code and couldn't find any place where it changes the database. Correct me if I'm wrong.

  • PageUpdater::scheduleRefreshLinks() -> foreach titles: JobQueueGroup->push( new RefreshLinksJob )

RefreshLinksJob already has waitForReplication().

  • PageUpdater::injectRCRecords() -> foreach titles: RecentChange::save()

My patch adds waitForReplication() to this.

@aaron I'm trying to wrap my head around the code in Title::invalidateCache(), but I could use some guidance. What we want to do is efficiently invalidate the parser cache for a set of pages, not just a single page. I could easily create AutoCommitUpdate objects that do this in batches of, say, 100 titles. But it's unclear to me how this interacts with automatic transactions. We'd want to wait for replication after each batch, but we can only do that if the transaction was committed. Should I start a new transaction in each AutoCommitUpdate object? Or should I use commitAndWaitForReplication() in the callback?
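For reference, one possible shape, as a sketch rather than an authoritative answer (it assumes the job runs in outer transaction scope; touchPageBatch() is a hypothetical helper, sketched further down in this thread): do the batches directly in the job with the transaction-ticket API instead of inside AutoCommitUpdate callbacks.

<?php
// Sketch: commit and wait for replication between batches via the ticket API.
use MediaWiki\MediaWikiServices;

function invalidatePagesInBatches( array $pageIds ) {
    $lbFactory = MediaWikiServices::getInstance()->getDBLoadBalancerFactory();
    // The ticket asserts there is no caller-owned transaction open, which is
    // what makes committing between batches safe.
    $ticket = $lbFactory->getEmptyTransactionTicket( __METHOD__ );
    $dbw = $lbFactory->getMainLB()->getConnection( DB_MASTER );
    foreach ( array_chunk( $pageIds, 100 ) as $batch ) {
        touchPageBatch( $dbw, $batch ); // hypothetical helper: one batched UPDATE
        // Commits the transaction round and waits for replicas to catch up.
        $lbFactory->commitAndWaitForReplication( __METHOD__, $ticket );
    }
}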

@jcrespo Looking at T169884, it seems that it's an unrelated issue triggered by EmailNotification->notifyOnPageChange(). As far as I understand, notifyOnPageChange should not be triggered by parser cache invalidation, right? If it is, please let me know what code path leads from UpdateNotificationJob or WikiPageUpdater to notifyOnPageChange().

@aaron I'm wondering whether WikiPageUpdater::purgeParserCache() is needed at all, if we call WikiPageUpdater::scheduleRefreshLinks() anyway. It's redundant, right?

Also, RefreshLinksJob is strange and wonderful... we currently do:

			$this->jobQueueGroup->push( $job );
			$this->jobQueueGroup->deduplicateRootJob( $job );

...with a separate job for each title. But I'm guessing we should use a single job for all titles, and then just do

			$this->jobQueueGroup->lazyPush( $job );

Is that right?
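A sketch of that single-job variant (assuming the multi-page 'pages' parameter that RefreshLinksJob::run() supports; the exact parameter layout should be double-checked against the job class):

<?php
// Sketch: one RefreshLinksJob for all titles, queued lazily at request end.
function scheduleRefreshLinks( array $titles ) {
    $pages = [];
    foreach ( $titles as $title ) {
        // Assumed layout: page ID => [ namespace, DB key ]
        $pages[$title->getArticleID()] = [ $title->getNamespace(), $title->getDBkey() ];
    }
    $job = new RefreshLinksJob(
        reset( $titles ),      // first title, used as the job's anchor page
        [ 'pages' => $pages ]
    );
    // lazyPush() batches the queue insert at the end of the request, instead
    // of a push + deduplicateRootJob round trip per title.
    JobQueueGroup::singleton()->lazyPush( $job );
}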

Change 364198 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/extensions/Wikibase@master] Use multi-page RefreshLinksJobs in WikiPageUpdater

https://gerrit.wikimedia.org/r/364198

Change 364205 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/extensions/Wikibase@master] Use RefreshLinksJob instead of Title::invalidateCache

https://gerrit.wikimedia.org/r/364205

@aaron another question: does RefreshLinksJob also purge the CDN cache automatically? Should it? It does update the parser cache...

It saves the cache as a convenience in some cases (since the relevant htmlCacheUpdate job uses the rootJobTimestamp when setting page_touched, some cache misses can be avoided by doing so). The only CDN purges refreshLinks does are for minor things like file description pages (since they list usage links).
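In code terms, the convenience looks roughly like this (a sketch, assuming the job carries rootJobTimestamp in its params, as root jobs do):

<?php
// Touch pages with the root job's timestamp rather than "now", so the later
// htmlCacheUpdate job's page_touched bump becomes a no-op and some parser
// cache misses are avoided.
$touchTime = isset( $job->params['rootJobTimestamp'] )
    ? $job->params['rootJobTimestamp']
    : wfTimestampNow();
$title->invalidateCache( $touchTime ); // only bumps page_touched if it is older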

Change 364094 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Wait for replication after adding RC entries in WikiPageUpdater

https://gerrit.wikimedia.org/r/364094

@daniel @Ladsgroup This caused an error spike in Logstash:
https://logstash.wikimedia.org/goto/709280746172b68115f62db346b06201

(Screenshot: Logstash error spike, 2017-07-21)

Wikimedia\Rdbms\LBFactory::getEmptyTransactionTicket: Wikibase\Client\Changes\WikiPageUpdater::injectRCRecords does not have outer scope.
#0 /srv/mediawiki/php-1.30.0-wmf.10/extensions/Wikidata/extensions/Wikibase/client/includes/Changes/WikiPageUpdater.php(161): Wikimedia\Rdbms\LBFactory->getEmptyTransactionTicket()
#1 /srv/mediawiki/php-1.30.0-wmf.10/extensions/Wikidata/extensions/Wikibase/client/includes/Changes/ChangeHandler.php(250): Wikibase\Client\Changes\WikiPageUpdater->injectRCRecords()
#2 /srv/mediawiki/php-1.30.0-wmf.10/extensions/Wikidata/extensions/Wikibase/client/includes/Changes/ChangeHandler.php(169): Wikibase\Client\Changes\ChangeHandler->applyUpdateAction()
#3 /srv/mediawiki/php-1.30.0-wmf.10/extensions/Wikidata/extensions/Wikibase/client/includes/Changes/ChangeHandler.php(137): Wikibase\Client\Changes\ChangeHandler->handleChange()
#4 /srv/mediawiki/php-1.30.0-wmf.10/extensions/Wikidata/extensions/Wikibase/client/includes/ChangeNotificationJob.php(99): Wikibase\Client\Changes\ChangeHandler->handleChanges()
#5 /srv/mediawiki/php-1.30.0-wmf.10/includes/jobqueue/JobRunner.php(293): Wikibase\Client\ChangeNotificationJob->run()
#6 /srv/mediawiki/php-1.30.0-wmf.10/includes/jobqueue/JobRunner.php(193): JobRunner->executeJob()
#7 /srv/mediawiki/rpc/RunJobs.php(47): JobRunner->run()
#8 {main}

Filed separately as T171370: ERROR: "LBFactory::getEmptyTransactionTicket: WikiPageUpdater::injectRCRecords does not have outer scope".

I commented on the patch: it's a __METHOD__ mismatch problem, so the per-batch commit/wait steps don't happen (just the one big commit at the end, like before).
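Roughly, the failure mode is as follows (a sketch; the real checks live in LBFactory):

<?php
// If the ticket is requested while master writes are already pending,
// getEmptyTransactionTicket() logs "does not have outer scope" (the error
// above) and returns null. The later commitAndWaitForReplication() then sees
// a mismatched ticket and skips the commit/wait entirely, leaving one big
// commit at the end, as before the patch.
$lbFactory = MediaWiki\MediaWikiServices::getInstance()->getDBLoadBalancerFactory();

$ticket = $lbFactory->getEmptyTransactionTicket( __METHOD__ ); // null if writes pending
// ... one batch of master writes ...
$lbFactory->commitAndWaitForReplication( __METHOD__, $ticket ); // no-op on ticket mismatch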

Change 364198 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Use multi-page RefreshLinksJobs in WikiPageUpdater

https://gerrit.wikimedia.org/r/364198

Change 364205 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Use HTMLCacheUpdateJob instead of invalidateCache

https://gerrit.wikimedia.org/r/364205

I've been told that several thousand Title::invalidateCache UPDATEs per second caused trouble on s7 overnight; not sure if this is related:

https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?orgId=1&var-dc=eqiad%20prometheus%2Fops&panelId=7&fullscreen&from=1502261251242&to=1502347651243

UPDATE /* Title::invalidateCache  */  `page` SET page_touched = '20170809234330' WHERE page_id = '645152' AND (page_touched < '20170809234330')
UPDATE /* Title::invalidateCache  */  `page` SET page_touched = '20170809234330' WHERE page_id = '645704' AND (page_touched < '20170809234330')
UPDATE /* Title::invalidateCache  */  `page` SET page_touched = '20170809234330' WHERE page_id = '646071' AND (page_touched < '20170809234330')
UPDATE /* Title::invalidateCache  */  `page` SET page_touched = '20170809234330' WHERE page_id = '646352' AND (page_touched < '20170809234330')
UPDATE /* Title::invalidateCache  */  `page` SET page_touched = '20170809234330' WHERE page_id = '646766' AND (page_touched < '20170809234330')
UPDATE /* Title::invalidateCache  */  `page` SET page_touched = '20170809234330' WHERE page_id = '647384' AND (page_touched < '20170809234330')
UPDATE /* Title::invalidateCache  */  `page` SET page_touched = '20170809234330' WHERE page_id = '649512' AND (page_touched < '20170809234330')
UPDATE /* Title::invalidateCache  */  `page` SET page_touched = '20170809234330' WHERE page_id = '649759' AND (page_touched < '20170809234330')
UPDATE /* Title::invalidateCache  */  `page` SET page_touched = '20170809234330' WHERE page_id = '649849' AND (page_touched < '20170809234330')
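For contrast, a sketch of the batched form that a helper like the hypothetical touchPageBatch() mentioned earlier would issue (essentially what HTMLCacheUpdateJob does): one UPDATE per chunk of page IDs, with the same page_touched guard as above.

<?php
// Sketch of a batched page_touched update: the IN list replaces hundreds of
// row-at-a-time statements like the ones shown above.
use Wikimedia\Rdbms\IDatabase;

function touchPageBatch( IDatabase $dbw, array $pageIds ) {
    $now = $dbw->timestamp();
    $dbw->update(
        'page',
        [ 'page_touched' => $now ],
        [
            'page_id' => $pageIds,                        // expands to IN (...)
            'page_touched < ' . $dbw->addQuotes( $now ),  // same guard as above
        ],
        __METHOD__
    );
}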

To avoid the continuous lagging on non-directly-pooled hosts (the passive DC codfw, labs, and other hosts replicating on a second tier), I have forced a slowdown of writes to the pace of the slowest eqiad slaves using semi-sync replication, automatically adding a pause of up to 100 ms between writes. I do not know if that will have an effect such as an increase in pending job queue tasks or a slowdown in the rate of edits. CC Performance-Team.

@daniel this happened again yesterday evening (Montreal time), and we got some pages for the slow slaves not being able to catch up with the master, as you can see in Jaime's comments.
I am at Wikimania, so if you like we can try to discuss a bit how to get this properly fixed in code, as the hack we have applied (throttling with semi-sync replication on the slow slaves) might have secondary effects that we are not aware of.

Thanks!

@Marostegui @daniel: Please keep me in the loop when discussing this.

@daniel, @hoo and I got together to talk about this yesterday, and it looks like the patch that is expected to fix this is ready and was already merged.

However, due to a rollback of some Wikidata releases it is not yet in production, but they expect it to be released in the next few days.

Thank you very much guys!

@hoo this happened again just now. As we talked about during Wikimania, here is the ping again... do you happen to know if this is going to be released this week after all? :-)

Any fix that landed in Wikibase before the wmf.14 build (https://github.com/wikimedia/mediawiki-extensions-WikiData/tree/wmf/1.30.0-wmf.14) will roll out to group1 wikis with the train happening at 19:00 UTC today.

Thanks! Does that mean it would need to appear here?: https://www.mediawiki.org/wiki/MediaWiki_1.30/wmf.14

jcrespo raised the priority of this task from High to Unbreak Now!. Aug 17 2017, 3:53 PM

Setting as Unbreak Now! because this is preventing collaborators from editing articles, which not only slows down volunteers but also causes large frustration among them: https://grafana.wikimedia.org/dashboard/db/wikidata-edits?refresh=1m&panelId=1&fullscreen&orgId=1&from=1502980683784&to=1502984283784 If 90% of edits go away because of this bug, we should give it top priority. CC Editing-team Community-Tech.

Mentioned in SAL (#wikimedia-operations) [2017-08-17T16:54:53Z] <thcipriani@tin> rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis back to wmf.14 now for T164173

@thcipriani @aaron I know something was done yesterday (thank you!). May I ask for an update on the state, to know if the UBN is still active?

The fix is deployed, so I think this task should be closed. Can you see anything in the logs that implies it's still happening?

It was not easy to see, as it would happen all of a sudden.
I would prefer to leave it open for a few days until we can see this is gone for good. It was happening almost every single day lately, so if it does not happen again in a few days, I would consider it solved :-)

As Jaime asked above, if something was already deployed (let's wait for confirmation), we could remove the UBN.

Thanks! Feel free to reopen in case it starts happening again.