Cache invalidations coming from the JobQueue are causing lag on several wikis
Closed, ResolvedPublic

Description

Edit: This used to be an isolated, occasional issue; now it is causing generalized edit slowdowns and making recentchanges/watchlists unreadable on several wikis.

The most impacted shards are s2 and s7, but it looks generalized? https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?orgId=1&from=1493556218361&to=1493577818361&var-dc=codfw%20prometheus%2Fops

It started at 16:57 on 2017-04-30 and multiplied DB network usage tenfold: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1&from=1493569412579&to=1493572762650&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s1&var-shard=s2&var-shard=s3&var-shard=s4&var-shard=s5&var-shard=s6&var-shard=s7&var-role=master

It could be:
https://id.wikipedia.org/w/index.php?title=Istimewa:Kontribusi_pengguna/Wagino_Bot&offset=20170430172107&limit=500&target=Wagino+Bot

But the edit rate doesn't seem too crazy.

s2 master at the time: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=db2017&from=1493564488494&to=1493578728690

I do not expect this to create continued issues, but I wanted to report it and see if there is something to improve/tune about async processing speed.

Edit: Varnish doesn't look happy at that time: https://grafana.wikimedia.org/dashboard/db/varnish-traffic?refresh=5m&panelId=10&fullscreen&orgId=1&from=1493486799140&to=1493638662298

Edit 2: Apparently Varnish is OK with the load.

Event Timeline

For the long term: how useful is this field, and could it be separated from the rest of the table if it happens to be hotter than the rest of the columns?

The query does not come from HTMLCacheUpdateJob (which calls HTMLCacheUpdateJob::invalidateTitles) or seemingly anything else in the core /jobqueue/jobs directory.

If something causes lag, it's not surprising the queues would delay somewhat (due to lag checks).

It would be nice to have stack traces. I do see that ChangeNotificationJob seems to work on a variable number of pages, using for-loops with no obvious waitForReplication() checks. That could be related to the several DBPerformance entries around the time, such as:

/rpc/RunJobs.php?wiki=commonswiki&type=ChangeNotification&maxtime=60&maxmem=300M
Expectation (writeQueryTime <= 5) by JobRunner::run not met (actual: 45.238716363907):
[transaction 0de1b8 writes to 10.64.48.23 (commonswiki)]
#0 /srv/mediawiki/php-1.30.0-wmf.1/includes/libs/rdbms/TransactionProfiler.php(284): Wikimedia\Rdbms\TransactionProfiler->reportExpectationViolated()
#1 /srv/mediawiki/php-1.30.0-wmf.1/includes/libs/rdbms/database/Database.php(2805): Wikimedia\Rdbms\TransactionProfiler->transactionWritingOut()
#2 /srv/mediawiki/php-1.30.0-wmf.1/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1230): Wikimedia\Rdbms\Database->commit()
#3 (): Closure$Wikimedia\Rdbms\LoadBalancer::commitMasterChanges()
#4 /srv/mediawiki/php-1.30.0-wmf.1/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1498): call_user_func_array()
#5 /srv/mediawiki/php-1.30.0-wmf.1/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1242): Wikimedia\Rdbms\LoadBalancer->forEachOpenMasterConnection()
#6 (): Wikimedia\Rdbms\LoadBalancer->commitMasterChanges()
#7 /srv/mediawiki/php-1.30.0-wmf.1/includes/libs/rdbms/lbfactory/LBFactory.php(183): call_user_func_array()
#8 (): Closure$Wikimedia\Rdbms\LBFactory::forEachLBCallMethod()
#9 /srv/mediawiki/php-1.30.0-wmf.1/includes/libs/rdbms/lbfactory/LBFactoryMulti.php(417): call_user_func_array()
#10 /srv/mediawiki/php-1.30.0-wmf.1/includes/libs/rdbms/lbfactory/LBFactory.php(186): Wikimedia\Rdbms\LBFactoryMulti->forEachLB()
#11 /srv/mediawiki/php-1.30.0-wmf.1/includes/libs/rdbms/lbfactory/LBFactory.php(227): Wikimedia\Rdbms\LBFactory->forEachLBCallMethod()
#12 /srv/mediawiki/php-1.30.0-wmf.1/includes/jobqueue/JobRunner.php(566): Wikimedia\Rdbms\LBFactory->commitMasterChanges()
#13 /srv/mediawiki/php-1.30.0-wmf.1/includes/jobqueue/JobRunner.php(291): JobRunner->commitMasterChanges()
#14 /srv/mediawiki/php-1.30.0-wmf.1/includes/jobqueue/JobRunner.php(189): JobRunner->executeJob()
#15 /srv/mediawiki/rpc/RunJobs.php(47): JobRunner->run()
#16 {main}

Not sure how much of that is write size vs. contention; e.g., evidence of the latter:

	Sub-optimal transaction on DB(s) [10.64.48.23 (commonswiki) (TRX#137c8a)]: 
0	0.000463	query-m: INSERT INTO `recentchanges` (rc_user,rc_user_text,rc_comment,rc_type,rc_minor,rc_bot,rc_patrolled,rc_params,rc_timestamp,rc_log_type,rc_log_action,rc_source,rc_deleted,rc_new,rc_namespace,rc_title,rc_old_len,rc_new_len,rc_this_oldid,rc_last_oldid,rc_cur_i [TRX#137c8a]
1	1.036443	query: SELECT eu_page_id,eu_entity_id,eu_aspect FROM `wbc_entity_usage` WHERE eu_entity_id = 'X')  [TRX#da9550]
2	29.182062	...delay...
3	0.407057	query: SELECT page_namespace,page_title,page_id,page_len,page_is_redirect,page_latest,page_content_model,page_lang FROM `page` WHERE page_id = 'X' LIMIT N  [TRX#da9550]
4	261.907087	...delay...
5	0.404889	query: SELECT page_namespace,page_title,page_id,page_len,page_is_redirect,page_latest,page_content_model,page_lang FROM `page` WHERE page_id = 'X' LIMIT N  [TRX#da9550]
6	155.734977	...delay...
7	0.405216	query: SELECT page_namespace,page_title,page_id,page_len,page_is_redirect,page_latest,page_content_model,page_lang FROM `page` WHERE page_id = 'X' LIMIT N  [TRX#da9550]
8	22.719406	...delay...
9	0.404474	query: SELECT page_namespace,page_title,page_id,page_len,page_is_redirect,page_latest,page_content_model,page_lang FROM `page` WHERE page_id = 'X' LIMIT N  [TRX#da9550]
10	33.385787	...delay...
11	0.410447	query: SELECT page_namespace,page_title,page_id,page_len,page_is_redirect,page_latest,page_content_model,page_lang FROM `page` WHERE page_id = 'X' LIMIT N  [TRX#da9550]
12	4.256901	...delay...
13	0.406822	query: SELECT page_namespace,page_title,page_id,page_len,page_is_redirect,page_latest,page_content_model,page_lang FROM `page` WHERE page_id = 'X' LIMIT N  [TRX#da9550]
14	25.997826	...delay...
15	0.410249	query: SELECT page_namespace,page_title,page_id,page_len,page_is_redirect,page_latest,page_content_model,page_lang FROM `page` WHERE page_id = 'X' LIMIT N  [TRX#da9550]
16	98.733947	...delay...

...which is a long time to hold recentchanges row locks. OTOH, I see no indication of that contention having to do with the page table writes.

While contention is bad in general, it is the opposite of lag; more contention would actually create less lag. Of course, both could have a common source: large updates causing contention on the master, and then lag because the transaction size is large.

I wonder if, instead of fixing each individual query, we should wrap database writes with some kind of monitoring layer that would measure the DB state (with heavy caching, of course) and throttle things automatically. I could even increase the pt-heartbeat rate for more accurate lag measurement, down to sub-second resolution. All of this is probably off-topic here.
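For what it's worth, a minimal sketch of such a layer (LagAwareWriter is a hypothetical name; MediaWikiServices::getDBLoadBalancerFactory() and LBFactory::waitForReplication() are real rdbms APIs):

<?php
// Hypothetical sketch of a lag-aware write wrapper; not existing MediaWiki code.
use MediaWiki\MediaWikiServices;

class LagAwareWriter {
    /** @var int Seconds to wait, at most, for replicas between batches */
    private $timeout;

    public function __construct( $timeout = 10 ) {
        $this->timeout = $timeout;
    }

    /**
     * Run a callable that performs master writes, then block until the
     * replicas have caught up, so a sustained job cannot outrun replication.
     */
    public function writeThrottled( callable $writeBatch ) {
        $lbFactory = MediaWikiServices::getInstance()->getDBLoadBalancerFactory();
        $writeBatch();
        // Wait (up to the timeout) for replicas to reach the master position.
        $lbFactory->waitForReplication( [ 'timeout' => $this->timeout ] );
    }
}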

aaron triaged this task as Medium priority. May 26 2017, 5:06 PM

@daniel, can you look into the number of purges happening in ChangeNotification jobs? I don't see any throttling or lag checks in the job code.

/rpc/RunJobs.php?wiki=commonswiki&type=ChangeNotification&maxtime=60&maxmem=300M

Expectation (maxAffected <= 500) by JobRunner::run not met (actual: 89936):
[transaction f7dd6f writes to 10.64.48.23 (commonswiki)]
#0 /srv/mediawiki/php-1.30.0-wmf.7/includes/libs/rdbms/TransactionProfiler.php(294): Wikimedia\Rdbms\TransactionProfiler->reportExpectationViolated()
#1 /srv/mediawiki/php-1.30.0-wmf.7/includes/libs/rdbms/database/Database.php(2888): Wikimedia\Rdbms\TransactionProfiler->transactionWritingOut()
#2 /srv/mediawiki/php-1.30.0-wmf.7/includes/libs/rdbms/database/Database.php(756): Wikimedia\Rdbms\Database->commit()
#3 /srv/mediawiki/php-1.30.0-wmf.7/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1084): Wikimedia\Rdbms\Database->close()
#4 (): Closure$Wikimedia\Rdbms\LoadBalancer::closeAll()
#5 /srv/mediawiki/php-1.30.0-wmf.7/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1485): call_user_func_array()
#6 /srv/mediawiki/php-1.30.0-wmf.7/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1085): Wikimedia\Rdbms\LoadBalancer->forEachOpenConnection()
#7 /srv/mediawiki/php-1.30.0-wmf.7/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1076): Wikimedia\Rdbms\LoadBalancer->closeAll()
#8 /srv/mediawiki/php-1.30.0-wmf.7/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1665): Wikimedia\Rdbms\LoadBalancer->disable()
#9 (): Wikimedia\Rdbms\LoadBalancer->__destruct()
#10 {main}

If those are page_touched updates, that would be a pretty large number of purges to make in one job at once.

aaron removed aaron as the assignee of this task. Jul 6 2017, 10:25 PM
aaron subscribed.

ChangeNotificationJob https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/6cfd514ee9/client/includes/ChangeNotificationJob.php#L99

  • ChangeHandler::handleChanges() -> handleChange() -> applyUpdateAction()
  • ChangeHandler::applyUpdateAction(): One of: purgeParserCache(), purgeWebCache(), scheduleRefreshLinks(), or injectRCRecords().

Both the recentchanges INSERT and the page.page_touched UPDATE queries come from this job.
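To make the dispatch concrete, a paraphrased sketch based on the call chain above (the switch shape and action keys are invented for the sketch, not copied from Wikibase source):

<?php
// Illustrative paraphrase of ChangeHandler::applyUpdateAction(); the action
// keys here are hypothetical, but the four updater methods are the real ones.
function applyUpdateAction( $action, Wikibase\Client\Changes\WikiPageUpdater $updater, array $titles ) {
    switch ( $action ) {
        case 'parser-purge':
            $updater->purgeParserCache( $titles );     // page.page_touched UPDATEs
            break;
        case 'web-purge':
            $updater->purgeWebCache( $titles );        // CDN purges, no DB writes
            break;
        case 'links-update':
            $updater->scheduleRefreshLinks( $titles ); // queues RefreshLinksJobs
            break;
        case 'rc-inject':
            $updater->injectRCRecords( $titles );      // recentchanges INSERTs
            break;
    }
}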

Krinkle raised the priority of this task from Medium to High. Jul 6 2017, 10:42 PM

I also wonder why some of those log warnings come from close() while others have the proper commitMasterChanges() frame in the stack trace. Normally, there should be nothing left to commit by the time close() runs; its commit is just a sanity measure.

We were theorizing the other day on IRC that the code was trying to close a connection that was already closed. Maybe the query killer kills long-running queries and the connection handler doesn't like that?

Change 364094 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[mediawiki/extensions/Wikibase@master] Wait for replication after making changes in WikiPageUpdater

https://gerrit.wikimedia.org/r/364094

  • PageUpdater::purgeParserCache() -> foreach titles: Title::invalidateCache()

My patch adds waitForReplication() to this.

  • PageUpdater::purgeWebCache() -> foreach titles: Title::purgeSquid()

I'm not sure about adding waitForReplication() here; I checked the code and couldn't find any place where it changes the database. Correct me if I'm wrong.

  • PageUpdater::scheduleRefreshLinks() -> foreach titles: JobQueueGroup->push( new RefreshLinksJob )

RefreshLinksJob already has waitForReplication().

  • PageUpdater::injectRCRecords() -> foreach titles: RecentChange::save()

My patch adds waitForReplication() to this.

@aaron I'm trying to wrap my head around the code in Title::invalidateCache(), but I could use some guidance. What we want to do is efficiently invalidate the parser cache for a set of pages, not just a single page. I could easily create AutoCommitUpdate objects that do this in batches of, say, 100 titles. But it's unclear to me how this interacts with automatic transactions. We'd want to wait for replication after each batch, but we can only do that if the transaction was committed. Should I start a new transaction in each AutoCommitUpdate object? Or should I use commitAndWaitForReplication() in the callback?
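For reference, one possible shape, as a sketch rather than an authoritative answer (it assumes the job runs in outer transaction scope; touchPageBatch() is a hypothetical helper, sketched further down in this thread): do the batches directly in the job with the transaction-ticket API instead of inside AutoCommitUpdate callbacks.

<?php
// Sketch: commit and wait for replication between batches via the ticket API.
use MediaWiki\MediaWikiServices;

function invalidatePagesInBatches( array $pageIds ) {
    $lbFactory = MediaWikiServices::getInstance()->getDBLoadBalancerFactory();
    // The ticket asserts there is no caller-owned transaction open, which is
    // what makes committing between batches safe.
    $ticket = $lbFactory->getEmptyTransactionTicket( __METHOD__ );
    $dbw = $lbFactory->getMainLB()->getConnection( DB_MASTER );
    foreach ( array_chunk( $pageIds, 100 ) as $batch ) {
        touchPageBatch( $dbw, $batch ); // hypothetical helper: one batched UPDATE
        // Commits the transaction round and waits for replicas to catch up.
        $lbFactory->commitAndWaitForReplication( __METHOD__, $ticket );
    }
}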

@jcrespo Looking at T169884, it seems that it's an unrelated issue triggered by EmailNotification->notifyOnPageChange(). As far as I understand, notifyOnPageChange should not be triggered by parser cache invalidation, right? If it is, please let me know what code path leads from UpdateNotificationJob or WikiPageUpdater to notifyOnPageChange().

@aaron I'm wondering whether WikiPageUpdater::purgeParserCache() is needed at all, if we call WikiPageUpdater::scheduleRefreshLinks() anyway. It's redundant, right?

Also, RefreshLinksJob is strange and wonderful... we currently do:

			$this->jobQueueGroup->push( $job );
			$this->jobQueueGroup->deduplicateRootJob( $job );

...with a separate job for each title. But I'm guessing we should use a single job for all titles, and then just do

			$this->jobQueueGroup->lazyPush( $job );

Is that right?
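A sketch of that single-job variant (assuming the multi-page 'pages' parameter that RefreshLinksJob::run() supports; the exact parameter layout should be double-checked against the job class):

<?php
// Sketch: one RefreshLinksJob for all titles, queued lazily at request end.
function scheduleRefreshLinks( array $titles ) {
    $pages = [];
    foreach ( $titles as $title ) {
        // Assumed layout: page ID => [ namespace, DB key ]
        $pages[$title->getArticleID()] = [ $title->getNamespace(), $title->getDBkey() ];
    }
    $job = new RefreshLinksJob(
        reset( $titles ),      // first title, used as the job's anchor page
        [ 'pages' => $pages ]
    );
    // lazyPush() batches the queue insert at the end of the request, instead
    // of a push + deduplicateRootJob round trip per title.
    JobQueueGroup::singleton()->lazyPush( $job );
}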

Change 364198 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/extensions/Wikibase@master] Use multi-page RefreshLinksJobs in WikiPageUpdater

https://gerrit.wikimedia.org/r/364198

Change 364205 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/extensions/Wikibase@master] Use RefreshLinksJob instead of Title::invalidateCache

https://gerrit.wikimedia.org/r/364205

@aaron another question: does RefreshLinksJob also purge the CDN cache automatically? Should it? It does update the parser cache...

It saves the cache as a convenience in some cases (since the relevant htmlCacheUpdate job uses the rootJobTimestamp when setting page_touched, some cache misses can be avoided by doing so). The only CDN purges refreshLinks does are for minor things like file description pages (since they list usage links).
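In code terms, the convenience looks roughly like this (a sketch, assuming the job carries rootJobTimestamp in its params, as root jobs do):

<?php
// Touch pages with the root job's timestamp rather than "now", so the later
// htmlCacheUpdate job's page_touched bump becomes a no-op and some parser
// cache misses are avoided.
$touchTime = isset( $job->params['rootJobTimestamp'] )
    ? $job->params['rootJobTimestamp']
    : wfTimestampNow();
$title->invalidateCache( $touchTime ); // only bumps page_touched if it is older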

Change 364094 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Wait for replication after adding RC entries in WikiPageUpdater

https://gerrit.wikimedia.org/r/364094

@daniel @Ladsgroup This caused an error spike in Logstash:
https://logstash.wikimedia.org/goto/709280746172b68115f62db346b06201

(Screenshot: Logstash error spike, 2017-07-21)

Wikimedia\Rdbms\LBFactory::getEmptyTransactionTicket: Wikibase\Client\Changes\WikiPageUpdater::injectRCRecords does not have outer scope.
#0 /srv/mediawiki/php-1.30.0-wmf.10/extensions/Wikidata/extensions/Wikibase/client/includes/Changes/WikiPageUpdater.php(161): Wikimedia\Rdbms\LBFactory->getEmptyTransactionTicket()
#1 /srv/mediawiki/php-1.30.0-wmf.10/extensions/Wikidata/extensions/Wikibase/client/includes/Changes/ChangeHandler.php(250): Wikibase\Client\Changes\WikiPageUpdater->injectRCRecords()
#2 /srv/mediawiki/php-1.30.0-wmf.10/extensions/Wikidata/extensions/Wikibase/client/includes/Changes/ChangeHandler.php(169): Wikibase\Client\Changes\ChangeHandler->applyUpdateAction()
#3 /srv/mediawiki/php-1.30.0-wmf.10/extensions/Wikidata/extensions/Wikibase/client/includes/Changes/ChangeHandler.php(137): Wikibase\Client\Changes\ChangeHandler->handleChange()
#4 /srv/mediawiki/php-1.30.0-wmf.10/extensions/Wikidata/extensions/Wikibase/client/includes/ChangeNotificationJob.php(99): Wikibase\Client\Changes\ChangeHandler->handleChanges()
#5 /srv/mediawiki/php-1.30.0-wmf.10/includes/jobqueue/JobRunner.php(293): Wikibase\Client\ChangeNotificationJob->run()
#6 /srv/mediawiki/php-1.30.0-wmf.10/includes/jobqueue/JobRunner.php(193): JobRunner->executeJob()
#7 /srv/mediawiki/rpc/RunJobs.php(47): JobRunner->run()
#8 {main}

Filed separately as T171370: ERROR: "LBFactory::getEmptyTransactionTicket: WikiPageUpdater::injectRCRecords does not have outer scope".

I commented on the patch: it's a __METHOD__ mismatch problem, so the per-batch commit/wait steps don't happen (just the one big commit at the end, like before).
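Roughly, the failure mode is as follows (a sketch; the real checks live in LBFactory):

<?php
// If the ticket is requested while master writes are already pending,
// getEmptyTransactionTicket() logs "does not have outer scope" (the error
// above) and returns null. The later commitAndWaitForReplication() then sees
// a mismatched ticket and skips the commit/wait entirely, leaving one big
// commit at the end, as before the patch.
$lbFactory = MediaWiki\MediaWikiServices::getInstance()->getDBLoadBalancerFactory();

$ticket = $lbFactory->getEmptyTransactionTicket( __METHOD__ ); // null if writes pending
// ... one batch of master writes ...
$lbFactory->commitAndWaitForReplication( __METHOD__, $ticket ); // no-op on ticket mismatch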

Change 364198 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Use multi-page RefreshLinksJobs in WikiPageUpdater

https://gerrit.wikimedia.org/r/364198

Change 364205 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Use HTMLCacheUpdateJob instead of invalidateCache

https://gerrit.wikimedia.org/r/364205

I've been told that several thousand Title::invalidateCache UPDATEs per second caused trouble on s7 overnight; not sure if this is related:

https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?orgId=1&var-dc=eqiad%20prometheus%2Fops&panelId=7&fullscreen&from=1502261251242&to=1502347651243

UPDATE /* Title::invalidateCache  */  `page` SET page_touched = '20170809234330' WHERE page_id = '645152' AND (page_touched < '20170809234330')
UPDATE /* Title::invalidateCache  */  `page` SET page_touched = '20170809234330' WHERE page_id = '645704' AND (page_touched < '20170809234330')
UPDATE /* Title::invalidateCache  */  `page` SET page_touched = '20170809234330' WHERE page_id = '646071' AND (page_touched < '20170809234330')
UPDATE /* Title::invalidateCache  */  `page` SET page_touched = '20170809234330' WHERE page_id = '646352' AND (page_touched < '20170809234330')
UPDATE /* Title::invalidateCache  */  `page` SET page_touched = '20170809234330' WHERE page_id = '646766' AND (page_touched < '20170809234330')
UPDATE /* Title::invalidateCache  */  `page` SET page_touched = '20170809234330' WHERE page_id = '647384' AND (page_touched < '20170809234330')
UPDATE /* Title::invalidateCache  */  `page` SET page_touched = '20170809234330' WHERE page_id = '649512' AND (page_touched < '20170809234330')
UPDATE /* Title::invalidateCache  */  `page` SET page_touched = '20170809234330' WHERE page_id = '649759' AND (page_touched < '20170809234330')
UPDATE /* Title::invalidateCache  */  `page` SET page_touched = '20170809234330' WHERE page_id = '649849' AND (page_touched < '20170809234330')
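For contrast, a sketch of the batched form that a helper like the hypothetical touchPageBatch() mentioned earlier would issue (essentially what HTMLCacheUpdateJob does): one UPDATE per chunk of page IDs, with the same page_touched guard as above.

<?php
// Sketch of a batched page_touched update: the IN list replaces hundreds of
// row-at-a-time statements like the ones shown above.
use Wikimedia\Rdbms\IDatabase;

function touchPageBatch( IDatabase $dbw, array $pageIds ) {
    $now = $dbw->timestamp();
    $dbw->update(
        'page',
        [ 'page_touched' => $now ],
        [
            'page_id' => $pageIds,                        // expands to IN (...)
            'page_touched < ' . $dbw->addQuotes( $now ),  // same guard as above
        ],
        __METHOD__
    );
}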

To avoid the continuous lagging on non-directly-pooled hosts (the passive DC codfw, labs, and other hosts replicating on a second tier), I have forced a slowdown of writes to the pace of the slowest eqiad slaves using semi-sync replication, automatically adding a pause of up to 100 ms between writes. I do not know if that will have an effect such as an increase in pending job queue tasks or a slowdown in the rate of edits. CC Performance-Team.

@daniel this happened again yesterday evening (Montreal time), and we got some pages for the slow slaves not being able to catch up with the master, as you can see in Jaime's comments.
I am at Wikimania, so if you like we can try to discuss a bit how to get this properly fixed in code, as the hack we have applied (throttling with semi-sync replication on the slow slaves) might have secondary effects that we are not aware of.

Thanks!

@Marostegui @daniel: Please keep me in the loop when discussing this.

@daniel, @hoo and I got together to talk about this yesterday, and it looks like the patch that is expected to fix this is ready and was already merged.

However, due to a rollback of some Wikidata releases it is not yet in production, but they expect it to be released in the next few days.

Thank you very much guys!

@hoo this happened again just now. As we talked about during Wikimania, here is the ping again... do you happen to know if this is going to be released this week after all? :-)

Any fix that landed in Wikibase before the wmf.14 build (https://github.com/wikimedia/mediawiki-extensions-WikiData/tree/wmf/1.30.0-wmf.14) will roll out to group1 wikis with the train happening at 19:00 UTC today.

Thanks! Does that mean it would need to appear here?: https://www.mediawiki.org/wiki/MediaWiki_1.30/wmf.14

jcrespo raised the priority of this task from High to Unbreak Now!. Aug 17 2017, 3:53 PM

Setting as Unbreak Now! because this is preventing collaborators from editing articles, which not only slows down volunteers but also causes large frustration among them: https://grafana.wikimedia.org/dashboard/db/wikidata-edits?refresh=1m&panelId=1&fullscreen&orgId=1&from=1502980683784&to=1502984283784 If 90% of edits go away because of this bug, we should give it top priority. CC Editing-team Community-Tech.

Mentioned in SAL (#wikimedia-operations) [2017-08-17T16:54:53Z] <thcipriani@tin> rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis back to wmf.14 now for T164173

@thcipriani @aaron I know something was done yesterday (thank you!). May I ask for an update on the state, to know if the UBN is still active?

The fix is deployed, so I think this task should be closed. Can you see anything in the logs that implies it's still happening?

It was not easy to see, as it would happen all of a sudden.
I would prefer to leave it open for a few days until we can see this is gone for good. It was happening almost every single day lately, so if it does not happen again in a few days, I would consider it solved :-)

As Jaime asked above, if something was already deployed (let's wait for confirmation), we could remove the UBN.

Thanks! Feel free to reopen in case it starts happening again.