
Dispatchers occasionally seem to "freeze" for certain wikis
Closed, ResolvedPublic

Description

In recent weeks, the dispatchers have occasionally seemed to stop dispatching to certain wikis for some time, leading to huge dispatching backlogs. The symptoms are a huge dispatching backlog (mostly on just one wiki, though) and very little CPU usage by the dispatchers on terbium.
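
For reference, one quick way to spot such a backlog from the database side is to look at the per-wiki dispatch state. This is only a rough sketch, assuming the standard Wikibase wb_changes_dispatch / wb_changes schema on the repo database; column names may differ between Wikibase versions:

-- How far behind is each enabled client wiki, and does its row still carry a lock name?
SELECT
    chd_site,
    chd_seen,
    (SELECT MAX(change_id) FROM wb_changes) - chd_seen AS changes_behind,
    chd_touched,
    chd_lock
FROM wb_changes_dispatch
WHERE chd_disabled = 0
ORDER BY changes_behind DESC
LIMIT 10;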

See T178624 ("enwiki dispatch lag is 8 hours") for one (rather extreme) example of this.

I looked into this briefly and found T178652, which is possibly related to this, but certainly isn't the root cause.

This also happened yesterday with svwiki; on inspection, it seems a stale lock for that wiki was being held by the Redis LockManager. In order to rule out a coincidence here, I would like to look at another such example…

Event Timeline

Change 386591 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/puppet@production] Log Wikidata dispatchers on terbium

https://gerrit.wikimedia.org/r/386591

Change 386591 merged by Elukey:
[operations/puppet@production] mediawiki::maintenance::wikidata: log Wikidata dispatchers

https://gerrit.wikimedia.org/r/386591

This just happened again: zhwiki got backlogged by 8h+.

It was not locked at this time:

hoo@terbium:~$ mwscript eval.php --wiki wikidatawiki
> $lockManager = LockManagerGroup::singleton( wfWikiID() )->get( 'redisLockManager' );

> $clientLockName = str_replace( ' ', '_', "Wikibase.wikidatawiki.dispatchChanges.zhwiki" );

>  var_dump( $lockManager->lock( [ $clientLockName ] ) );
object(StatusValue)#355 (6) {
  ["ok":protected]=>
  bool(true)
  ["errors":protected]=>
  array(0) {
  }
  ["value"]=>
  NULL
  ["success"]=>
  array(0) {
  }
  ["successCount"]=>
  int(0)
  ["failCount"]=>
  int(0)
}

>  var_dump( $lockManager->unlock( [ $clientLockName ] ) );
object(StatusValue)#355 (6) {
  ["ok":protected]=>
  bool(true)
  ["errors":protected]=>
  array(0) {
  }
  ["value"]=>
  NULL
  ["success"]=>
  array(0) {
  }
  ["successCount"]=>
  int(0)
  ["failCount"]=>
  int(0)
}

There are actually a lot of changes related to zhwiki happening:

mysql:wikiadmin@db1070 [wikidatawiki]> SELECT COUNT(*) FROM (SELECT * FROM recentchanges WHERE rc_source = 'mw.edit' ORDER BY rc_timestamp DESC LIMIT 100000) AS innerQ INNER JOIN wb_changes_subscription ON rc_title = cs_entity_id AND rc_namespace = 0 AND cs_subscriber_id = 'zhwiki';
+----------+
| COUNT(*) |
+----------+
|    46145 |
+----------+
1 row in set (21.56 sec)

mysql:wikiadmin@db1070 [wikidatawiki]> SELECT COUNT(*) FROM (SELECT * FROM recentchanges WHERE rc_source = 'mw.edit' ORDER BY rc_timestamp DESC LIMIT 100000) AS innerQ INNER JOIN wb_changes_subscription ON rc_title = cs_entity_id AND rc_namespace = 0 AND cs_subscriber_id = 'enwiki';
+----------+
| COUNT(*) |
+----------+
|     9038 |
+----------+
1 row in set (1.75 sec)

Apparently there's a bot adding a lot of Welsh descriptions to Chinese villages, and that seems to be causing these problems. See https://www.wikidata.org/wiki/Special:Contributions/Danielt998.

Change 387114 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[mediawiki/extensions/Wikibase@master] Remove bogus "lock" condition from SqlChangeDispatchCoordinator query

https://gerrit.wikimedia.org/r/387114

The above change will allow us to increase our maximum throughput by (almost) a factor of 8, which should fix these issues, at least for now.
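
For illustration only, here is the kind of query involved; this is not the literal query (the actual candidate selection lives in SqlChangeDispatchCoordinator, and its exact conditions, ordering and batch size differ), but it shows why a lock condition in it is pointless now that locking happens via the Redis LockManager:

-- Rough "before" shape of the candidate selection (lock condition included):
--   SELECT chd_site, chd_db, chd_seen, chd_touched
--   FROM wb_changes_dispatch
--   WHERE chd_disabled = 0
--     AND chd_lock IS NULL        -- the "bogus" lock condition
--   ORDER BY chd_seen ASC
--   LIMIT 10;
--
-- Rough "after" shape: any enabled wiki is a candidate, since locking is
-- handled by the LockManager rather than by chd_lock:
SELECT chd_site, chd_db, chd_seen, chd_touched
FROM wb_changes_dispatch
WHERE chd_disabled = 0
ORDER BY chd_seen ASC
LIMIT 10;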

Change 387114 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Remove bogus "lock" condition from SqlChangeDispatchCoordinator query

https://gerrit.wikimedia.org/r/387114

hoo claimed this task.
hoo removed a project: Patch-For-Review.
hoo moved this task from Review to Done on the Wikidata-Former-Sprint-Board board.