
Dispatchers occasionally seem to "freeze" for certain wikis
Closed, ResolvedPublic

Description

In recent weeks, the dispatchers have occasionally seemed to stop dispatching to certain wikis for some time, leading to huge dispatching backlogs. The symptoms are a huge dispatching backlog (mostly on just one wiki, though) and very little CPU usage by the dispatchers on terbium.
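
For reference, one quick way to spot such a backlog from the database side is to look at the per-wiki dispatch state. This is only a rough sketch, assuming the standard Wikibase wb_changes_dispatch / wb_changes schema on the repo database; column names may differ between Wikibase versions:

-- How far behind is each enabled client wiki, and does its row still carry a lock name?
SELECT
    chd_site,
    chd_seen,
    (SELECT MAX(change_id) FROM wb_changes) - chd_seen AS changes_behind,
    chd_touched,
    chd_lock
FROM wb_changes_dispatch
WHERE chd_disabled = 0
ORDER BY changes_behind DESC
LIMIT 10;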

See T178624 ("enwiki dispatch lag is 8 hours") for one (rather extreme) example of this.

I looked into this briefly and found T178652, which is possibly related to this, but certainly isn't the root cause.

This also happened yesterday with svwiki; on inspection, it seems a stale lock for that wiki was being held by the Redis LockManager. In order to rule out a coincidence here, I would like to look at another such example…

Event Timeline

Change 386591 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/puppet@production] Log Wikidata dispatchers on terbium

https://gerrit.wikimedia.org/r/386591

Change 386591 merged by Elukey:
[operations/puppet@production] mediawiki::maintenance::wikidata: log Wikidata dispatchers

https://gerrit.wikimedia.org/r/386591

This just happened again: zhwiki got backlogged by 8h+.

It was not locked at this time:

hoo@terbium:~$ mwscript eval.php --wiki wikidatawiki
> $lockManager = LockManagerGroup::singleton( wfWikiID() )->get( 'redisLockManager' );

> $clientLockName = str_replace( ' ', '_', "Wikibase.wikidatawiki.dispatchChanges.zhwiki" );

>  var_dump( $lockManager->lock( [ $clientLockName ] ) );
object(StatusValue)#355 (6) {
  ["ok":protected]=>
  bool(true)
  ["errors":protected]=>
  array(0) {
  }
  ["value"]=>
  NULL
  ["success"]=>
  array(0) {
  }
  ["successCount"]=>
  int(0)
  ["failCount"]=>
  int(0)
}

>  var_dump( $lockManager->unlock( [ $clientLockName ] ) );
object(StatusValue)#355 (6) {
  ["ok":protected]=>
  bool(true)
  ["errors":protected]=>
  array(0) {
  }
  ["value"]=>
  NULL
  ["success"]=>
  array(0) {
  }
  ["successCount"]=>
  int(0)
  ["failCount"]=>
  int(0)
}

There are actually a lot of changes related to zhwiki happening:

mysql:wikiadmin@db1070 [wikidatawiki]> SELECT COUNT(*) FROM (SELECT * FROM recentchanges WHERE rc_source = 'mw.edit' ORDER BY rc_timestamp DESC LIMIT 100000) AS innerQ INNER JOIN wb_changes_subscription ON rc_title = cs_entity_id AND rc_namespace = 0 AND cs_subscriber_id = 'zhwiki';
+----------+
| COUNT(*) |
+----------+
|    46145 |
+----------+
1 row in set (21.56 sec)

mysql:wikiadmin@db1070 [wikidatawiki]> SELECT COUNT(*) FROM (SELECT * FROM recentchanges WHERE rc_source = 'mw.edit' ORDER BY rc_timestamp DESC LIMIT 100000) AS innerQ INNER JOIN wb_changes_subscription ON rc_title = cs_entity_id AND rc_namespace = 0 AND cs_subscriber_id = 'enwiki';
+----------+
| COUNT(*) |
+----------+
|     9038 |
+----------+
1 row in set (1.75 sec)

Apparently there's a bot adding a lot of Welsh descriptions to Chinese villages, and that seems to be causing these problems. See https://www.wikidata.org/wiki/Special:Contributions/Danielt998.

Change 387114 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[mediawiki/extensions/Wikibase@master] Remove bogus "lock" condition from SqlChangeDispatchCoordinator query

https://gerrit.wikimedia.org/r/387114

The above change will allow us to increase our maximum throughput by (almost) a factor of 8, which should fix these issues, at least for now.
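
For illustration only, here is the kind of query involved; this is not the literal query (the actual candidate selection lives in SqlChangeDispatchCoordinator, and its exact conditions, ordering and batch size differ), but it shows why a lock condition in it is pointless now that locking happens via the Redis LockManager:

-- Rough "before" shape of the candidate selection (lock condition included):
--   SELECT chd_site, chd_db, chd_seen, chd_touched
--   FROM wb_changes_dispatch
--   WHERE chd_disabled = 0
--     AND chd_lock IS NULL        -- the "bogus" lock condition
--   ORDER BY chd_seen ASC
--   LIMIT 10;
--
-- Rough "after" shape: any enabled wiki is a candidate, since locking is
-- handled by the LockManager rather than by chd_lock:
SELECT chd_site, chd_db, chd_seen, chd_touched
FROM wb_changes_dispatch
WHERE chd_disabled = 0
ORDER BY chd_seen ASC
LIMIT 10;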

Change 387114 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Remove bogus "lock" condition from SqlChangeDispatchCoordinator query

https://gerrit.wikimedia.org/r/387114

hoo claimed this task.
hoo removed a project: Patch-For-Review.
hoo moved this task from Review to Done on the Wikidata-Former-Sprint-Board board.