Page MenuHomePhabricator

Wikidata dispatching stuck (not releasing lockmanager locks)
Closed, ResolvedPublic

Event Timeline

matej_suchanek triaged this task as Unbreak Now! priority.Jul 26 2018, 11:51 AM
Lydia_Pintscher lowered the priority of this task from Unbreak Now! to High.Jul 26 2018, 11:52 AM
Lydia_Pintscher raised the priority of this task from High to Unbreak Now!.
Lydia_Pintscher added a subscriber: Addshore.

Looking at the logs on mwmaint1001 I am seeing logs of failing to grab locks.

T178652 would probably have made this situation recover quicker.

Addshore lowered the priority of this task from Unbreak Now! to High.Jul 26 2018, 12:09 PM

No idea what caused these locks to get stuck.

It looks like https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch-script is also currently broken, probably due to T196609 and the dashboard not yet being updated.

It looks like most wikis have now recovered their locks, I'll continue monitoring:

image.png (272×1 px, 57 KB)

Addshore raised the priority of this task from High to Unbreak Now!.Jul 26 2018, 12:14 PM

Looks like the same pattern as can be seen in that graph is now happening again and all locks seem to be taken and there is a failure to grab locks for dispatching.
Dispatch lag is again increasing....

Something to note, because the locks are no longer in the DB, we end up selecting the same 15 or so wikis that are locked all of the time.
It could be that the other wikis actually don't have locks:

12:34:54 Could not lock any of the candidate client wikis for dispatching
12:34:54 Failed to grab dispatch lock for plwiki
12:34:54 Failed to grab dispatch lock for ukwiki
12:34:54 Failed to grab dispatch lock for ruwiki
12:34:54 Failed to grab dispatch lock for zhwiki
12:34:54 Failed to grab dispatch lock for rowiki
12:34:54 Failed to grab dispatch lock for itwiki
12:34:54 Failed to grab dispatch lock for euwiki
12:34:54 Failed to grab dispatch lock for arwiki
12:34:54 Failed to grab dispatch lock for svwiki
12:34:54 Failed to grab dispatch lock for commonswiki
12:34:54 Failed to grab dispatch lock for huwiki
12:34:54 Failed to grab dispatch lock for eswiki
12:34:54 Failed to grab dispatch lock for ptwiki
12:34:54 Failed to grab dispatch lock for uzwiki
12:34:54 Failed to grab dispatch lock for enwiki
12:34:54 Could not lock any of the candidate client wikis for dispatching
12:34:54 Failed to grab dispatch lock for ptwiki
12:34:54 Failed to grab dispatch lock for zhwiki
12:34:54 Failed to grab dispatch lock for enwiki
12:34:54 Failed to grab dispatch lock for arwiki
12:34:54 Failed to grab dispatch lock for uzwiki
12:34:54 Failed to grab dispatch lock for eswiki
12:34:54 Failed to grab dispatch lock for huwiki
12:34:54 Failed to grab dispatch lock for itwiki
12:34:54 Failed to grab dispatch lock for euwiki
12:34:54 Failed to grab dispatch lock for commonswiki
12:34:54 Failed to grab dispatch lock for plwiki
12:34:54 Failed to grab dispatch lock for ruwiki
12:34:54 Failed to grab dispatch lock for rowiki
12:34:54 Failed to grab dispatch lock for ukwiki
12:34:54 Failed to grab dispatch lock for svwiki
12:34:54 Could not lock any of the candidate client wikis for dispatching

before using the redis lock manager the status of the lock from the db was also in the select so that locked dbs would not be selected at all.

Mentioned in SAL (#wikimedia-operations) [2018-07-26T12:38:03Z] <reedy@deploy1001> rebuilt and synchronized wikiversions files: wikidatawiki back to .13 T200420

Change 448034 had a related patch set uploaded (by Addshore; owner: Addshore):
[operations/mediawiki-config@master] Wikidata dispatch, select 20 wikis instead of 15

https://gerrit.wikimedia.org/r/448034

Change 448034 merged by jenkins-bot:
[operations/mediawiki-config@master] Wikidata dispatch, select 20 wikis instead of 15

https://gerrit.wikimedia.org/r/448034

Mentioned in SAL (#wikimedia-operations) [2018-07-26T12:44:01Z] <addshore@deploy1001> Synchronized wmf-config/Wikibase.php: T200420 - Wikidata, dispatch, select 20 instead of 15 wikis (duration: 00m 55s)

Change 448037 had a related patch set uploaded (by Addshore; owner: Addshore):
[operations/puppet@production] Verbose logging for wikidata dispatching

https://gerrit.wikimedia.org/r/448037

Change 448037 merged by Alexandros Kosiaris:
[operations/puppet@production] Verbose logging for wikidata dispatching

https://gerrit.wikimedia.org/r/448037

Mentioned in SAL (#wikimedia-operations) [2018-07-26T13:44:00Z] <addshore@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Use new wikibase dispatch lock manager on wikidatawiki T200420 T178652 (duration: 00m 55s)

Change 448042 had a related patch set uploaded (by Addshore; owner: Addshore):
[operations/mediawiki-config@master] Add a logger to wikibaseDispatchRedisLockManager

https://gerrit.wikimedia.org/r/448042

Change 448043 had a related patch set uploaded (by Addshore; owner: Addshore):
[operations/mediawiki-config@master] Add wikibaseDispatchRedisLockManager to wmgMonologChannels

https://gerrit.wikimedia.org/r/448043

Change 448042 merged by jenkins-bot:
[operations/mediawiki-config@master] Add a logger to wikibaseDispatchRedisLockManager

https://gerrit.wikimedia.org/r/448042

Mentioned in SAL (#wikimedia-operations) [2018-07-26T14:34:37Z] <addshore@deploy1001> sync-file aborted: Add a logger to wikibaseDispatchRedisLockManager T200420 (duration: 00m 02s)

Mentioned in SAL (#wikimedia-operations) [2018-07-26T14:35:52Z] <addshore@deploy1001> Synchronized wmf-config/Wikibase.php: Add a logger to wikibaseDispatchRedisLockManager T200420 (duration: 00m 56s)

Change 448043 merged by jenkins-bot:
[operations/mediawiki-config@master] Add wikibaseDispatchRedisLockManager to wmgMonologChannels

https://gerrit.wikimedia.org/r/448043

Mentioned in SAL (#wikimedia-operations) [2018-07-26T14:38:52Z] <addshore@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Add wikibaseDispatchRedisLockManager to wmgMonologChannels T200420 (duration: 00m 54s)

I deployed T178652 which should hide some of the brokenness, as the locks should now only remain for 15 mins instead of an hour, but the underlying issue still remains.
I added a logger to the LockManager used for dispatching, but that hasn't logged anything yet...
Currently staring at RedisLockManager and QuorumLockManager.

While testing some stuff with eval.php if I run the following:

$lock = str_replace( ' ', '_',  "Wikibase.wikidatawiki.dispatchChanges.dewiki" );
$result = $lockManager->unlock( [ $lock ] );
var_dump( $result->__toString() );
$result = $lockManager->lock( [ $lock ] );
var_dump( $result->__toString() );

I get:

> $lock = str_replace( ' ', '_',  "Wikibase.wikidatawiki.dispatchChanges.dewiki" );

> $result = $lockManager->unlock( [ $lock ] );

> var_dump( $result->__toString() );
string(292) "<OK, collected 1 error(s) on the way, no value set>
+------+---------------------------+------------------------------------------+
|    1 | lockmanager-notlocked     | Wikibase.wikidatawiki.dispatchChanges.de |
+------+---------------------------+------------------------------------------+
"

> $result = $lockManager->lock( [ $lock ] );

> var_dump( $result->__toString() );
string(295) "<Error, collected 1 error(s) on the way, no value set>
+------+---------------------------+------------------------------------------+
|    1 | lockmanager-fail-acquirel | Wikibase.wikidatawiki.dispatchChanges.de |
+------+---------------------------+------------------------------------------+

Which seems odd, as I can neither unlock or lock the key.
If I try another key, for example "Wikibase.wikidatawiki.dispatchChanges.addshoretest" locking and unlocking works as expected.

Mentioned in SAL (#wikimedia-operations) [2018-07-26T15:21:32Z] <addshore@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Add LockManager logging T200420 (duration: 00m 55s)

Mentioned in SAL (#wikimedia-operations) [2018-07-26T15:27:20Z] <addshore@deploy1001> Synchronized wmf-config: Remove unused wikibaseDispatchRedisLockManager logging T200420 (duration: 00m 56s)

Un assigning myself for now as I'm not actively looking at this right now.
Due to the 15 min lock time out the dispatch lag is staying reasonably low, but the underlying issue is still not really known and not fixed.

Something to note, because the locks are no longer in the DB, we end up selecting the same 15 or so wikis that are locked all of the time.
It could be that the other wikis actually don't have locks:

before using the redis lock manager the status of the lock from the db was also in the select so that locked dbs would not be selected at all.

You mean the condition on chd_touched in getCandidateClients()? That would roughly correlate with lock state.

It's odd that engageClientLock() is blocking for the SQL coordinator but not the LockManager-using subclass (Database::lock is 5s by default). That would make the LM-based selectClient() more likely to not find anything and return false (if there are a few busy wikis), as it only does one pass with no usleep(). Since dispatchMaxTime is not PHP_INT_MAX (but 720) and I don't see --max-passes in puppet, then max-passes for dispatchChanges.php runs is 1, right? Thus, once selectClient() returns false, then dispatchChanges.php would exit afaik. The loop/delay (idle-delay) logic would not happen in such cases.

Something to note, because the locks are no longer in the DB, we end up selecting the same 15 or so wikis that are locked all of the time.
It could be that the other wikis actually don't have locks:

before using the redis lock manager the status of the lock from the db was also in the select so that locked dbs would not be selected at all.

You mean the condition on chd_touched in getCandidateClients()? That would roughly correlate with lock state.

It's odd that engageClientLock() is blocking for the SQL coordinator but not the LockManager-using subclass (Database::lock is 5s by default). That would make the LM-based selectClient() more likely to not find anything and return false (if there are a few busy wikis),

There should only really ever be a number of wikis busy equal to the number of dispatchers being run, so ~4 or 5?

as it only does one pass with no usleep(). Since dispatchMaxTime is not PHP_INT_MAX (but 720) and I don't see --max-passes in puppet, then max-passes for dispatchChanges.php runs is 1, right?

Some of the options are now in mediawiki-config, see https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/Wikibase.php#L131-L134, so a maxtime is set and max passes will be infinate.

Ah, right, I read that ternary backwards, <<$maxTime < PHP_INT_MAX ? PHP_INT_MAX : 1>>.

So, it looks like when lockManager->unlock is called in LockManagerSqlChangeDispatchCoordinator chd_lock is used, which is null, so no lock is removed in redis....
I still have no idea how this triggered at 11am today, it looks like it will be due to .13 -> .14, but rolling that back didn't fix the problem.........

@aaron has made a patch to make Lockmanager throw when trying to unlock a null lock @ https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/448080/

The issue in Wikibase should be fixed first though to avoid dispatching throwing exceptions and dieing & spamming logs.

Change 448103 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@master] Use getClientLockName value for releaseClientLock when dispatching

https://gerrit.wikimedia.org/r/448103

Change 448113 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@wmf/1.32.0-wmf.14] Use getClientLockName value for releaseClientLock when dispatching

https://gerrit.wikimedia.org/r/448113

The fix in the patch above has been on mwmaint1001 for around 15 mins now (I hacked it on there), and checked that the correct locks were being removed by name.
Dispatch seems to have recovered, now just need to back port it, and then we can investigate what actually triggered this in the first place.....

Addshore renamed this task from Wikidata dispatching stuck to Wikidata dispatching stuck (not releasing lockmanager locks).Jul 26 2018, 7:43 PM
Addshore moved this task from Incoming to Peer Review on the Wikidata-Campsite board.
Addshore moved this task from Unsorted 💣 to Active 🚁 on the User-Addshore board.

Change 448113 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@wmf/1.32.0-wmf.14] Use getClientLockName value for releaseClientLock when dispatching

https://gerrit.wikimedia.org/r/448113

Mentioned in SAL (#wikimedia-operations) [2018-07-26T20:39:50Z] <addshore@deploy1001> Synchronized php-1.32.0-wmf.14/extensions/Wikibase/repo/includes/Store/Sql/SqlChangeDispatchCoordinator.php: Use getClientLockName value for releaseClientLock when dispatching T200420 (duration: 00m 57s)

Looks like after battling this all day service has been restored.... although still slightly confused as to how this happened...

image.png (278×1 px, 64 KB)

Change 448103 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Use getClientLockName value for releaseClientLock when dispatching

https://gerrit.wikimedia.org/r/448103