
Wikidata dispatchers should use a LockManager with a short TTL
Closed, Resolved, Public

Description

Currently the Wikidata dispatchers use the redisLockManager defined in mw-config wmf-config/filebackend.php. This lock manager has the (default) lockTTL of 1 hour in CLI mode, which might be appropriate for some usages, but it's not for ours: We can't have the dispatchers stuck for so long in case something goes wrong.

I suggest configuring a new (redis based) lock manager, either specifically for this or generally for "short lived tasks", with a lockTTL of maybe 5m.

(This may or may not be the reason behind yesterday's dispatch trouble, I can't tell… although given it lagged behind for such a long time already, it might not).

The lock manager could be defined as:

$wgLockManagers[] = [
        'name'         => 'OUR-redisLockManager',
        'lockTTL'      => 300, // 5m (value is in seconds)
        'class'        => 'RedisLockManager',
        'lockServers'  => $wmfMasterServices['redis_lock'],
        'srvsByBucket' => [
                0 => $redisLockServers
        ],
        'redisConfig'  => [
                'connectTimeout' => 2,
                'readTimeout'    => 2,
                'password'       => $wmgRedisPassword
        ]
];

Event Timeline

Thanks for filing this @hoo, I was going to write something up about this today!

While trying to get rid of some of the large backlog of changes to dispatch on Thursday, I ran a copy of the dispatch script on terbium. I then realised that I had passed incorrect params (I wanted to change them), so I Ctrl + C'd out of the script.
As far as I was able to tell, this left any open locks from that run of the script in place.

You can see this @ https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch?orgId=1&from=1508398594143&to=1508458610316.
@ roughly 21:35 I killed some runs of the script and they left locks open; even the freshest lag starts to rise at this point.
@ roughly 22:35 the lock TTL expires and the scripts start running again.
It also took me roughly the whole hour to figure out that locks were the issue and how to manually remove them through eval.php, which is why I couldn't fix my screw-up sooner.

Dispatch changes currently runs with a max time of 540 seconds (9 mins).
We should check whether it is possible to set the lock TTL in the script run; then we could simply use the --max-time parameter.
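
A minimal sketch of that idea, assuming the TTL is simply the script's --max-time plus a safety margin. The helper name lockTtlForMaxTime and the 60-second margin are hypothetical and not taken from the dispatch script; the only figure from this task is the 540-second max time.

```php
<?php
/**
 * Hypothetical helper: derive a lock TTL (in seconds) from the script's
 * --max-time value, so a lock cannot outlive a killed run by much.
 * The margin is an assumed value, not something the dispatch script defines.
 */
function lockTtlForMaxTime( int $maxTime, int $margin = 60 ): int {
	if ( $maxTime <= 0 || $margin < 0 ) {
		throw new InvalidArgumentException( 'maxTime must be > 0 and margin >= 0' );
	}
	// The lock must last at least as long as the longest allowed run.
	return $maxTime + $margin;
}

echo lockTtlForMaxTime( 540 ), "\n"; // 540 + 60 = 600 seconds (10 minutes)
```

Whatever margin is chosen, the TTL should not be shorter than the max run time, otherwise a still-running script could lose its lock.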

@hoo Sounds like we should just go ahead and do this! Don't forget to change dispatchingLockManager accordingly.
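
For reference, pointing the dispatcher at a new lock manager might look like the line below. This is only a sketch: dispatchingLockManager is the setting named above, it is assumed here to live in $wgWBRepoSettings, and 'OUR-redisLockManager' is the placeholder name from the task description.

```php
<?php
// Assumption: the value must match the 'name' key of the corresponding
// $wgLockManagers entry. 'OUR-redisLockManager' is a placeholder name.
$wgWBRepoSettings['dispatchingLockManager'] = 'OUR-redisLockManager';
```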

Change 395967 had a related patch set uploaded (by Addshore; owner: Addshore):
[operations/mediawiki-config@master] Create a LockManager for WikidataDispatch with short TTL

https://gerrit.wikimedia.org/r/395967

Change 395969 had a related patch set uploaded (by Addshore; owner: Addshore):
[operations/mediawiki-config@master] Use new wikibase dispatch lock manager on wikidatawiki

https://gerrit.wikimedia.org/r/395969

I'm going to first make a change in the dispatch script to check a mediawiki config var to see if it should actually run or not and backport this first.

This will mean that I can:

  • stop dispatching
  • Change the lock manager over
  • enable dispatching

Without trying to coordinate this with changes in operations-puppet.

This is needed because otherwise locks will be missed: old running scripts will be using the old lock manager, and the new ones will get the new lock manager, which holds no locks.
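
A rough sketch of such a guard, assuming a boolean config flag. The key dispatchingEnabled and both function names are hypothetical stand-ins, not the actual variable or code used in the dispatch script.

```php
<?php
/**
 * Hypothetical check: should this run of the dispatch script do any work?
 * 'dispatchingEnabled' is an assumed key, not the real config variable.
 */
function shouldDispatch( array $settings ): bool {
	// Default to enabled so that an unset flag does not stop dispatching.
	return (bool)( $settings['dispatchingEnabled'] ?? true );
}

function runDispatcher( array $settings ): string {
	if ( !shouldDispatch( $settings ) ) {
		// Return before touching any lock manager, so no locks are taken.
		return 'skipped';
	}
	// ... the real script would acquire locks and dispatch changes here ...
	return 'dispatched';
}

echo runDispatcher( [ 'dispatchingEnabled' => false ] ), "\n"; // prints "skipped"
```

With a flag like this, new runs exit early while already-running scripts finish under the old lock manager, after which the lock manager can be switched and the flag flipped back.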

Addshore triaged this task as Medium priority. (Jan 23 2018, 7:44 PM)

I'm going to first make a change in the dispatch script to check a mediawiki config var to see if it should actually run or not and backport this first.

This has been done (and deployed) by now, so I guess we can move on here whenever we want.

Yup.

Ideally we should schedule downtime / silence the alerts in icinga when we do this.
This was blocked by T195289

I already had some patches prepared for testing this on testwikidata first.
These can be found @ https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/mediawiki-config+branch:master+topic:wd-dispatch-lockmanager

Going to unassign myself for now as @hoo seems to also be looking here!

Change 395967 merged by jenkins-bot:
[operations/mediawiki-config@master] Wikidata dispatch, Use a LockManager with short TTL for testwikidata

https://gerrit.wikimedia.org/r/395967

Mentioned in SAL (#wikimedia-operations) [2018-07-12T18:32:05Z] <niharika29@deploy1001> Synchronized wmf-config/: Wikidata dispatch, Use a LockManager with short TTL for testwikidata T178652 (duration: 00m 51s)

This is all done for testwikidatawiki now (thanks @Niharika) and the new lock manager works as expected.

I'll follow up and do this for real wikidata in a few days or so (maybe at wikimania)

Some steps that should work and keep this nice and hidden from the users would be:

  • Prep
    • Turn off the alarm for dispatch lag for wikidata
    • Change the maxlag factor for dispatching (we know we are going to make the maxlag go up, so let's not block people...)
  • Stop dispatching
  • Make the config change
  • Start dispatching
    • Revert the patch stopping new script runs
    • Wait for a new script run and verify it is functioning properly
  • Once dispatch is at normal levels
    • Revert the change to maxlag dispatch factor
    • Turn back on the dispatch alarm

Change 395969 merged by jenkins-bot:
[operations/mediawiki-config@master] Use new wikibase dispatch lock manager on wikidatawiki

https://gerrit.wikimedia.org/r/395969

Mentioned in SAL (#wikimedia-operations) [2018-07-26T13:44:00Z] <addshore@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Use new wikibase dispatch lock manager on wikidatawiki T200420 T178652 (duration: 00m 55s)

Now using a lock manager with a timeout of 15 mins.

It looks like today we can see how useful this is.
A lock for a wiki somehow got stuck, but rather than taking 2 hours to recover, the recovery only took 15 mins.

image.png (510×1 px, 170 KB)