Implement ChangeDispatchCoordinator based on RedisLockManager
Closed, ResolvedPublic

Description

Our current implementation of ChangeDispatchCoordinator relies on global MySQL locks, which hogs connections to the master DB, and may lose locks when these connections are reset by a watchdog.

@aaron suggested to use Redis based locks instead.

When implementing ChangeDispatchCoordinator based on RedisLockManager, please take care to closely replicate the semantics of the existing SqlChangeDispatchCoordinator.

daniel created this task.Nov 30 2016, 11:25 AM
Restricted Application added a project: User-Ladsgroup. · View Herald TranscriptNov 30 2016, 11:25 AM
Ladsgroup moved this task from Proposed to Doing on the Wikidata-Sprint board.Dec 2 2016, 4:45 PM
Ladsgroup moved this task from Incoming to In progress on the User-Ladsgroup board.Dec 5 2016, 8:39 PM

Change 325640 had a related patch set uploaded (by Ladsgroup):
[very WIP] [Draft] [DNM] [I can't emphasize enough] Add RedisLockSqlChangeDispatchCoordinator

https://gerrit.wikimedia.org/r/325640

@aaron: Hey, I'm working on this but it seems that RedisLockManager (and LockManagers in general) doesn't have lock lookup or I couldn't find it. Can you clarify if lock managers have something equivalent to Database::lockIsFree ?

aaron added a comment.Dec 7 2016, 3:26 PM

There is no analogous method. Maybe a non-blocking engageClientLock() call can replace the isClientLockUsed() call? Seems like chd_lock is used to determine whether to do a non-blocking check first before a blocking acquisition (which could race anyway, which I guess just effects runtime?).

daniel added a comment.Dec 7 2016, 4:18 PM

Without looking at the code, it seems we should be able to do without Database::lockIsFree. It's nice to do a check first to avoid a more expensive attempt to acquire a lock if we already know that that is likely to fail, but in the end, all we need is an atomic way to try to grab a lock, and then know if we got it.

One thing that is important in this context is how stale locks are handled. Does redis know if the owner of a lock has died? Or an orphaned lock stay around forever? Do locks time out?

aaron added a comment.Dec 7 2016, 5:11 PM

If the owner fatals, the lock will have to expire (the TTL depends on the LockManager instance config and/or whether the context is CLI or web).

Ladsgroup moved this task from Doing to Review on the Wikidata-Sprint board.Dec 8 2016, 10:40 PM
daniel moved this task from Inbox to Push on the User-Daniel board.Jan 5 2017, 4:12 PM
thiemowmde moved this task from Review to Doing on the Wikidata-Sprint board.Jan 16 2017, 11:10 AM
Ladsgroup moved this task from Doing to Review on the Wikidata-Sprint board.Feb 13 2017, 8:10 PM

Change 325640 merged by jenkins-bot:
[mediawiki/extensions/Wikibase] Add LockManagerSqlChangeDispatchCoordinator

https://gerrit.wikimedia.org/r/325640

Ladsgroup moved this task from Review to Done on the Wikidata-Sprint board.
Ladsgroup closed this task as Resolved.Mar 7 2017, 1:52 PM
daniel reopened this task as Open.Mar 23 2017, 8:42 PM

Reopening until this is tested and confirmed.

Change 344750 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Document dispatchingLockManager option in $wgWBRepoSettings

https://gerrit.wikimedia.org/r/344750

It just occurred to me an extra reason to avoid using a db master- master failover is a relative frequent operation: it will happen every time the master mysql is upgraded, or when there is a datacenter failover (2 of those will happen on April/May)- probably there wasn't any safe guard to avoid issues when that happened.

hoo closed this task as Resolved.Apr 9 2017, 5:48 PM

Closing this particular task (which only holds some of the things we need to do in order to actually improve the situation considerably).