Page MenuHomePhabricator

Implement ChangeDispatchCoordinator based on RedisLockManager
Closed, ResolvedPublic

Description

Our current implementation of ChangeDispatchCoordinator relies on global MySQL locks, which hogs connections to the master DB, and may lose locks when these connections are reset by a watchdog.

@aaron suggested to use Redis based locks instead.

When implementing ChangeDispatchCoordinator based on RedisLockManager, please take care to closely replicate the semantics of the existing SqlChangeDispatchCoordinator.

Event Timeline

Change 325640 had a related patch set uploaded (by Ladsgroup):
[very WIP] [Draft] [DNM] [I can't emphasize enough] Add RedisLockSqlChangeDispatchCoordinator

https://gerrit.wikimedia.org/r/325640

@aaron: Hey, I'm working on this but it seems that RedisLockManager (and LockManagers in general) doesn't have lock lookup or I couldn't find it. Can you clarify if lock managers have something equivalent to Database::lockIsFree ?

There is no analogous method. Maybe a non-blocking engageClientLock() call can replace the isClientLockUsed() call? Seems like chd_lock is used to determine whether to do a non-blocking check first before a blocking acquisition (which could race anyway, which I guess just effects runtime?).

Without looking at the code, it seems we should be able to do without Database::lockIsFree. It's nice to do a check first to avoid a more expensive attempt to acquire a lock if we already know that that is likely to fail, but in the end, all we need is an atomic way to try to grab a lock, and then know if we got it.

One thing that is important in this context is how stale locks are handled. Does redis know if the owner of a lock has died? Or an orphaned lock stay around forever? Do locks time out?

If the owner fatals, the lock will have to expire (the TTL depends on the LockManager instance config and/or whether the context is CLI or web).

Change 325640 merged by jenkins-bot:
[mediawiki/extensions/Wikibase] Add LockManagerSqlChangeDispatchCoordinator

https://gerrit.wikimedia.org/r/325640

Reopening until this is tested and confirmed.

Change 344750 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Document dispatchingLockManager option in $wgWBRepoSettings

https://gerrit.wikimedia.org/r/344750

It just occurred to me an extra reason to avoid using a db master- master failover is a relative frequent operation: it will happen every time the master mysql is upgraded, or when there is a datacenter failover (2 of those will happen on April/May)- probably there wasn't any safe guard to avoid issues when that happened.

Closing this particular task (which only holds some of the things we need to do in order to actually improve the situation considerably).