When dispatching changes from test.wikidata.org to test.wikipedia.org, use redis-based locking.
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T108944 [Epic] Improve change dispatching
Resolved | | Ladsgroup | T151681 DispatchChanges: Avoid long-lasting connections to the master DB
Resolved | | Ladsgroup | T159826 Use redis-based lock manager in dispatch changes in production
Resolved | | hoo | T159828 Use redis-based lock manager for dispatchChanges on test.wikidata.org
Event Timeline
Sorry about this: you are not the only ones suffering from beta not being a reliable place for testing in a truly distributed fashion; we were just discussing this on IRC. I also support a test on test, and offer my help if I can provide it. Thanks again for working on this.
Change 345387 had a related patch set uploaded (by Daniel Kinzler):
[operations/mediawiki-config@master] Try using redisLockManager for test.wikidata.org
@hoo Can you confirm that we are actually running multiple instances of dispatchChanges.php for test.wikidata.org? Which clients does it dispatch to?
We are starting one instance with --max-time 900 every 15 minutes, so most of the time there should only be one instance running (for brief moments two instances may overlap, given that max-time is not strictly enforced).
It dispatches to test2wiki, testwiki and testwikidatawiki.
@hoo @aude Are you ok with merging/deploying the config patch?
I'd like to test this as follows:
- stop the dispatch script for a while.
- do lots of edits
- run a bunch of dispatcher instances in parallel
- make sure there are no errors
- somehow (log file?) check that all changes got dispatched exactly once
- return to the normal dispatch schedule, but keep using the redis lock
- after a while, again check that all changes got dispatched exactly once
Does that sound good? How painful is the log/compare stuff, do you think?
Change 346540 had a related patch set uploaded (by Hoo man):
[operations/mediawiki-config@master] Temporarily enable change dispatch logging on testwikidata
Change 346545 had a related patch set uploaded (by Hoo man):
[operations/puppet@production] Temporarily disable the change dispatch cron for testwikidata
Change 346545 merged by Jcrespo:
[operations/puppet@production] Temporarily disable the change dispatch cron for testwikidata
Change 346540 merged by jenkins-bot:
[operations/mediawiki-config@master] Temporarily enable change dispatch logging on testwikidata
Change 345387 merged by jenkins-bot:
[operations/mediawiki-config@master] Try using redisLockManager for test.wikidata.org
Mentioned in SAL (#wikimedia-operations) [2017-04-06T16:17:19Z] <hoo@tin> Synchronized wmf-config/Wikibase-production.php: Try using redisLockManager for test.wikidata.org (T159828) (duration: 00m 39s)
We deployed this yesterday and I can confirm that it works correctly.
I tested this by stopping the regular dispatcher, enabling very verbose logging and then starting 3 dispatchers with --batch-size 1 --dispatch-interval 0 (making them very fast and making them "race" for changes). Afterwards I dispatched several changes from testwikidata to both testwiki and test2wiki and confirmed that they properly arrived there in order.
The dispatchers ran into locks every now and then, so I assume the locking is effective.
Thank you @hoo!
One thing that still worries me is that we don't have a mechanism for stale locks. If a dispatcher dies while holding a lock, that lock will stay until it times out (after 30 minutes, I think). So no changes will be dispatched to the respective client until the lock is gone. I currently see no good way to get around this, though. We could put information (host name and process id) about the lock holder into redis (or into the changes_dispatch table), and then actively check if the process is alive... but I don't know an easy way to do this without setting up another service.
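A common way to handle stale locks without setting up another service is to give every lock a TTL and a per-holder token: the TTL frees the lock if the holder dies, and the token ensures a holder cannot release a lock it has already lost. As a minimal sketch of that pattern (not the actual Wikibase LockManager code; the class and key names are illustrative, and an in-memory dictionary stands in for Redis, where this would map to `SET key token NX EX ttl` plus a token-checked delete):

```python
import time
import uuid

class TtlLockStore:
    """In-memory stand-in for the Redis commands used for TTL locks:
    SET key value NX EX ttl, and a token-checked DELETE on release."""

    def __init__(self):
        self._data = {}  # key -> (token, expires_at)

    def acquire(self, key, ttl_seconds):
        """Like SET key token NX EX ttl: returns a token on success,
        None if the lock is still held and not yet expired."""
        now = time.monotonic()
        holder = self._data.get(key)
        if holder is not None and holder[1] > now:
            return None  # still held by a live (or recently live) holder
        token = uuid.uuid4().hex
        self._data[key] = (token, now + ttl_seconds)
        return token

    def release(self, key, token):
        """Delete only if we still own the lock (compare-and-delete)."""
        holder = self._data.get(key)
        if holder is not None and holder[0] == token:
            del self._data[key]
            return True
        return False  # lock expired or was taken over: caller lost it

store = TtlLockStore()
t1 = store.acquire("dispatch-testwiki", ttl_seconds=120)
assert t1 is not None                                    # first dispatcher gets the lock
assert store.acquire("dispatch-testwiki", 120) is None   # second dispatcher is blocked

# If the holder dies, the TTL eventually frees the lock for others;
# here we simulate expiry with a zero TTL instead of sleeping.
store.acquire("dispatch-other", ttl_seconds=0)
assert store.acquire("dispatch-other", 120) is not None
```

The token check on release is what makes this safe against a dispatcher that stalls past its TTL and then wakes up: it can no longer delete a lock that a second dispatcher has since acquired.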
We can set a timeout for this; LockManager::lock allows that (as its third argument). I don't know what a decent value would be here, maybe 2 or 3 minutes?
As a small side note: that can also happen on MySQL. Despite locks being released on session disconnection, there have been occasions where the MySQL session is not killed (it continues) even though the thread on the MediaWiki side has died. There are several known bugs about that, such as T97192, where a connector with issues doesn't terminate the MySQL session cleanly. If there is a timeout that works now, that is better than what we have now :-)
The timeout must be longer than the maximum time that the lock could legitimately be held. Otherwise it will time out prematurely, while dispatching is still in progress. I'm not sure we can detect premature timeout of the lock. If we can, I'd feel safer experimenting with low values. If we can't detect this, we have to be very conservative.
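One way to detect a premature timeout, and thus make low timeout values safer, would be for the dispatcher to renew its lock periodically while working: if a renewal fails because its token no longer matches, the lock has expired out from under it and the current pass should abort rather than keep dispatching without the lock. A sketch of that renew step, again with an in-memory stand-in for Redis (all names here are hypothetical, not the actual LockManager API):

```python
import time
import uuid

# key -> (token, expires_at); stands in for Redis keys with per-key TTLs
locks = {}

def acquire(key, ttl):
    """Take the lock if it is free or its TTL has expired."""
    now = time.monotonic()
    held = locks.get(key)
    if held and held[1] > now:
        return None
    token = uuid.uuid4().hex
    locks[key] = (token, now + ttl)
    return token

def renew(key, token, ttl):
    """Extend the TTL only if we still hold the lock. A False result
    means the lock expired prematurely (another dispatcher may have
    taken over); the safe reaction is to abort the current pass."""
    now = time.monotonic()
    held = locks.get(key)
    if held and held[0] == token and held[1] > now:
        locks[key] = (token, now + ttl)
        return True
    return False

token = acquire("dispatch-test2wiki", ttl=120)
assert renew("dispatch-test2wiki", token, ttl=120)      # still ours: extended
assert not renew("dispatch-test2wiki", "stale", 120)    # wrong token: lost
```

With renewal in place, the TTL only needs to cover the interval between renewals (plus some slack), not the whole dispatch pass, so values well under the 2-minute estimate become feasible.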
The highest times I can currently see are 11s (and even that surprises me… something must be really odd there). So going for 10 times that (2 minutes) should be a rather safe bet.
If we switch towards waiting for the replicas while holding the lock, we can also take the current replication lag into account.
Change 346788 had a related patch set uploaded (by Hashar; owner: Hoo man):
[operations/mediawiki-config@master] Revert "Temporarily enable change dispatch logging on testwikidata"
Change 346788 merged by jenkins-bot:
[operations/mediawiki-config@master] Revert "Temporarily enable change dispatch logging on testwikidata"
Mentioned in SAL (#wikimedia-operations) [2017-04-12T13:08:30Z] <hashar@tin> Synchronized wmf-config/InitialiseSettings.php: Revert "Temporarily enable change dispatch logging on testwikidata" - T159828 (duration: 00m 47s)