When dispatching changes from test.wikidata.org to test.wikipedia.org, use redis-based locking.
|Open||None||T108944 [Epic] Improve change dispatching|
|Resolved||Ladsgroup||T151681 DispatchChanges: Avoid long-lasting connections to the master DB|
|Resolved||Ladsgroup||T159826 Use redis-based lock manager in dispatch changes in production|
|Resolved||hoo||T159828 Use redis-based lock manager for dispatchChanges on test.wikidata.org|
Sorry about this. You are not the only "sufferers" of beta not being a reliable place for testing in a truly distributed fashion; we were just discussing this on IRC. I also support a test on test, and offer my help if I can provide it. Thanks again for working on this.
We are starting one instance with --max-time 900 every 15 minutes, so most of the time there should only be one instance running (two instances may briefly overlap, given that max-time is not that strictly enforced).
It dispatches to test2wiki, testwiki and testwikidatawiki.
I'd like to test this as follows:
- stop the dispatch script for a while.
- do lots of edits
- run a bunch of dispatcher instances in parallel
- make sure there are no errors
- somehow (log file?) check that all changes got dispatched exactly once
- return to the normal dispatch schedule, but keep using the redis lock
- after a while, again check that all changes got dispatched exactly once
Does that sound good? How painful is the log/compare stuff, do you think?
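For the log/compare step, a minimal sketch of the "exactly once" check could look like the following. It assumes the dispatcher logs have already been parsed into (client wiki, change id) pairs; that parsing step, the function name and the sample data are all hypothetical and not taken from the actual dispatcher output.

```python
from collections import Counter


def find_dispatch_problems(change_ids, clients, dispatched):
    """Return (missing, duplicated) sets of (client, change_id) pairs.

    change_ids: ids of the changes made during the test
    clients:    client wikis that should receive every change
    dispatched: (client, change_id) pairs parsed from the logs (assumed format)
    """
    counts = Counter(dispatched)
    expected = {(client, cid) for client in clients for cid in change_ids}
    # A pair that never shows up in the log was not dispatched at all.
    missing = {pair for pair in expected if counts[pair] == 0}
    # A pair that shows up more than once was dispatched repeatedly.
    duplicated = {pair for pair, n in counts.items() if n > 1}
    return missing, duplicated


if __name__ == "__main__":
    # Made-up sample: change 2 reached testwiki twice and test2wiki never.
    log = [("testwiki", 1), ("test2wiki", 1), ("testwiki", 2), ("testwiki", 2)]
    missing, duplicated = find_dispatch_problems([1, 2], ["testwiki", "test2wiki"], log)
    print("missing:", sorted(missing))
    print("duplicated:", sorted(duplicated))
```

Both sets being empty would mean every change reached every client exactly once.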
We deployed this yesterday and I can confirm it to work correctly.
I tested this by stopping the regular dispatcher, enabling very verbose logging and then starting 3 dispatchers with --batch-size 1 --dispatch-interval 0 (making them very fast and "race" for changes). Afterwards I dispatched several changes from testwikidata to both testwiki and test2wiki and confirmed that they arrived there properly and in order.
The dispatchers ran into locks every now and then, so I assume the locking to be effective.
Thank you @hoo!
One thing that still worries me is that we don't have a mechanism for stale locks. If a dispatcher dies while holding a lock, that lock will stay until it times out (after 30 minutes, I think). So no changes will be dispatched to the respective client until the lock is gone. I currently see no good way to get around this, though. We could put information (host name and process id) about the lock holder into redis (or into the changes_dispatch table), and then actively check if the process is alive... but I don't know an easy way to do this without setting up another service.
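To illustrate the idea of storing holder information next to the lock, here is a minimal sketch. It uses an in-memory dictionary as a stand-in for redis (or the changes_dispatch table), and the class and method names are hypothetical. It only shows the record layout and the expiry behaviour; it does not solve the problem of checking whether a process on another host is still alive.

```python
import os
import socket
import time


class ExpiringLockStore:
    """In-memory stand-in for a key-value store with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._locks = {}  # key -> (holder_info, expires_at)

    def acquire(self, key, ttl_seconds):
        """Take the lock if it is free or expired; return holder info or None."""
        now = self._clock()
        current = self._locks.get(key)
        if current is not None and current[1] > now:
            return None  # still held and not yet timed out
        # Record who holds the lock so a stale lock can at least be diagnosed.
        holder = {"host": socket.gethostname(), "pid": os.getpid()}
        self._locks[key] = (holder, now + ttl_seconds)
        return holder

    def holder(self, key):
        """Return the current holder info, or None if free or expired."""
        current = self._locks.get(key)
        if current is None or current[1] <= self._clock():
            return None
        return current[0]
```

With a record like this, whoever finds a client blocked can at least see which host and process took the lock, even if the lock itself only goes away once the timeout passes.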
As a small side note: that can also happen on mysql. Despite locks being released on session disconnection, there have been some occasions where the mysql session is not killed (it continues), but the thread on mediawiki has been. There are several known bugs about that, like T97192, where a connector with issues doesn't terminate the MySQL session cleanly. If there is a timeout that works now, that is better than what we have now :-)
The timeout must be longer than the maximum time that the lock could be legitimately held. Otherwise it will time out prematurely, while dispatching is still in progress. I'm not sure we can detect premature timeout of the lock. If we can, I'd feel safer in experimenting with low values. If we can't detect this, we have to be very conservative.
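One common way to detect a premature timeout is to store a unique token with the lock and compare it on release. The sketch below assumes that approach; it is not how the dispatcher or its lock manager currently works, and the plain dictionary and function names are hypothetical stand-ins for the real lock storage.

```python
import uuid


def acquire(locks, key, ttl_seconds, now):
    """Take the lock if free or expired; return a unique token, else None."""
    held = locks.get(key)
    if held is not None and held["expires_at"] > now:
        return None
    token = uuid.uuid4().hex
    locks[key] = {"token": token, "expires_at": now + ttl_seconds}
    return token


def release(locks, key, token, now):
    """Release the lock; return False if it expired or changed hands first."""
    held = locks.get(key)
    still_ours = (
        held is not None and held["token"] == token and held["expires_at"] > now
    )
    if still_ours:
        del locks[key]
    return still_ours
```

A dispatcher that gets False back from release() knows its lock timed out while it was still working, which could be logged as a warning. That would make it possible to notice when a chosen timeout is too low.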
The highest times I can currently see are 11s (and even that surprises me… something must be really odd there). So going for roughly 10 times that (about 2 minutes) should be a rather safe bet.
In case we switch towards waiting for the slaves on the lock, we can also take the current replag into account.
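As a worked example of that arithmetic, the timeout could be derived as below. The safety factor of 10 and the 11s figure come from the comments above; simply adding the replag on top is only one possible reading of "taking it into account", and the function name is hypothetical.

```python
def lock_timeout_seconds(max_hold_seconds, safety_factor=10, replag_seconds=0.0):
    """Longest observed hold time scaled by a safety factor, plus current replag."""
    return max_hold_seconds * safety_factor + replag_seconds


if __name__ == "__main__":
    # 11s observed maximum, factor 10: 110s, i.e. roughly 2 minutes.
    print(lock_timeout_seconds(11))
    # Same, with an assumed 5s of replication lag added on top.
    print(lock_timeout_seconds(11, replag_seconds=5))
```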