Use redis-based lock manager for dispatchChanges on test.wikidata.org
Closed, Resolved, Public

Assigned To

Authored By

	Ladsgroup
	Mar 7 2017, 1:15 PM

Description

When dispatching changes from test.wikidata.org to test.wikipedia.org, use redis-based locking.

NOTE: This ticket originally asked for the test to be set up on the beta cluster. It was later changed to ask for the test to be set up on the "test" (staging) sites instead. This preserves the intent of testing in a live-like environment.

Details

Subject	Repo	Branch	Lines +/-
Revert "Temporarily enable change dispatch logging on testwikidata"	operations/mediawiki-config	master	+0 -6
Try using redisLockManager for test.wikidata.org	operations/mediawiki-config	master	+2 -0
Temporarily enable change dispatch logging on testwikidata	operations/mediawiki-config	master	+6 -0
Temporarily disable the change dispatch cron for testwikidata	operations/puppet	production	+2 -1

Related Objects

Status	Assigned	Task
Invalid	None	T108944 [Epic] Improve change dispatching
Resolved	Ladsgroup	T151681 DispatchChanges: Avoid long-lasting connections to the master DB
Resolved	Ladsgroup	T159826 Use redis-based lock manager in dispatch changes in production
Resolved	hoo	T159828 Use redis-based lock manager for dispatchChanges on test.wikidata.org

Event Timeline

Ladsgroup created this task. Mar 7 2017, 1:15 PM

Ladsgroup moved this task from Proposed to Doing on the Wikidata-Former-Sprint-Board board. Mar 7 2017, 1:49 PM

Ladsgroup mentioned this in T151993: Implement ChangeDispatchCoordinator based on RedisLockManager.

Ladsgroup moved this task from Incoming to In progress on the User-Ladsgroup board. Mar 9 2017, 10:15 PM

jcrespo moved this task from Triage to Blocked external/Not db team on the DBA board. Mar 22 2017, 8:00 PM

Lydia_Pintscher moved this task from incoming to in progress on the Wikidata board. Mar 23 2017, 2:29 PM

Beta cluster doesn't have dispatching.

@Ladsgroup thanks for pointing that out. Let's do this on test.wikidata.org, then.

Sorry about this. You are not the only ones "suffering" from beta not being a reliable place for testing in a truly distributed fashion; we were just discussing this on IRC. I also support a test on test, and offer my help if I can provide it. Thanks again for working on this.

jcrespo awarded a token. Mar 29 2017, 3:34 PM

Change 345387 had a related patch set uploaded (by Daniel Kinzler):
[operations/mediawiki-config@master] Try using redisLockManager for test.wikidata.org

https://gerrit.wikimedia.org/r/345387

gerritbot added a project: Patch-For-Review. Mar 29 2017, 3:57 PM

@hoo Can you confirm that we are actually running multiple instances of dispatchChanges.php for test.wikidata.org? Which clients does it dispatch to?

daniel claimed this task. Mar 29 2017, 4:00 PM

Ladsgroup moved this task from In progress to Monitor on the User-Ladsgroup board. Mar 30 2017, 8:44 AM

In T159828#3140724, @daniel wrote:

@hoo Can you confirm that we are actually running multiple instances of dispatchChanges.php for test.wikidata.org? Which clients does it dispatch to?

We are starting one instance with --max-time 900 every 15 minutes, so most of the time there should only be one instance running (could be that for brief moments two instances are running, given max-time is not that strictly enforced).

It dispatches to test2wiki, testwiki and testwikidatawiki.

@hoo @aude Are you ok with merging/deploying the config patch?

I'd like to test this as follows:

stop the dispatch script for a while.
do lots of edits
run a bunch of dispatcher instances in parallel
make sure there are no errors
somehow (log file?) check that all changes got dispatched exactly once
return to the normal dispatch schedule, but keep using the redis lock
after a while, again check that all changes got dispatched exactly once

Does that sound good? How painful is the log/compare stuff, do you think?
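
The "dispatched exactly once" check in the plan above could be sketched roughly as follows. This is only an illustration: the record format (pairs of change ID and client wiki) and the function name `check_exactly_once` are assumptions, not the actual format of the dispatch log.

```python
from collections import Counter

def check_exactly_once(expected_change_ids, clients, log_records):
    """Return {(change_id, client): count} for every pair not seen exactly once.

    log_records is a hypothetical list of (change_id, client) pairs parsed
    from a dispatch log; the real log format may differ.
    """
    counts = Counter(log_records)
    problems = {}
    for client in clients:
        for change_id in expected_change_ids:
            seen = counts[(change_id, client)]
            if seen != 1:
                problems[(change_id, client)] = seen
    return problems

# Made-up sample data: change 2 went to testwiki twice and never to test2wiki.
records = [(1, "testwiki"), (1, "test2wiki"), (2, "testwiki"), (2, "testwiki")]
problems = check_exactly_once([1, 2], ["testwiki", "test2wiki"], records)
print(problems)
```

An empty result would mean every change reached every client exactly once; a count of 0 marks a lost change and a count above 1 marks a duplicate dispatch.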

Deployment of this has been scheduled for April 6 14:00–15:30 UTC.

hoo renamed this task from "Use redis-based lock manager for dispatchChanges on test sites." to "Use redis-based lock manager for dispatchChanges on test.wikidata.org". Apr 5 2017, 12:26 PM

Change 346540 had a related patch set uploaded (by Hoo man):
[operations/mediawiki-config@master] Temporarily enable change dispatch logging on testwikidata

https://gerrit.wikimedia.org/r/346540

Change 346545 had a related patch set uploaded (by Hoo man):
[operations/puppet@production] Temporarily disable the change dispatch cron for testwikidata

https://gerrit.wikimedia.org/r/346545

Change 346545 merged by Jcrespo:
[operations/puppet@production] Temporarily disable the change dispatch cron for testwikidata

https://gerrit.wikimedia.org/r/346545

Change 346540 merged by jenkins-bot:
[operations/mediawiki-config@master] Temporarily enable change dispatch logging on testwikidata

https://gerrit.wikimedia.org/r/346540

Change 345387 merged by jenkins-bot:
[operations/mediawiki-config@master] Try using redisLockManager for test.wikidata.org

https://gerrit.wikimedia.org/r/345387

Mentioned in SAL (#wikimedia-operations) [2017-04-06T16:17:19Z] <hoo@tin> Synchronized wmf-config/Wikibase-production.php: Try using redisLockManager for test.wikidata.org (T159828) (duration: 00m 39s)

We deployed this yesterday and I can confirm that it works correctly.

I tested this by stopping the regular dispatcher, enabling very verbose logging and then starting 3 dispatchers with --batch-size 1 --dispatch-interval 0 (making them very fast and "race" for changes). Afterwards I dispatched several changes from testwikidata to both testwiki and test2wiki and confirmed that they arrived there properly and in order.

The dispatchers ran into locks every now and then, so I assume the locking to be effective.
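
The racing setup described above can be imitated in a small simulation. The sketch below is not the dispatchChanges.php logic: it uses in-process threads and `threading.Lock` as a stand-in for the Redis lock, and names such as `run_race` are made up for illustration. It only shows why per-client locking lets several fast dispatchers deliver each change once and in order.

```python
import threading

CHANGES = list(range(20))            # stand-in for pending change IDs
CLIENTS = ["testwiki", "test2wiki"]  # client wikis named in the test above

def run_race(num_dispatchers=3):
    locks = {c: threading.Lock() for c in CLIENTS}  # one lock per client wiki
    position = {c: 0 for c in CLIENTS}              # next change to send per client
    delivered = {c: [] for c in CLIENTS}
    lock_misses = [0]
    counter_lock = threading.Lock()

    def dispatcher():
        while any(position[c] < len(CHANGES) for c in CLIENTS):
            for client in CLIENTS:
                # Non-blocking attempt: if another dispatcher holds the lock, move on.
                if not locks[client].acquire(blocking=False):
                    with counter_lock:
                        lock_misses[0] += 1
                    continue
                try:
                    pos = position[client]
                    if pos < len(CHANGES):
                        # "Batch size 1": deliver a single change per lock hold.
                        delivered[client].append(CHANGES[pos])
                        position[client] = pos + 1
                finally:
                    locks[client].release()

    threads = [threading.Thread(target=dispatcher) for _ in range(num_dispatchers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return delivered, lock_misses[0]

delivered, lock_misses = run_race()
print(all(delivered[c] == CHANGES for c in CLIENTS), lock_misses)
```

How often a dispatcher runs into a held lock varies from run to run, but the delivered sequences should always match the original order if the locking is effective.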

hoo moved this task from Doing to Done on the Wikidata-Former-Sprint-Board board. Apr 7 2017, 9:24 AM

hoo removed a project: Patch-For-Review.

Thank you @hoo!

One thing that still worries me is that we don't have a mechanism for stale locks. If a dispatcher dies while holding a lock, that lock will stay until it times out (after 30 minutes, I think). So no changes will be dispatched to the respective client until the lock is gone. I currently see no good way to get around this, though. We could put information (host name and process id) about the lock holder into redis (or into the changes_dispatch table), and then actively check if the process is alive... but I don't know an easy way to do this without setting up another service.

In T159828#3163025, @daniel wrote:

Thank you @hoo!

One thing that still worries me is that we don't have a mechanism for stale locks. If a dispatcher dies while holding a lock, that lock will stay until it times out (after 30 minutes, I think). So no changes will be dispatched to the respective client until the lock is gone. I currently see no good way to get around this, though. We could put information (host name and process id) about the lock holder into redis (or into the changes_dispatch table), and then actively check if the process is alive... but I don't know an easy way to do this without setting up another service.

We can set a timeout for this; LockManager::lock allows that (as third argument). I don't know what would be decent here, maybe 2 or 3 minutes?

As a small side note: that can also happen on MySQL. Despite locks being released on session disconnection, there have been some occasions where the MySQL session was not killed (it continued), but the thread on MediaWiki was. There are several known bugs about that, like T97192, where a connector with issues doesn't terminate the MySQL session cleanly. If there is a timeout that works now, that is better than what we have now :-)
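
To make the stale-lock scenario concrete, here is a minimal in-memory sketch of a lock with an expiry time. `TtlLockStore` is a hypothetical stand-in written for this illustration, not the RedisLockManager implementation; the 1800 seconds reflect the "30 minutes, I think" estimate from the comment above.

```python
class TtlLockStore:
    """Hypothetical in-memory lock table with expiry (not a real Redis client)."""

    def __init__(self, clock):
        self._clock = clock  # callable returning the current time in seconds
        self._locks = {}     # key -> (owner, expires_at)

    def acquire(self, key, owner, ttl):
        now = self._clock()
        held = self._locks.get(key)
        if held is not None and held[1] > now:
            return False     # someone else's lock is still valid
        self._locks[key] = (owner, now + ttl)
        return True

    def release(self, key, owner):
        held = self._locks.get(key)
        if held is not None and held[0] == owner:
            del self._locks[key]
            return True
        return False

# Fake clock so that the example does not need to sleep.
now = [0.0]
store = TtlLockStore(lambda: now[0])

first = store.acquire("testwiki", "dispatcher-1", ttl=1800)
# dispatcher-1 dies here and never calls release().
now[0] = 60.0
blocked = store.acquire("testwiki", "dispatcher-2", ttl=1800)
now[0] = 1800.0
recovered = store.acquire("testwiki", "dispatcher-2", ttl=1800)
print(first, blocked, recovered)  # True False True
```

With a shorter timeout the blocked window shrinks accordingly, which is the trade-off discussed in the following comments.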

In T159828#3163026, @hoo wrote:

We can set a timeout for this; LockManager::lock allows that (as third argument). I don't know what would be decent here, maybe 2 or 3 minutes?

The timeout must be longer than the maximum time that the lock could be legitimately held. Otherwise it will time out prematurely, while dispatching is still in progress. I'm not sure we can detect premature timeout of the lock. If we can, I'd feel safer in experimenting with low values. If we can't detect this, we have to be very conservative.
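
One common way to detect a premature timeout is to store a unique token with the lock and check it again before committing work. Whether the lock manager used here exposes anything like this is not established in this thread, so the sketch below is purely illustrative and `ExpiringLock` is a made-up class.

```python
import uuid

class ExpiringLock:
    """Hypothetical single lock with an expiry time and an ownership token."""

    def __init__(self, clock):
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def acquire(self, ttl):
        now = self._clock()
        if self._token is not None and self._expires_at > now:
            return None                # still validly held by someone else
        self._token = uuid.uuid4().hex # fresh token identifies this holder
        self._expires_at = now + ttl
        return self._token

    def still_held_by(self, token):
        return self._token == token and self._expires_at > self._clock()

clock_now = [0.0]
lock = ExpiringLock(lambda: clock_now[0])

token = lock.acquire(ttl=120)
clock_now[0] = 100.0
ok_before_expiry = lock.still_held_by(token)   # True: safe to record progress
clock_now[0] = 130.0
ok_after_expiry = lock.still_held_by(token)    # False: lock timed out mid-dispatch
print(ok_before_expiry, ok_after_expiry)
```

Such a check is not atomic with the write that follows it, so it narrows the window for a premature timeout going unnoticed rather than closing it completely.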

In T159828#3163156, @daniel wrote:

In T159828#3163026, @hoo wrote:

We can set a timeout for this; LockManager::lock allows that (as third argument). I don't know what would be decent here, maybe 2 or 3 minutes?

The timeout must be longer than the maximum time that the lock could be legitimately held. Otherwise it will time out prematurely, while dispatching is still in progress. I'm not sure we can detect premature timeout of the lock. If we can, I'd feel safer in experimenting with low values. If we can't detect this, we have to be very conservative.

The highest times I can currently see are 11s (and even that surprises me… something must be really odd there). So going for 10 times that (2 minutes) should be a rather safe bet.

In case we switch towards waiting for the slaves on the lock, we can also take the current replag into account.
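
The sizing rule suggested above amounts to simple arithmetic. In the sketch below the function name and the additive treatment of replication lag are assumptions made for illustration; the thread only says the lag could be "taken into account".

```python
def lock_timeout_seconds(max_observed_hold, safety_factor=10, replication_lag=0.0):
    """Timeout = safety factor times the longest observed hold, plus current lag."""
    return max_observed_hold * safety_factor + replication_lag

base_timeout = lock_timeout_seconds(11)                          # 11 s observed maximum
with_lag = lock_timeout_seconds(11, replication_lag=5.0)         # assumed 5 s of lag
print(base_timeout, with_lag)  # 110.0 115.0
```

Ten times the 11-second maximum gives 110 seconds, which is where the "2 minutes" figure comes from.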

Change 346788 had a related patch set uploaded (by Hashar; owner: Hoo man):
[operations/mediawiki-config@master] Revert "Temporarily enable change dispatch logging on testwikidata"

https://gerrit.wikimedia.org/r/346788

gerritbot added a project: Patch-For-Review. Apr 12 2017, 1:02 PM

Change 346788 merged by jenkins-bot:
[operations/mediawiki-config@master] Revert "Temporarily enable change dispatch logging on testwikidata"

https://gerrit.wikimedia.org/r/346788

Mentioned in SAL (#wikimedia-operations) [2017-04-12T13:08:30Z] <hashar@tin> Synchronized wmf-config/InitialiseSettings.php: Revert "Temporarily enable change dispatch logging on testwikidata" - T159828 (duration: 00m 47s)
