
Use redis-based lock manager for dispatchChanges on test.wikidata.org
Closed, ResolvedPublic

Description

When dispatching changes from test.wikidata.org to test.wikipedia.org, use redis-based locking.

NOTE: This ticket originally asked for the test to be set up on the beta cluster. It changed to ask for the test to be set up on the "test" (staging) sites. This preserves the intent of testing in a live-like environment.

Event Timeline

The beta cluster doesn't have dispatching.

daniel renamed this task from "Use redis-based lock manager in dispatch changes in beta cluster" to "Use redis-based lock manager for dispatchChanges on test sites.". (Mar 29 2017, 3:09 PM)
daniel reopened this task as Open.
daniel removed a project: Performance Issue.
daniel updated the task description. (Show Details)

@Ladsgroup thanks for pointing that out. Let's do this on test.wikidata.org, then.

Sorry about this; you are not the only ones suffering from beta not being a reliable place for testing in a truly distributed fashion. We were just discussing this on IRC. I also support a test on test, and offer my help if I can provide it. Thanks again for working on this.

Change 345387 had a related patch set uploaded (by Daniel Kinzler):
[operations/mediawiki-config@master] Try using redisLockManager for test.wikidata.org

https://gerrit.wikimedia.org/r/345387

@hoo Can you confirm that we are actually running multiple instances of dispatchChanges.php for test.wikidata.org? Which clients does it dispatch to?

We are starting one instance with --max-time 900 every 15 minutes, so most of the time there should only be one instance running (two instances could briefly overlap, given that --max-time is not strictly enforced).

It dispatches to test2wiki, testwiki and testwikidatawiki.

@hoo @aude Are you ok with merging/deploying the config patch?

I'd like to test this as follows:

  • stop the dispatch script for a while.
  • do lots of edits
  • run a bunch of dispatcher instances in parallel
  • make sure there are no errors
  • somehow (log file?) check that all changes got dispatched exactly once
  • return to the normal dispatch schedule, but keep using the redis lock
  • after a while, again check that all changes got dispatched exactly once

Does that sound good? How painful is the log/compare stuff, do you think?
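The "check that all changes got dispatched exactly once" step could be done with a small log-scanning script along these lines. This is only a sketch: the log line format ("dispatched change <id> to <client>") is a made-up placeholder, and the real dispatchChanges.php log output may look quite different.

```python
from collections import Counter

# Hypothetical log lines; the real dispatch log format may differ.
log_lines = [
    "dispatched change 101 to testwiki",
    "dispatched change 101 to test2wiki",
    "dispatched change 102 to testwiki",
    "dispatched change 102 to test2wiki",
]

def check_exactly_once(lines):
    """Return (change_id, client) pairs that were dispatched more than once."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        change_id, client = parts[2], parts[4]
        counts[(change_id, client)] += 1
    return [key for key, n in counts.items() if n > 1]

duplicates = check_exactly_once(log_lines)
print(duplicates)  # an empty list means no (change, client) pair was dispatched twice
```

Checking that nothing was dispatched *zero* times would additionally need the list of expected change IDs, e.g. from the wb_changes table, to compare against.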

Deployment of this has been scheduled for April 6 14:00–15:30 UTC.

hoo renamed this task from "Use redis-based lock manager for dispatchChanges on test sites." to "Use redis-based lock manager for dispatchChanges on test.wikidata.org". (Apr 5 2017, 12:26 PM)

Change 346540 had a related patch set uploaded (by Hoo man):
[operations/mediawiki-config@master] Temporarily enable change dispatch logging on testwikidata

https://gerrit.wikimedia.org/r/346540

Change 346545 had a related patch set uploaded (by Hoo man):
[operations/puppet@production] Temporarily disable the change dispatch cron for testwikidata

https://gerrit.wikimedia.org/r/346545

Change 346545 merged by Jcrespo:
[operations/puppet@production] Temporarily disable the change dispatch cron for testwikidata

https://gerrit.wikimedia.org/r/346545

Change 346540 merged by jenkins-bot:
[operations/mediawiki-config@master] Temporarily enable change dispatch logging on testwikidata

https://gerrit.wikimedia.org/r/346540

Change 345387 merged by jenkins-bot:
[operations/mediawiki-config@master] Try using redisLockManager for test.wikidata.org

https://gerrit.wikimedia.org/r/345387

Mentioned in SAL (#wikimedia-operations) [2017-04-06T16:17:19Z] <hoo@tin> Synchronized wmf-config/Wikibase-production.php: Try using redisLockManager for test.wikidata.org (T159828) (duration: 00m 39s)

We deployed this yesterday and I can confirm that it works correctly.

I tested this by stopping the regular dispatcher, enabling very verbose logging and then starting three dispatchers with --batch-size 1 --dispatch-interval 0 (making them very fast and letting them "race" for changes). Afterwards I dispatched several changes from testwikidata to both testwiki and test2wiki and confirmed that they properly arrived there in order.

The dispatchers ran into locks every now and then, so I assume the locking is effective.
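The race described above can be sketched with an in-memory stand-in for redis's atomic set-if-absent (SET ... NX). The lock key name and worker setup below are illustrative, not the production configuration: the point is only that when several dispatchers race for the same per-client lock, exactly one wins.

```python
import threading

class FakeRedis:
    """In-memory stand-in for redis SET key value NX (atomic set-if-absent)."""
    def __init__(self):
        self._data = {}
        self._mutex = threading.Lock()

    def set_nx(self, key, value):
        with self._mutex:
            if key in self._data:
                return False
            self._data[key] = value
            return True

r = FakeRedis()
acquired = []

def dispatcher(worker_id):
    # Each dispatcher races for the per-client lock; only one can win.
    if r.set_nx("dispatch-lock:testwiki", worker_id):
        acquired.append(worker_id)

threads = [threading.Thread(target=dispatcher, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(acquired))  # exactly one dispatcher holds the lock
```

The losers would move on to the next client wiki's lock instead of blocking, which is what makes the parallel dispatchers "race" for changes rather than queue up.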

Thank you @hoo!

One thing that still worries me is that we don't have a mechanism for stale locks. If a dispatcher dies while holding a lock, that lock will stay until it times out (after 30 minutes, I think). So no changes will be dispatched to the respective client until the lock is gone. I currently see no good way to get around this, though. We could put information (host name and process id) about the lock holder into redis (or into the changes_dispatch table), and then actively check if the process is alive... but I don't know an easy way to do this without setting up another service.
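The host-name/process-id idea could look roughly like this. The lock value format and the liveness check are purely illustrative assumptions, and as noted above, checking a holder on a *remote* host would still require another service; this sketch can only answer the question locally.

```python
import os
import socket

def make_lock_value():
    # Illustrative: identify the lock holder as "<host>:<pid>".
    return f"{socket.gethostname()}:{os.getpid()}"

def holder_alive(lock_value):
    """True/False if the holder runs on this host; None if we can't tell."""
    host, pid = lock_value.rsplit(":", 1)
    if host != socket.gethostname():
        return None  # can't check a remote process without another service
    try:
        os.kill(int(pid), 0)  # signal 0: existence check, sends nothing
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists, we just can't signal it

value = make_lock_value()
print(holder_alive(value))  # the current process is trivially alive
```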

We can set a timeout for this; LockManager::lock allows that (as its third argument). I don't know what would be a decent value here, maybe 2 or 3 minutes?

As a small side note: that can also happen on MySQL. Despite locks being released on session disconnection, there have been occasions where the MySQL session is not killed (it continues) even though the MediaWiki thread has been. There are several known bugs about that, like T97192, where a connector with issues doesn't terminate the MySQL session cleanly. If there is a timeout that works now, that is better than what we have now :-)

The timeout must be longer than the maximum time the lock could legitimately be held. Otherwise it will time out prematurely, while dispatching is still in progress. I'm not sure we can detect premature expiry of the lock. If we can, I'd feel safer experimenting with low values. If we can't detect this, we have to be very conservative.
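One common way to detect premature expiry, sketched here with an in-memory stand-in for redis SET ... NX EX, is for the dispatcher to keep a unique token as the lock value and re-check it before critical steps: if the value has changed, the TTL expired and another holder took over. This is an assumption about how it *could* be done, not a description of how LockManager behaves.

```python
import time
import uuid

class FakeRedisTTL:
    """In-memory stand-in for redis SET key value NX EX <ttl>."""
    def __init__(self):
        self._data = {}  # key -> (value, expiry_timestamp)

    def set_nx_ex(self, key, value, ttl):
        now = time.monotonic()
        entry = self._data.get(key)
        if entry and entry[1] > now:
            return False  # a live lock already exists
        self._data[key] = (value, now + ttl)
        return True

    def get(self, key):
        entry = self._data.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]
        return None  # missing or expired

def still_holding(r, key, token):
    # If the stored token differs (or is gone), our lock expired prematurely.
    return r.get(key) == token

r = FakeRedisTTL()
token = str(uuid.uuid4())
assert r.set_nx_ex("dispatch-lock:testwiki", token, ttl=120)
print(still_holding(r, "dispatch-lock:testwiki", token))  # True while the TTL lasts
```

With such a check, a dispatcher that notices its token is gone can abort its batch instead of writing on top of another holder, which would make low timeout values much safer to experiment with.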

The highest times I can currently see are 11s (and even that surprises me… something must be really odd there). So going for 10 times that (2 minutes) should be a rather safe bet.

In case we switch to waiting for the slaves while holding the lock, we can also take the current replication lag into account.

Change 346788 had a related patch set uploaded (by Hashar; owner: Hoo man):
[operations/mediawiki-config@master] Revert "Temporarily enable change dispatch logging on testwikidata"

https://gerrit.wikimedia.org/r/346788

Change 346788 merged by jenkins-bot:
[operations/mediawiki-config@master] Revert "Temporarily enable change dispatch logging on testwikidata"

https://gerrit.wikimedia.org/r/346788

Mentioned in SAL (#wikimedia-operations) [2017-04-12T13:08:30Z] <hashar@tin> Synchronized wmf-config/InitialiseSettings.php: Revert "Temporarily enable change dispatch logging on testwikidata" - T159828 (duration: 00m 47s)