
Create test script for ChangeDispatchCoordinator
Closed, Resolved · Public

Description

We need a good way to test the locking mechanism used by the change dispatch coordinator. A standalone test script, of which multiple instances can be run in parallel, should do the trick. Pseudo-code of the main test loop:

forever {
  $state = $coordinator->selectClient();
  $clientId = $state['chd_site'];

  if ( !$clientId ) {
    // no client is available for dispatching right now
    sleep( 1 );
    continue;
  }

  // we hold the lock for this client, so no other instance may touch it
  assert( readTestValue( $clientId ) === 0 );

  sleep( 1 );
  assert( readTestValue( $clientId ) === 0 );
  writeTestValue( $clientId, 1 );

  sleep( 1 );
  assert( readTestValue( $clientId ) === 1 );

  // leave the value clean for the next holder
  writeTestValue( $clientId, 0 );

  $state['chd_seen'] += 1;
  $coordinator->releaseClient( $state );
}

readTestValue and writeTestValue can be implemented based on any persistent storage with atomic access - SQL, a file, memcached...
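For illustration, readTestValue / writeTestValue could look like this (a Python sketch using one small file per client; the names and paths are made up for the example, and os.replace provides the atomic access mentioned above):

```python
import os
import tempfile

# One small file per client wiki holds that client's test value.
# os.replace() is an atomic rename on POSIX, so a concurrent reader
# sees either the old value or the new one, never a partial write.

STATE_DIR = tempfile.mkdtemp(prefix="chd_test_")

def _path(client_id):
    return os.path.join(STATE_DIR, client_id)

def read_test_value(client_id):
    try:
        with open(_path(client_id)) as f:
            return int(f.read())
    except FileNotFoundError:
        return 0  # a client that was never written starts at 0

def write_test_value(client_id, value):
    tmp = _path(client_id) + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(value))
    os.replace(tmp, _path(client_id))  # atomic swap into place

write_test_value("dewiki", 1)
print(read_test_value("dewiki"))  # 1
print(read_test_value("frwiki"))  # 0 (never written)
```

An SQL or memcached backend would work the same way, as long as a complete value is read or written in one atomic step.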

Note that this WILL destroy the state of wb_changes_dispatch, so it should really be done on a dummy table or dummy database. Also note that there should be more instances of this script running than there are client wikis in the wb_changes_dispatch table.

Perhaps a fake in-memory wb_changes_dispatch table with a single entry would be sufficient; the different instances of the test script would not share the table and could not use its contents to coordinate - but they would still use the locking mechanism, probably even more heavily.

Event Timeline

@Ladsgroup From looking at the code, the script does not test concurrent access to the lock. Am I missing something?

On a related note: it's good that the test passes with the RedisLockManager. But make sure the test fails with a NullLockManager! Otherwise, it's not testing what we need it to test.
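The NullLockManager point can be illustrated with a small simulation (a Python sketch standing in for the PHP script; the classes are invented for the example). With a no-op lock, several workers enter the critical section at once and the invariant checks fire; with a real lock they do not:

```python
import threading
import time

class NullLock:
    """No-op lock: always grants. A test that passes with this is not
    actually exercising mutual exclusion."""
    def acquire(self):
        return True
    def release(self):
        pass

class RealLock:
    """Proper mutual exclusion via a non-blocking try-acquire."""
    def __init__(self):
        self._lock = threading.Lock()
    def acquire(self):
        return self._lock.acquire(blocking=False)
    def release(self):
        self._lock.release()

def run_workers(lock, n=4):
    """Each worker mimics the test loop: the value must be 0 on entry,
    stay 1 while we hold the client, and be reset to 0 on exit."""
    value = {"x": 0}
    violations = []
    def worker():
        if not lock.acquire():
            return  # someone else legitimately holds the client
        try:
            if value["x"] != 0:
                violations.append("dirty on entry")
            value["x"] = 1
            time.sleep(0.05)
            if value["x"] != 1:
                violations.append("changed while held")
            value["x"] = 0
        finally:
            lock.release()
    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return violations

print("null lock violations:", bool(run_workers(NullLock())))  # True
print("real lock violations:", bool(run_workers(RealLock())))  # False
```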

Change 342186 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add testDispatchCoordinator

https://gerrit.wikimedia.org/r/342186

Ladsgroup moved this task from Blocked on others to Done on the User-Ladsgroup board.

I ran it on redis-dispatching-repo with redis as the lock manager. Works as expected?

@Ladsgroup That screenshot does not look right. It's trying to get a lock, but never gets one, and never writes test data. Have you tried with the default dispatcher, for comparison?

I tried with Null and it failed.

Yes. Try with no --lock parameter, so it uses SQL.
It should work, and the output should look quite different from what you are getting with redis.

With the SQL lock it returns a LogicException. I'm investigating the redis part, and it seems to be a config issue that holds locks for two hours. I cleaned the redis, and on the first run it returned a LogicException; after that it fails to release the lock (because it asserts and errors out), so the lock stays there until it times out. (I changed the config so it times out after one minute; we should determine a proper number for production.)

The main problem here is the failing assert regardless of the lock type. I'll investigate and get back to you.

Installed and set up the memcache "infra". Now it's working as expected.

(I couldn't find the place I added this var_dump, ignore it :})

With the SQL lock it returns a LogicException.

That *really* should not happen, it essentially means that the code we are running live now has no proper locking, and is causing data corruption. Can you confirm? I cannot reproduce this.

It's the freaking memcached

How? This should work fine with the WAN cache being memcached, or the WAN cache being the DB.

I'm investigating the redis part, and it seems to be a config issue that holds locks for two hours. I cleaned the redis, and on the first run it returned a LogicException; after that it fails to release the lock (because it asserts and errors out), so the lock stays there until it times out. (I changed the config so it times out after one minute; we should determine a proper number for production.)

Yea...

Aaron said in T151993#2854471:

If the owner fatals, the lock will have to expire (the TTL depends on the LockManager instance config and/or whether the context is CLI or web).
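That behaviour can be sketched with a toy expiring lock (Python, a single-process stand-in for a redis-style lock with an expiry; the class is invented for the example):

```python
import time

class TtlLock:
    """Toy lock with a time-to-live: if the holder never releases
    (e.g. it fataled), the lock simply becomes free again once the
    TTL elapses, mimicking an expiring redis lock."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.expires_at = 0.0  # in the past, so the lock starts free

    def acquire(self):
        now = time.monotonic()
        if now < self.expires_at:
            return False  # still held and not yet expired
        self.expires_at = now + self.ttl
        return True

    def release(self):
        self.expires_at = 0.0  # explicitly free the lock

lock = TtlLock(ttl=0.2)
print(lock.acquire())  # True: the lock was free
print(lock.acquire())  # False: held, TTL not yet elapsed
time.sleep(0.25)       # pretend the holder crashed without releasing
print(lock.acquire())  # True: the TTL expired and the lock recovered
```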

The timeout must never be so short that the lock would be released even though the dispatch is still running. One minute seems too low.

We could also try to do this in a smarter way, by recording the PID of the process that holds the lock; if the process is dead, the lock can be killed. I'm not sure this is possible with redis, though.
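Checking a recorded PID is at least feasible when all dispatcher instances run on the same host (a Python sketch; the helper name is made up, and kill with signal 0 only performs the existence check without sending anything):

```python
import os

def holder_is_alive(pid):
    """Return whether the process that recorded the lock still exists.
    Signal 0 makes kill() do its error checking only; no signal is sent."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process: the lock holder is dead
    except PermissionError:
        return True   # process exists but belongs to another user
    return True

print(holder_is_alive(os.getpid()))  # True: we are running right now
```

Storing the PID in the redis lock value would be easy enough; the hard part is that this liveness check only works on the machine where the holder ran, so it would not help across hosts.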

The main problem here is the failing assert regardless of the lock type. I'll investigate and get back to you.

That's a problem indeed. Failing assertions mean data loss/corruption. I'm particularly worried that this seems to fail even for SQL locking for you.

That *really* should not happen, it essentially means that the code we are running live now has no proper locking, and is causing data corruption. Can you confirm? I cannot reproduce this.

It got fixed when I got the memcached setup up and running. The readTestValue function in the test script returned null for every value (because the caching setup wasn't working properly), so the assertions failed. Note that it doesn't fall back to the DB cache, because we explicitly use wfMemcKey. I think this should be fixed by making the test script refuse to run when no memcached setup is defined, but that's a matter of the test script and not related to the dispatching/coordinator.

The timeout must never be so short that the lock would be released even though the dispatch is still running. One minute seems too low.

Any number that works for you is okay for me. I set it to one minute so I didn't have to wait when it errors out, but even two hours is okay for me in prod.

That's a problem indeed. Failing assertions mean data loss/corruption. I'm particularly worried that this seems to fail even for SQL locking for you.

I highly doubt it's any issue related to the prod settings (or their future values), as I explained above.

To distinguish between setup problems and locking failures, it would probably be useful to show a different error if reading the test value returns null.
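One way to make that distinction explicit (a Python sketch with invented names; a null read is reported as a setup problem rather than a locking failure):

```python
class CacheNotWorkingError(RuntimeError):
    """Setup problem: the cache backend stored nothing at all."""

def check_test_value(client_id, cache):
    """Distinguish a broken cache setup from a real locking failure.
    A null read means the backend is not storing values (e.g. memcached
    is down), which is very different from another dispatcher interfering."""
    value = cache.get(client_id)
    if value is None:
        raise CacheNotWorkingError(
            "cache returned null for %s; check the memcached setup" % client_id)
    if value != 0:
        raise AssertionError(
            "lock violation: %s has test value %r" % (client_id, value))

# A plain dict stands in for the cache client here.
check_test_value("dewiki", {"dewiki": 0})  # setup fine, no violation
try:
    check_test_value("dewiki", {})         # nothing stored at all
except CacheNotWorkingError as e:
    print("setup error:", e)
```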

Change 344751 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Throw RuntimeException when caching doesn't work in testDispatchCoordinator

https://gerrit.wikimedia.org/r/344751