Page MenuHomePhabricator

Wikibase test builds failing with “ERROR: 0 is not in the dispatch table”
Closed, ResolvedPublic

Description

I’ve seen this in several failed builds now, example (change):

19:49:48 There was 1 error:
19:49:48 
19:49:48 1) Wikibase\Repo\Tests\Store\Sql\LockManagerSqlChangeDispatchCoordinatorTest::testBasicLockManager
19:49:48 === Logs generated by test case
19:49:48 [objectcache] [debug] MainWANObjectCache using store {class} {"class":"EmptyBagOStuff"}
19:49:48 ===
19:49:48 ERROR: 0 is not in the dispatch table. [Called from Wikibase\Repo\Store\Sql\SqlChangeDispatchCoordinator::warn in /workspace/src/extensions/Wikibase/repo/includes/Store/Sql/SqlChangeDispatchCoordinator.php at line 608]
19:49:48 
19:49:48 /workspace/src/includes/debug/MWDebug.php:333
19:49:48 /workspace/src/includes/debug/MWDebug.php:188
19:49:48 /workspace/src/includes/GlobalFunctions.php:1079
19:49:48 /workspace/src/extensions/Wikibase/repo/includes/Store/Sql/SqlChangeDispatchCoordinator.php:608
19:49:48 /workspace/src/extensions/Wikibase/repo/includes/Store/Sql/SqlChangeDispatchCoordinator.php:459
19:49:48 /workspace/src/extensions/Wikibase/repo/includes/Store/Sql/SqlChangeDispatchCoordinator.php:284
19:49:48 /workspace/src/extensions/Wikibase/repo/tests/phpunit/includes/Store/Sql/LockManagerSqlChangeDispatchCoordinatorTest.php:65
19:49:48 /workspace/src/tests/phpunit/MediaWikiIntegrationTestCase.php:426
19:49:48 /workspace/src/maintenance/doMaintenance.php:99

Affected repositories:

It affects both regular test builds and gate-and-submit builds, but only the wikibase-repo-docker job.

Patch-For-Review: https://gerrit.wikimedia.org/r/551873

Event Timeline

0 is the argument to lockClient(), which is called in selectClient() wth keys of $candidates:

$candidates = $this->getCandidateClients();

while ( $candidates ) {
	// pick one
	$k = call_user_func( $this->array_rand, $candidates );
	$wiki = $candidates[ $k ];
	unset( $candidates[$k] );

	// lock it
	$state = $this->lockClient( $wiki );

In getCandidateClients(), the return value ultimately comes from IDatabase::selectFieldValues(), implying that we have some wb_changes_dispatch rows where the chd_site is actually 0. I’m not sure if that’s correct.

I haven’t found any recent changes in Wikibase that sound relevant. However, the MediaWiki core change [“Have Database::addQuotes() pass through bare integers without quoting”](https://gerrit.wikimedia.org/r/551024) was merged very recently, and sounds like it might cause this issue, though I don’t yet see exactly how this would happen.

Change 551618 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/core@master] DNM: Revert "Have Database::addQuotes() pass through bare integers without quoting"

https://gerrit.wikimedia.org/r/551618

Change 551619 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/Wikibase@master] DNM: testing if build still fails

https://gerrit.wikimedia.org/r/551619

[T]he MediaWiki core change [“Have Database::addQuotes() pass through bare integers without quoting”](https://gerrit.wikimedia.org/r/551024) was merged very recently, and sounds like it might cause this issue, though I don’t yet see exactly how this would happen.

Okay, if I revert that change and then make a Wikibase no-op change depend on that (see above @gerritbot comments), then the wikibase-repo-docker job passes again. I still don’t understand how that change to quoting behavior breaks our test, but it seems like the most likely cause at the moment. (I’m going home now, so don’t expect any more investigation from me until tomorrow ☺)

In getCandidateClients(), the return value ultimately comes from IDatabase::selectFieldValues(), implying that we have some wb_changes_dispatch rows where the chd_site is actually 0. I’m not sure if that’s correct.

It looks like there is such a row. From the debug log of the build you referenced:

INSERT OR IGNORE INTO unittest_wb_changes_dispatch (chd_site,chd_db,chd_seen,chd_touched,chd_lock,chd_disabled) VALUES (0,'foowiki',0,'00000000000000',NULL,0)

In the test:

$coordinator->initState( [ 'foowiki' ] );

The parameter is supposed to be "an associative array mapping client wiki IDs to client wiki (logical) database names", so passing ['foowiki'] causes the wiki ID to be 0.

I note that the test is running SQLite, is there a reason for doing that? I couldn't reproduce the error with SQLite locally, but I'm not sure what version was used in the test. The addQuotes() change causes an integer instead of a string to be inserted into chd_site. Then the SELECT is later done by comparing chd_site with a string: chd_site='0'. You can imagine SQLite getting confused about that.

It would probably work to either cast $siteID to a string in initState(), or to have the test call initState() correctly in the first place, say

$coordinator->initState( [ 'foowiki' => 'foowiki' ] );

FWIW, the failure also materializes in some Travis CI jobs, e.g. https://travis-ci.org/wikimedia/mediawiki-extensions-Wikibase/jobs/613897002

Notably, it only seems to break the sqlite jobs in this case as well.

Change 551808 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/Wikibase@master] Use string client wiki ID

https://gerrit.wikimedia.org/r/551808

Hm, it still fails when calling initState() with [ '0' => 'foowiki' ], so I don’t think casting $siteID to a string in initState() would change anything. There’s some confusion between strings and numbers somewhere…

Hm, it still fails when calling initState() with [ '0' => 'foowiki' ], so I don’t think casting $siteID to a string in initState() would change anything. There’s some confusion between strings and numbers somewhere…

PHP auto casts these to ints:

php > var_dump(array_keys(["1" => null])[0]);
int(1)

I added some debug output here.

I couldn't reproduce the error with SQLite locally, but I'm not sure what version was used in the test.

The Jenkins build uses 3.16.2, released 2017-01-06. Not sure what version the Travis builds use.

The output from this snippet is interesting:

wfDebug( 'change dispatch row: ' . var_export( $dbr->selectRow(
	$this->stateTable,
		'*',
		[],
		__METHOD__
), true ) );

wfDebug( 'conditions: ' . var_export( [ 'chd_site' => $siteID ], true ) );

// get client state
$state = $dbr->selectRow(
	$this->stateTable,
	[ 'chd_site', 'chd_db', 'chd_seen', 'chd_touched', 'chd_lock', 'chd_disabled' ],
	[ 'chd_site' => $siteID ],
	__METHOD__
);

if ( !$state ) {
	$this->warn( "ERROR: $siteID is not in the dispatch table." );
	return false;
}
14:16:39 [wfDebug] [debug] change dispatch row: stdClass::__set_state(array(
14:16:39    'chd_site' => '0',
14:16:39    'chd_db' => 'foowiki',
14:16:39    'chd_seen' => '0',
14:16:39    'chd_touched' => '00000000000000',
14:16:39    'chd_lock' => NULL,
14:16:39    'chd_disabled' => '0',
14:16:39 )) {"private":false}
14:16:39 [wfDebug] [debug] conditions: array (
14:16:39   'chd_site' => '0',
14:16:39 ) {"private":false}
14:16:39 ===
14:16:39 ERROR: 0 is not in the dispatch table. [Called from Wikibase\Repo\Store\Sql\SqlChangeDispatchCoordinator::warn in /workspace/src/extensions/Wikibase/repo/includes/Store/Sql/SqlChangeDispatchCoordinator.php at line 621]

So selectRow with no conditions returns a row having 'chd_site' => '0', and immediately afterwards a selectRow with condition 'chd_site' => '0' fails to return a row. The 0 is a string, not an int, in both cases. The query text for the second query also contains '0' as a string:

SELECT  chd_site,chd_db,chd_seen,chd_touched,chd_lock,chd_disabled  FROM unittest_wb_changes_dispatch    WHERE chd_site = '0'  LIMIT 1

But what is actually stored in the database (SELECT TYPEOF)?

I updated that debug change to make it depend on the MediaWiki core revert; comparing #8812 (without revert) and #8815 (with revert), the debug output is identical: we insert the same data ('chd_site' => 0), the same row is present in the database ('chd_site' => '0'), and the same query text is used for the select (WHERE chd_site = '0'). I don’t understand why it returns a row in one case but not in the other.

But what is actually stored in the database (SELECT TYPEOF)?

Uh.

15:09:47 [wfDebug] [debug] real type of chd_site: 'integer' {"private":false}

How? It’s a VARBINARY field! (BLOB in sqlite, I guess, according to DatabaseSqlite::replaceVars().)

How? It’s a VARBINARY field! (BLOB in sqlite, I guess, according to DatabaseSqlite::replaceVars().)

The declared column types in sqlite are (mostly) advisory, it doesn't enforce them. See https://www.sqlite.org/datatype3.html for details.

I think I've determined what's going on here. The critical SQL statements from https://integration.wikimedia.org/ci/job/wikibase-repo-docker/8812/artifact/log/mw-debug-cli.log.gz are these:

CREATE TEMPORARY TABLE "unittest_wb_changes_dispatch" (
 chd_site     BLOB     NOT NULL PRIMARY KEY,  -- client wiki's global site ID
 chd_db       BLOB     NOT NULL,              -- client wiki's logical database name
 chd_seen     INTEGER               NOT NULL DEFAULT 0,    -- last change ID examined (dispatch state)
 chd_touched  BLOB     NOT NULL DEFAULT "00000000000000", -- end of last dispatch pass (informative)
 chd_lock     BLOB              DEFAULT NULL, -- name of global lock (dispatch state)
 chd_disabled INTEGER   NOT NULL DEFAULT 0     -- flag for temporarily disabling a client
 );

INSERT OR IGNORE INTO unittest_wb_changes_dispatch (chd_site,chd_db,chd_seen,chd_touched,chd_lock,chd_disabled) VALUES (0,'foowiki',0,'00000000000000',NULL,0);
SELECT  chd_site,chd_db,chd_seen,chd_touched,chd_lock,chd_disabled  FROM unittest_wb_changes_dispatch    WHERE chd_site = '0'  LIMIT 1;

But we can simplify the test case a bit:

CREATE TABLE test (chd_site BLOB);
INSERT INTO test VALUES (0);
SELECT * FROM test WHERE chd_site = '0';

Because chd_site is a BLOB column, the insert of integer-0 stores integer-0.

Then, for the SELECT with chd_site = '0', the chd_site has "BLOB" type affinity (second bullet) and the '0' has no type affinity (last bullet), and so it does no type conversion before comparison. The comparison between integer-0 (in the table) with string-0 (literal in the query) evaluates as false, so the row isn't returned.

One solution would be to not insert integer-0 into the table in the first place, either by adding a string cast at SqlChangeDispatchCoordinator.php line 404 or by using a key PHP won't convert to integer when calling initState().

Another solution would be to revert only the change to DatabaseSqlite.php, so for sqlite only we continue quoting integers in all cases.

Thanks @Anomie, that makes sense.

IMO it's probably best to revert the change for SQLite only, this seems fairly clunky and probably has a good chance of breaking installations (with old data!) in the wild.

I think we should add a string cast in SqlChangeDispatchCoordinator – regardless of what the database implementation does, we probably shouldn’t rely on type conversion here. (I’d like it even more if we could require strings and throw an error otherwise, but due to PHP’s automatic conversion of array keys pointed out in T238575#5675061, that doesn’t seem realistic.)

That said, reverting the change to DatabaseSqlite also sounds like a good idea to me, though I don’t fully understand the consequences it would have… but I certainly hope we have no code that actually makes use of SQLite’s dynamic type system?

but I certainly hope we have no code that actually makes use of SQLite’s dynamic type system?

It's pretty unlikely unless as a hack, since general code taking advantage of it would fail on MySQL and PostgreSQL. My guess is that it's more likely to break things as it did here, particularly since it uses a transformed MySQL schema and so likely uses BLOB for almost all fields intended to hold strings.

Hacks would probably be findable by looking for code that's checking $wgDBtype or $lb->getServerType( ... ) or the like against 'sqlite', if someone wants to look.

Change 551873 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/Wikibase@master] WIP: Fix numeric site IDs in dispatch coordinator

https://gerrit.wikimedia.org/r/551873

Change 551808 abandoned by Lucas Werkmeister (WMDE):
Use string client wiki ID debug code

https://gerrit.wikimedia.org/r/551808

Change 551618 abandoned by Lucas Werkmeister (WMDE):
DNM: Revert "Have Database::addQuotes() pass through bare integers without quoting"

https://gerrit.wikimedia.org/r/551618

Change 551619 abandoned by Lucas Werkmeister (WMDE):
DNM: testing if build still fails

https://gerrit.wikimedia.org/r/551619

Change 551873 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Fix numeric site IDs in dispatch coordinator

https://gerrit.wikimedia.org/r/551873

Lucas_Werkmeister_WMDE lowered the priority of this task from High to Medium.Nov 19 2019, 7:04 PM

This is resolved for Wikibase as far as I can tell; leaving the task open for the moment, for discussion of potentially reverting the core change for SQLite.