
Repeating blank node ids in Wikidata entity RDF dumps
Closed, Resolved (Public)

Description

Wikibase (or purtle?) assigns blank node ids by just counting up: _:genidN (where N > 0).

Due to how we currently create the Wikidata entity dumps (run 6 dumpers in parallel and finally concatenate their output), we are re-using blank node ids in totally unrelated contexts, because each of these dumpers has its own counter. Does this pose a problem? Do we need to use globally unique identifiers?

Event Timeline

@Lucas_Werkmeister_WMDE brought this up on https://gerrit.wikimedia.org/r/405739, where I'm working on splitting the dumping up even further (so blank node ids would repeat even more often).

Yes, this looks like a problem. We should be using separate IDs for separate bnodes. We should probably have some kind of initializer for shards that guarantees the ID spaces do not intersect. Maybe using multipliers, i.e. the numeric ID would be N * shardCount + shardNumber, so that for the first shard ID % shardCount == 0 always holds, for the second ID % shardCount == 1, etc.
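To illustrate the multiplier idea, here is a minimal sketch in Python (not the PHP used by purtle and Wikibase). The function names shard_ids and first_labels are hypothetical and exist only for this example; the counter is assumed to start at N = 1 so that ids stay positive, as in the current genidN scheme.

```python
from itertools import count


def shard_ids(shard_number: int, shard_count: int):
    """Yield N * shard_count + shard_number for N = 1, 2, 3, ...

    Every id produced for a shard has the same remainder modulo
    shard_count, so two different shards can never produce the same id.
    """
    if not 0 <= shard_number < shard_count:
        raise ValueError("shard_number must be in [0, shard_count)")
    for n in count(1):
        yield n * shard_count + shard_number


def first_labels(shard_number: int, shard_count: int, how_many: int):
    """Return the first few blank node labels a shard would emit."""
    ids = shard_ids(shard_number, shard_count)
    return [f"genid{next(ids)}" for _ in range(how_many)]


if __name__ == "__main__":
    # Six parallel dumpers, as in the task description.
    for shard in range(6):
        print(shard, first_labels(shard, 6, 3))
```

One trade-off of this scheme is that every dumper has to know the total shard count up front, and changing the number of shards changes every id.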

I think the simplest way to fix this is to add a UUID (or something similar) to the ID prefix (currently, it's just "genid").
BNodeLabeler has a parameter for the prefix in the constructor. RdfWriterFactory will have to be changed so that it can optionally be given a BNodeLabeler and set it explicitly when creating an RdfWriter. It could even generate a unique prefix by default, so that only bnodes from writers created by the same RdfWriterFactory share a prefix. Or setting the prefix could be left to the dump script.

On the other hand, any RDF client should discard bnode IDs anyway, and treat the same bnode ID from separate inputs as distinct. I can't see how RDF clients could otherwise accept arbitrary data. Isn't this actually part of the contract for bnodes?

Ah, I guess this is still a problem when joining dump file shards into a single file.

The problem with a UUID is that it would be very hard to test, I'm afraid. However, we could just set the prefix to genid{$shard}- instead of just genid, and I think that should work.
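A minimal sketch of the per-shard prefix idea, again in Python rather than purtle's PHP. ShardLabeler and labeler_for_shard are hypothetical names invented for this example and do not reflect the actual BNodeLabeler API; the only thing taken from the comment above is the genid{$shard}- prefix format.

```python
class ShardLabeler:
    """Hypothetical counter-based labeler with a configurable prefix."""

    def __init__(self, prefix: str = "genid"):
        self._prefix = prefix
        self._counter = 0

    def new_label(self) -> str:
        # Count up from 1, like the existing genidN scheme.
        self._counter += 1
        return f"{self._prefix}{self._counter}"


def labeler_for_shard(shard: int) -> ShardLabeler:
    """Build a labeler whose prefix embeds the shard number."""
    return ShardLabeler(prefix=f"genid{shard}-")


if __name__ == "__main__":
    first = labeler_for_shard(0)
    second = labeler_for_shard(1)
    print(first.new_label(), first.new_label())  # genid0-1 genid0-2
    print(second.new_label())                    # genid1-1
```

The trailing "-" in the prefix matters: without a separator, shard 1 with counter 11 and shard 11 with counter 1 would both yield genid111. Unlike a UUID, the output is deterministic, which keeps it easy to assert on in tests.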

Ah, I guess this is still a problem when joining dump file shards into a single file.

Yes, exactly.

Smalyshev triaged this task as Medium priority. Feb 22 2018, 1:26 AM

Change 413288 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[purtle@master] Add ability to set bnode labeler

https://gerrit.wikimedia.org/r/413288

Change 413290 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/extensions/Wikibase@master] Make bnodelabeler be aware of shards

https://gerrit.wikimedia.org/r/413290

Change 413288 merged by jenkins-bot:
[purtle@master] Add ability to set bnode labeler to writer & factory

https://gerrit.wikimedia.org/r/413288

Next step I guess would be to create v1.0.7 for purtle and update version in Wikibase. I don't own the packagist module so I assume it's either @daniel or @thiemowmde.

Change 420636 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[purtle@master] Update release notes

https://gerrit.wikimedia.org/r/420636

@Smalyshev, I am administrator on a lot of Packagist packages, but not on https://packagist.org/packages/wikimedia/purtle. In theory the only thing you need to do is to tag a new v1.0.7 release via git. Packagist should pick this up. If it does not, we might need to ping @daniel.

Change 420636 merged by jenkins-bot:
[purtle@master] Update release notes for version 1.0.7

https://gerrit.wikimedia.org/r/420636

@thiemowmde Packagist packages are not auto-updated from the Wikimedia GitHub mirror, afaik. But I took care of it.

Change 421101 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/core@master] Update purtle to 1.0.7

https://gerrit.wikimedia.org/r/421101

Change 421101 merged by jenkins-bot:
[mediawiki/core@master] Update purtle to 1.0.7

https://gerrit.wikimedia.org/r/421101

Change 423470 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[mediawiki/extensions/Wikibase@master] Add integration test for dumpRdf.php --part-id

https://gerrit.wikimedia.org/r/423470

hoo assigned this task to Smalyshev.
hoo removed a project: Patch-For-Review.

Change 413290 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Make bnodelabeler be aware of shards

https://gerrit.wikimedia.org/r/413290

Change 423470 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add integration test for dumpRdf.php --part-id

https://gerrit.wikimedia.org/r/423470