Wikidata dump is missing entities
Closed, Resolved (Public)

Description

The sizes of the RDF dumps have dropped between the 20160411 and 20160425 dumps:

wikidata-20160411-all-BETA.ttl.bz2                 12-Apr-2016 15:21          7055848308
wikidata-20160411-all-BETA.ttl.gz                  12-Apr-2016 12:32          9354187820

wikidata-20160425-all-BETA.ttl.bz2                 26-Apr-2016 14:39          6164319357
wikidata-20160425-all-BETA.ttl.gz                  26-Apr-2016 12:10          8256496024

And some entities are missing, e.g. Q23760660 doesn't seem to be in the new dump, even though https://www.wikidata.org/wiki/Q23760660 exists.

Event Timeline

Dump from 20160411 does have Q23760660. 20160418 does not. JSON dumps do have it, so it is a problem unique to RDF dumps.

I am comparing the dumps. This should give a count of the number of items in each dump.

zcat wikidata-20160411-all-BETA.ttl.gz | grep 'wdata:Q' | sort > wikidata-20160411.ttl
zcat wikidata-20160418-all-BETA.ttl.gz | grep 'wdata:Q' | sort > wikidata-20160418.ttl

wc -l wikidata-20160411.ttl
21956833 wikidata-20160411.ttl

wc -l wikidata-20160418.ttl
17915507 wikidata-20160418.ttl

Looks like about 1/5 is missing. One shard? Can we check the logs on the actual dump machine and see whether there is an error message?
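As a quick check of the 1/5 estimate, the arithmetic on the two line counts above can be sketched as follows. The helper name missing_pct is made up for illustration; only the two counts come from the wc output.

```shell
# Line counts reported by wc -l for the two dumps.
old_count=21956833
new_count=17915507

# Share of entries missing from the newer dump, as a percentage with one decimal.
# (missing_pct is a hypothetical helper, not part of any dump tooling.)
missing_pct() {
    awk -v o="$1" -v n="$2" 'BEGIN { printf "%.1f\n", (o - n) / o * 100 }'
}

echo "missing entries: $((old_count - new_count))"
echo "missing share: $(missing_pct "$old_count" "$new_count")%"
```

That works out to 4041326 missing entries, roughly 18.4% of the 20160411 count, which is consistent with "about 1/5".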

@Smalyshev I am looking...

In dumpwikidatattl-wikidata-20160418-all-BETA-1.log, we have "Processed 5516718 entities."

In dumpwikidatattl-wikidata-20160418-all-BETA-1.log, I see the following, and then I think the script dies:

Processed 1378502 entities.
Exception encountered, of type "LogicException"
[1a1b2025b435d1111cb79805] [no req] LogicException from line 522 of /srv/mediawiki/php-1.27.0-wmf.21/extensions/Wikidata/extensions/Wikibase/purtle/src/RdfWriterBase.php: Bad transition: 5 -> 11
Backtrace:
#0 /srv/mediawiki/php-1.27.0-wmf.21/extensions/Wikidata/extensions/Wikibase/purtle/src/RdfWriterBase.php(399): Wikimedia\Purtle\RdfWriterBase->state(integer)
#1 /srv/mediawiki/php-1.27.0-wmf.21/extensions/Wikidata/extensions/Wikibase/purtle/src/RdfWriterBase.php(381): Wikimedia\Purtle\RdfWriterBase->say(string)
#2 /srv/mediawiki/php-1.27.0-wmf.21/extensions/Wikidata/extensions/Wikibase/repo/includes/Rdf/Values/ComplexValueRdfHelper.php(92): Wikimedia\Purtle\RdfWriterBase->a(string, string)
#3 /srv/mediawiki/php-1.27.0-wmf.21/extensions/Wikidata/extensions/Wikibase/repo/includes/Rdf/Values/QuantityRdfBuilder.php(83): Wikibase\Rdf\Values\ComplexValueRdfHelper->attachValueNode(Wikimedia\Purtle\TurtleRdfWriter, string, string, string, DataValues\QuantityValue)
#4 /srv/mediawiki/php-1.27.0-wmf.21/extensions/Wikidata/extensions/Wikibase/repo/includes/Rdf/Values/QuantityRdfBuilder.php(57): Wikibase\Rdf\Values\QuantityRdfBuilder->addValueNode(Wikimedia\Purtle\TurtleRdfWriter, string, string, string, DataValues\QuantityValue)
#5 /srv/mediawiki/php-1.27.0-wmf.21/extensions/Wikidata/extensions/Wikibase/repo/includes/Rdf/DispatchingValueSnakRdfBuilder.php(55): Wikibase\Rdf\Values\QuantityRdfBuilder->addValue(Wikimedia\Purtle\TurtleRdfWriter, string, string, string, Wikibase\DataModel\Snak\PropertyValueSnak)
#6 /srv/mediawiki/php-1.27.0-wmf.21/extensions/Wikidata/extensions/Wikibase/repo/includes/Rdf/SnakRdfBuilder.php(138): Wikibase\Rdf\DispatchingValueSnakRdfBuilder->addValue(Wikimedia\Purtle\TurtleRdfWriter, string, string, string, Wikibase\DataModel\Snak\PropertyValueSnak)
#7 /srv/mediawiki/php-1.27.0-wmf.21/extensions/Wikidata/extensions/Wikibase/repo/includes/Rdf/SnakRdfBuilder.php(93): Wikibase\Rdf\SnakRdfBuilder->addSnakValue(Wikimedia\Purtle\TurtleRdfWriter, Wikibase\DataModel\Snak\PropertyValueSnak, string)
#8 /srv/mediawiki/php-1.27.0-wmf.21/extensions/Wikidata/extensions/Wikibase/repo/includes/Rdf/FullStatementRdfBuilder.php(163): Wikibase\Rdf\SnakRdfBuilder->addSnak(Wikimedia\Purtle\TurtleRdfWriter, Wikibase\DataModel\Snak\PropertyValueSnak, string)
#9 /srv/mediawiki/php-1.27.0-wmf.21/extensions/Wikidata/extensions/Wikibase/repo/includes/Rdf/FullStatementRdfBuilder.php(143): Wikibase\Rdf\FullStatementRdfBuilder->addStatement(Wikibase\DataModel\Entity\ItemId, Wikibase\DataModel\Statement\Statement, boolean)
#10 /srv/mediawiki/php-1.27.0-wmf.21/extensions/Wikidata/extensions/Wikibase/repo/includes/Rdf/FullStatementRdfBuilder.php(235): Wikibase\Rdf\FullStatementRdfBuilder->addStatements(Wikibase\DataModel\Entity\ItemId, Wikibase\DataModel\Statement\StatementList)
#11 /srv/mediawiki/php-1.27.0-wmf.21/extensions/Wikidata/extensions/Wikibase/repo/includes/Rdf/RdfBuilder.php(400): Wikibase\Rdf\FullStatementRdfBuilder->addEntity(Wikibase\DataModel\Entity\Item)
#12 /srv/mediawiki/php-1.27.0-wmf.21/extensions/Wikidata/extensions/Wikibase/repo/includes/Dumpers/RdfDumpGenerator.php(115): Wikibase\Rdf\RdfBuilder->addEntity(Wikibase\DataModel\Entity\Item)
#13 /srv/mediawiki/php-1.27.0-wmf.21/extensions/Wikidata/extensions/Wikibase/repo/includes/Dumpers/DumpGenerator.php(304): Wikibase\Dumpers\RdfDumpGenerator->generateDumpForEntityId(Wikibase\DataModel\Entity\ItemId)
#14 /srv/mediawiki/php-1.27.0-wmf.21/extensions/Wikidata/extensions/Wikibase/repo/includes/Dumpers/DumpGenerator.php(275): Wikibase\Dumpers\DumpGenerator->dumpEntities(array, integer)
#15 /srv/mediawiki/php-1.27.0-wmf.21/extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpEntities.php(178): Wikibase\Dumpers\DumpGenerator->generateDump(Wikibase\Repo\Store\SQL\EntityPerPageIdPager)
#16 /srv/mediawiki/php-1.27.0-wmf.21/extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php(108): Wikibase\DumpScript->execute()
#17 /srv/mediawiki/php-1.27.0-wmf.21/maintenance/doMaintenance.php(103): Wikibase\DumpRdf->execute()
#18 /srv/mediawiki/php-1.27.0-wmf.21/extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php(143): include(string)
#19 /srv/mediawiki/multiversion/MWScript.php(97): include(string)
#20 {main}

Shards 2 and 3 don't have any errors, but I think the amount missing from shard 1 helps explain this.
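For reference, the per-shard log check described above could be scripted roughly like this. It is only a sketch: check_shard_logs is a hypothetical helper, and it assumes each log contains "Processed N entities." progress lines and an "Exception encountered" line on failure, as in the excerpt quoted above.

```shell
# For each log file given as an argument, print a tab-separated line with:
# file name, ok/FAILED, and the last "Processed N entities" progress message.
# (check_shard_logs is a hypothetical helper name.)
check_shard_logs() {
    for log in "$@"; do
        if grep -q 'Exception encountered' "$log"; then
            status=FAILED
        else
            status=ok
        fi
        printf '%s\t%s\t%s\n' "$log" "$status" \
            "$(grep -o 'Processed [0-9][0-9]* entities' "$log" | tail -n 1)"
    done
}

# Example, using the log file naming seen above:
# check_shard_logs dumpwikidatattl-wikidata-20160418-all-BETA-*.log
```

A shard whose last progress count is far below the others, together with a FAILED status, would point at the shard that died early.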

Aha! Thanks a lot! That seems to be the cause. I'll look into it; it probably should not be hard to fix.

We should probably create a new dump tomorrow and then delete the old one. If no one objects, I'll do that tomorrow.

A new dump will be created on Monday anyway, so that might not be necessary.

Here is a list of the missing entities:

http://dumps.filbertkm.com/wikidata-missing-rdf.txt

The RDF dump script can work with such a list, and that may be a way to find out whether there is a specific entity on which the script fails.
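One way a list like this could be derived from two sets of grep'd 'wdata:Q' lines is sketched below. This is not necessarily how the linked file was produced. It assumes every matching line carries the ID in the form wdata:Q followed by digits, and missing_ids is a made-up helper name.

```shell
# Print the entity IDs that appear in the first file but not in the second.
# Arguments: two files of lines containing 'wdata:Q<digits>'.
# (missing_ids is a hypothetical helper name.)
missing_ids() {
    old_ids=$(mktemp)
    new_ids=$(mktemp)
    # Reduce each line to its bare ID; LC_ALL=C keeps sort and comm ordering consistent.
    grep -o 'wdata:Q[0-9][0-9]*' "$1" | sed 's/^wdata://' | LC_ALL=C sort -u > "$old_ids"
    grep -o 'wdata:Q[0-9][0-9]*' "$2" | sed 's/^wdata://' | LC_ALL=C sort -u > "$new_ids"
    # comm -23 keeps only the lines unique to the first file.
    LC_ALL=C comm -23 "$old_ids" "$new_ids"
    rm -f "$old_ids" "$new_ids"
}

# Example, using the two files created earlier in this task:
# missing_ids wikidata-20160411.ttl wikidata-20160418.ttl > missing.txt
```

Extracting the bare IDs first also avoids empty lines in the output and yields one ID per line.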

@aude the script may fail in the middle of an entity, so the entity ID may be in the dump but with only part of its data. Is there any way to test the script on the real data?

@hoo before we create a new dump, I think we have to find the bug. Otherwise the new dump will be as short as the old one.

Looking into missing IDs, the stream of breakage starts somewhere around:

5871286
5891261
5898695
5940875 <-----
5940884 
5940887
5940891
5940903
5940904
5940907
5940908
5940913

After the marked one, almost every ID in this shard is missing. Preceding Q5940875 in the same shard is Q5940871 but I can't find any anomaly in either from here. Must be some kind of interaction with other entities.

BTW, missing IDs file has a lot of empty lines, is that supposed to be that way?

Mentioned in SAL [2016-04-29T18:12:55Z] <jzerebecki@tin> Synchronized php-1.27.0-wmf.22/extensions/Wikidata/extensions/Wikibase/repo/includes/Dumpers/DumpGenerator.php: wmf.22 fc20c54f7915b94ec0d15ef17e207c116910623d 1 of 2 T133924 (duration: 00m 44s)

Mentioned in SAL [2016-04-29T18:20:27Z] <jzerebecki@tin> Synchronized php-1.27.0-wmf.22/extensions/Wikidata/extensions/Wikibase/repo/includes/Dumpers/DumpGenerator.php: wmf.22 fc20c54f7915b94ec0d15ef17e207c116910623d 1 of 2 T133924 (duration: 00m 29s)

https://www.wikidata.org/wiki/Special:EntityData/Q5940875.ttl?flavor=dump seems to work just fine. So I imagine this is some interaction with the dump environment, like deduplication...

Change 286262 had a related patch set uploaded (by Aude):
Backport change to purtle

https://gerrit.wikimedia.org/r/286262

Change 286262 merged by jenkins-bot:
Backport change to purtle

https://gerrit.wikimedia.org/r/286262

When dumping shard 1 with https://gerrit.wikimedia.org/r/286262, I don't get any errors for Q5940875.

Huh, strange. Can't think of a way that would happen.

Actually - what happens if you try to dump Q5940875 twice? If it's in the ID list twice, it gets dumped twice, right? That should trigger any issues related to deduplication.

It's nice to see that the patch fixed it, but it would be nice if we could understand what exactly went wrong...

If Q5940875 is listed twice, it gets dumped twice (with the old code).

And no error? Huh. Then the issue isn't with deduplication.

Extremely weird. I mean, I am happy the dump is not failing anymore, but I'd still like to know why it was failing in the first place. The fact that the change with the state check fixed it confirms that this branch is likely the problem, but I have absolutely no idea *how* this problem is caused...
In the meantime, can we get this patch onto whatever is doing the regular dumps, so that the next weekly dump is not broken?

@Smalyshev we can officially deploy the patch on Monday.

Great, hopefully before regular dumps start (not sure when they do, I thought on Monday too?).

Change 286411 had a related patch set uploaded (by JanZerebecki):
Don't publish Wikidata dumps if a shard failed

https://gerrit.wikimedia.org/r/286411

Change 286434 had a related patch set uploaded (by JanZerebecki):
Update Wikidata - fix for rdf dumps

https://gerrit.wikimedia.org/r/286434

Change 286434 merged by jenkins-bot:
Update Wikidata - fix for rdf dumps

https://gerrit.wikimedia.org/r/286434

Change 286411 merged by Dzahn:
Don't publish Wikidata dumps if a shard failed

https://gerrit.wikimedia.org/r/286411

hoo assigned this task to aude.
hoo removed a project: Patch-For-Review.

Fixed in latest dump: https://dumps.wikimedia.org/wikidatawiki/entities/20160502/

Also, with 3923963f8a30b086bc9a85e6c018ee3480199404 we made sure we won't publish dumps when a shard has failed, thus I consider this fixed.
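For readers who want the general idea behind not publishing when a shard fails, here is a minimal sketch. It is not the actual change referenced above: run_shards, the per-shard command, the shard count of 4, and the publish step in the example are all hypothetical.

```shell
# Run one command per shard in the background and succeed only if every
# shard exits with status 0. The first argument is the shard count; the
# remaining arguments are the command, which receives the shard number as
# its last argument. (run_shards is a hypothetical helper name.)
run_shards() {
    shard_count=$1
    shift
    pids=""
    i=0
    while [ "$i" -lt "$shard_count" ]; do
        "$@" "$i" &
        pids="$pids $!"
        i=$((i + 1))
    done
    failed=0
    for pid in $pids; do
        # wait returns the exit status of that specific background job.
        wait "$pid" || failed=1
    done
    return "$failed"
}

# Example with a hypothetical per-shard command and publish step:
# if run_shards 4 ./dump_one_shard.sh; then
#     mv "$tmp_dump" "$public_dump"
# else
#     echo "a shard failed; not publishing" >&2
# fi
```

Collecting every shard's exit status before publishing means a single early death, like the one in shard 1 above, leaves the previous good dump in place instead of replacing it with a truncated one.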