
2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error munging RDF org.openrdf.rio.RDFParseException: Expected '.', found 'g'`
Closed, Resolved · Public · BUG REPORT

Description

Documenting an item discussed on #wikimedia-search on IRC.

$ ./munge.sh -c 50000 -f /mnt/w/latest-all.ttl.gz -d /mnt/w/munge

...


09:18:55.138 [org.wikidata.query.rdf.tool.rdf.AsyncRDFHandler$RDFActionsReplayer] INFO  o.w.q.r.t.r.EntityMungingRdfHandler - Processed 14360000 entities at (1303, 1205, 1425)
09:18:57.851 [main] ERROR org.wikidata.query.rdf.tool.Munge - Fatal error munging RDF
org.openrdf.rio.RDFParseException: Expected '.', found 'g'
        at org.openrdf.rio.helpers.RDFParserHelper.reportFatalError(RDFParserHelper.java:440)
        at org.openrdf.rio.helpers.RDFParserBase.reportFatalError(RDFParserBase.java:685)
        at org.openrdf.rio.turtle.TurtleParser.reportFatalError(TurtleParser.java:1405)
        at org.openrdf.rio.turtle.TurtleParser.verifyCharacterOrFail(TurtleParser.java:1227)
        at org.openrdf.rio.turtle.TurtleParser.parseStatement(TurtleParser.java:261)
        at org.openrdf.rio.turtle.TurtleParser.parse(TurtleParser.java:214)
        at org.wikidata.query.rdf.tool.Munge.run(Munge.java:105)
        at org.wikidata.query.rdf.tool.Munge.main(Munge.java:59)

I attempted to re-run, but encountered the identical error after the "Processed 14360000 entities" mark.
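
One rough way to narrow down where the parse fails, if I understand the chunking right: munge.sh writes its output in chunks of -c entities to the -d directory, so the last chunk file written gives an approximate position. At 50000 entities per chunk, entity 14360000 falls around the 288th chunk (14360000 / 50000 = 287.2).

$ ls /mnt/w/munge | sort | tail -3
# The highest-numbered chunk marks approximately where parsing stopped.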

The file appears to correspond to https://dumps.wikimedia.org/wikidatawiki/entities/20230918/wikidata-20230918-all-BETA.ttl.gz .

wikidata-20230918-all-BETA.ttl.gz                  20-Sep-2023 17:03        129294028486
$ ls -al /mnt/w/latest-all.ttl.gz
-rwxrwxrwx 1 adam adam 129294028486 Sep 27 05:35 /mnt/w/latest-all.ttl.gz


$ zcat /mnt/w/latest-all.ttl.gz | head -60 | tail -30

wikibase:Dump a schema:Dataset,
                owl:Ontology ;
        cc:license <http://creativecommons.org/publicdomain/zero/1.0/> ;
        schema:softwareVersion "1.0.0" ;
        schema:dateModified "2023-09-18T23:00:01Z"^^xsd:dateTime ;
        owl:imports <http://wikiba.se/ontology-1.0.owl> .

data:Q31 a schema:Dataset ;
        schema:about wd:Q31 ;
        schema:version "1972785862"^^xsd:integer ;
        schema:dateModified "2023-09-12T03:32:26Z"^^xsd:dateTime ;
        wikibase:statements "837"^^xsd:integer ;
        wikibase:sitelinks "347"^^xsd:integer ;
        wikibase:identifiers "185"^^xsd:integer .

wd:Q31 a wikibase:Item .

<https://it.wikivoyage.org/wiki/Belgio> a schema:Article ;
        schema:about wd:Q31 ;
        schema:inLanguage "it" ;
        schema:isPartOf <https://it.wikivoyage.org/> ;
        schema:name "Belgio"@it .

https://dumps.wikimedia.org/wikidatawiki/entities/20230918/wikidata-20230918-md5sums.txt says

1380ee8434296b66bfad114de01ac1c2  wikidata-20230918-all.json.gz
b714d1321cae990f983721b985946905  wikidata-20230918-all-BETA.ttl.gz
298d892e21de32df6f052d5f8bc424c7  wikidata-20230918-all.json.bz2
4467822c1f93ac1896c706e66b27821a  wikidata-20230918-all-BETA.ttl.bz2
07e47420d538d497e7610556b35663a4  wikidata-20230918-all-BETA.nt.gz
a913a9712738fcdd4f7f1160d4742151  wikidata-20230918-all-BETA.nt.bz2

I haven't run the checksum or CRC check yet, as it seems to take a very long time, but I may do so after the attempt on a newer dump file completes.
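
For the record, the checks I have in mind are roughly the following. gzip -t only validates the gzip stream's own internal CRCs, while the md5sum would need to match the published value for wikidata-20230918-all-BETA.ttl.gz.

$ gzip -t /mnt/w/latest-all.ttl.gz
# Exits non-zero if the compressed stream's CRCs don't check out.

$ md5sum /mnt/w/latest-all.ttl.gz
# Compare against b714d1321cae990f983721b985946905 from the md5sums file above.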

I've kicked off the following with a newer dump, and it has processed about 61 million records so far without issues.

Fetched https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz, starting on the evening of 27 September 2023 (my local time).

latest-all.ttl.gz                                  27-Sep-2023 18:58        129419865335
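
(I didn't capture the exact download command; a resumable fetch along these lines is the sort of thing I'd use, assuming the mirror honors range requests.)

$ wget -c https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz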

Renamed, checked the size and beginning of the file, and started munging.

$ ls -al /mnt/w/latest-all.downloaded_20230928.ttl.gz
-rwxrwxrwx 1 adam adam 129419865335 Sep 28 04:55 /mnt/w/latest-all.downloaded_20230928.ttl.gz


$ zcat /mnt/w/latest-all.downloaded_20230928.ttl.gz | head -60 | tail -30

wikibase:Dump a schema:Dataset,
                owl:Ontology ;
        cc:license <http://creativecommons.org/publicdomain/zero/1.0/> ;
        schema:softwareVersion "1.0.0" ;
        schema:dateModified "2023-09-25T23:00:01Z"^^xsd:dateTime ;
        owl:imports <http://wikiba.se/ontology-1.0.owl> .

data:Q31 a schema:Dataset ;
        schema:about wd:Q31 ;
        schema:version "1978999922"^^xsd:integer ;
        schema:dateModified "2023-09-21T20:00:15Z"^^xsd:dateTime ;
        wikibase:statements "837"^^xsd:integer ;
        wikibase:sitelinks "348"^^xsd:integer ;
        wikibase:identifiers "185"^^xsd:integer .

wd:Q31 a wikibase:Item .

<https://it.wikivoyage.org/wiki/Belgio> a schema:Article ;
        schema:about wd:Q31 ;
        schema:inLanguage "it" ;
        schema:isPartOf <https://it.wikivoyage.org/> ;
        schema:name "Belgio"@it .

<https://it.wikivoyage.org/> wikibase:wikiGroup "wikivoyage" .

<https://zh.wikivoyage.org/wiki/%E6%AF%94%E5%88%A9%E6%97%B6> a schema:Article ;
        schema:about wd:Q31 ;
        schema:inLanguage "zh" ;
        schema:isPartOf <https://zh.wikivoyage.org/> ;


$ ./munge.sh -c 50000 -f /mnt/w/latest-all.downloaded_20230928.ttl.gz -d /mnt/w/munge_on_later_data_set

Apparently, the 2023-09-18 .ttl imported okay into HDFS (although I haven't yet inspected its quality). So more exploration of https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/search/dags/import_ttl.py against the mvn-packaged dist/target munge.sh may be in order, to see whether there are differences in the processing.

Also, the data-reload.py cookbook uses a `-- --skolemize` option in its invocation, so it may be worth trying that to see whether the error still surfaces.
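
If I retry with the older dump, the invocation would look something like the following, assuming munge.sh forwards the arguments after -- to the underlying Munge tool the way the cookbook's invocation suggests.

$ ./munge.sh -c 50000 -f /mnt/w/latest-all.ttl.gz -d /mnt/w/munge_skolemized -- --skolemize
# Output directory name is arbitrary; kept separate from the earlier run here.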

Another option is to use a Blazegraph .jnl database file to bootstrap a fuller local copy of the WDQS data without all of this munging. I haven't had luck downloading the .jnl from Cloudflare R2, but @bking has kindly attempted a download (it seems to be working) and is looking into what a compressed .jnl might look like (see T344905#9208233). For me, the roughly 1 TB file has variously made it to 100-200 GB before the connection fails, and resumed connections don't seem to work either. I'm considering whether I could SCP the file down and have better luck.

Event Timeline

dr0ptp4kt triaged this task as Medium priority.
dr0ptp4kt moved this task from needs triage to Current work on the Discovery-Search board.
dr0ptp4kt updated Other Assignee, removed: dr0ptp4kt.

Update - the newer dump munged without any problems.

The addshore .jnl (August file) download completed, using the Linux tool axel. Working from memory: on my 1 Gbps connection, the first 800 GB or so downloaded over the first 3-4 hours; then, as some of the Cloudflare connections seemed to drop off, the remaining 400 GB or so took another 18 hours, so total download time was about 22 hours. Next will be to verify that it loads cleanly.
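
For reference, the axel invocation was roughly along these lines; the flags and URL here are illustrative (the actual R2 endpoint is the one discussed in T344905), and axel can resume interrupted transfers via its state file.

$ axel -n 8 -o wikidata.jnl https://<r2-endpoint>/wikidata.jnl
# -n sets the number of parallel connections; -o names the output file.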

The addshore .jnl (August file) does launch nicely with ./runBlazegraph.sh.
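
For anyone repeating this: the service reads the journal path from RWStore.properties (the standard Blazegraph property com.bigdata.journal.AbstractJournal.file), and the local config typically expects a file named wikidata.jnl in the service directory, so the downloaded journal either needs to be renamed/symlinked to match or the property updated. The filename below is a placeholder for whatever the downloaded .jnl is called.

$ ln -s /mnt/w/addshore-august.jnl wikidata.jnl
# ...or set the path explicitly in RWStore.properties:
# com.bigdata.journal.AbstractJournal.file=/mnt/w/addshore-august.jnl
$ ./runBlazegraph.sh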

I did manage to run a sha1sum on the older dump, the one where the munge had failed.

/mnt/w$ time sha1sum latest-all.ttl.gz
dedad5a589b3a3661a1f9ebb7f1a6bcbce1b4ef2  latest-all.ttl.gz

real    28m47.000s
user    3m21.104s
sys     0m46.825s

$ ls -al latest-all.ttl.gz
-rwxrwxrwx 1 adam adam 129294028486 Sep 27 05:35 latest-all.ttl.gz

It looks like the data was corrupted somewhere, either in the transfer, in persistence to disk, or afterward. This sha1sum doesn't match anything I can find. It's conceivable that something went wrong during the sha1sum run itself, but I'm not going to spend more time on this; I just wanted to document it for our future selves. One remark: normally you would expect the download itself to fail if the corruption had happened in transfer.
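
If the dump run also publishes a sha1sums file alongside the md5sums (I believe there is a wikidata-20230918-sha1sums.txt, though I haven't confirmed it), a direct comparison would look like this:

$ curl -s https://dumps.wikimedia.org/wikidatawiki/entities/20230918/wikidata-20230918-sha1sums.txt | grep 'all-BETA.ttl.gz'
# The all-BETA.ttl.gz entry should match dedad5a589b3a3661a1f9ebb7f1a6bcbce1b4ef2 above; a mismatch would confirm the corruption.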

I'm going to close this for now, given that the later dump munged okay and the underlying issue seems to be somewhere in the file transfer or storage. The `-- --skolemize` option is still worth considering for any future run, nonetheless.