Documenting an item discussed on #wikimedia-search on IRC.
$ ./munge.sh -c 50000 -f /mnt/w/latest-all.ttl.gz -d /mnt/w/munge
...
09:18:55.138 [org.wikidata.query.rdf.tool.rdf.AsyncRDFHandler$RDFActionsReplayer] INFO o.w.q.r.t.r.EntityMungingRdfHandler - Processed 14360000 entities at (1303, 1205, 1425)
09:18:57.851 [main] ERROR org.wikidata.query.rdf.tool.Munge - Fatal error munging RDF
org.openrdf.rio.RDFParseException: Expected '.', found 'g'
    at org.openrdf.rio.helpers.RDFParserHelper.reportFatalError(RDFParserHelper.java:440)
    at org.openrdf.rio.helpers.RDFParserBase.reportFatalError(RDFParserBase.java:685)
    at org.openrdf.rio.turtle.TurtleParser.reportFatalError(TurtleParser.java:1405)
    at org.openrdf.rio.turtle.TurtleParser.verifyCharacterOrFail(TurtleParser.java:1227)
    at org.openrdf.rio.turtle.TurtleParser.parseStatement(TurtleParser.java:261)
    at org.openrdf.rio.turtle.TurtleParser.parse(TurtleParser.java:214)
    at org.wikidata.query.rdf.tool.Munge.run(Munge.java:105)
    at org.wikidata.query.rdf.tool.Munge.main(Munge.java:59)
I attempted to re-run, but encountered the identical error after the "Processed 14360000 entities" mark.
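One check that doesn't require a re-download is to have gzip verify the whole compressed stream; if the archive itself is truncated or corrupt, that shows up without involving the Turtle parser at all. A minimal sketch against the same file:

# Test the gzip stream end to end; a CRC or length error here implicates
# the archive rather than the munger's Turtle parsing.
$ gzip -tv /mnt/w/latest-all.ttl.gz

# Alternative form: decompress to /dev/null and check the exit status.
$ zcat /mnt/w/latest-all.ttl.gz > /dev/null; echo "zcat exit: $?"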
The file appears to correspond to https://dumps.wikimedia.org/wikidatawiki/entities/20230918/wikidata-20230918-all-BETA.ttl.gz .
wikidata-20230918-all-BETA.ttl.gz 20-Sep-2023 17:03 129294028486
$ ls -al /mnt/w/latest-all.ttl.gz
-rwxrwxrwx 1 adam adam 129294028486 Sep 27 05:35 /mnt/w/latest-all.ttl.gz
$ zcat /mnt/w/latest-all.ttl.gz | head -60 | tail -30
wikibase:Dump a schema:Dataset, owl:Ontology ;
    cc:license <http://creativecommons.org/publicdomain/zero/1.0/> ;
    schema:softwareVersion "1.0.0" ;
    schema:dateModified "2023-09-18T23:00:01Z"^^xsd:dateTime ;
    owl:imports <http://wikiba.se/ontology-1.0.owl> .

data:Q31 a schema:Dataset ;
    schema:about wd:Q31 ;
    schema:version "1972785862"^^xsd:integer ;
    schema:dateModified "2023-09-12T03:32:26Z"^^xsd:dateTime ;
    wikibase:statements "837"^^xsd:integer ;
    wikibase:sitelinks "347"^^xsd:integer ;
    wikibase:identifiers "185"^^xsd:integer .

wd:Q31 a wikibase:Item .

<https://it.wikivoyage.org/wiki/Belgio> a schema:Article ;
    schema:about wd:Q31 ;
    schema:inLanguage "it" ;
    schema:isPartOf <https://it.wikivoyage.org/> ;
    schema:name "Belgio"@it .
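Since each entity block in the dump opens with a data:Q… header (as in the excerpt above), the Turtle in the neighbourhood of the failure can be pulled straight from the compressed stream. A sketch, assuming the munger's entity counter roughly tracks those headers:

# Stream until roughly the 14,360,000th data: header, print a few entities'
# worth of Turtle around that point, then stop reading.
$ zcat /mnt/w/latest-all.ttl.gz \
    | awk '/^data:/ { n++ } n >= 14359999 { print } n > 14360002 { exit }'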
https://dumps.wikimedia.org/wikidatawiki/entities/20230918/wikidata-20230918-md5sums.txt says
1380ee8434296b66bfad114de01ac1c2  wikidata-20230918-all.json.gz
b714d1321cae990f983721b985946905  wikidata-20230918-all-BETA.ttl.gz
298d892e21de32df6f052d5f8bc424c7  wikidata-20230918-all.json.bz2
4467822c1f93ac1896c706e66b27821a  wikidata-20230918-all-BETA.ttl.bz2
07e47420d538d497e7610556b35663a4  wikidata-20230918-all-BETA.nt.gz
a913a9712738fcdd4f7f1160d4742151  wikidata-20230918-all-BETA.nt.bz2
I haven't run the checksum or CRC yet as it seems to take a very long time, but may do so after the attempt on a newer dump file completes.
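When the time comes, the verification could look something like the following (a sketch; the sed step just rewrites the published filename to the renamed local path):

# Fetch the published checksum line for the .ttl.gz, point it at the local
# file, and let md5sum verify it.
$ curl -s https://dumps.wikimedia.org/wikidatawiki/entities/20230918/wikidata-20230918-md5sums.txt \
    | grep 'all-BETA\.ttl\.gz' \
    | sed 's|wikidata-20230918-all-BETA.ttl.gz|/mnt/w/latest-all.ttl.gz|' \
    | md5sum -c -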
I've kicked off the following with a newer dump; it has processed about 61 million entities so far without issues.
Fetched https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz, starting on the evening of 27 September 2023 (my local time).
latest-all.ttl.gz 27-Sep-2023 18:58 129419865335
Renamed the file, checked its size and the beginning of its contents, and started munging.
$ ls -al /mnt/w/latest-all.downloaded_20230928.ttl.gz
-rwxrwxrwx 1 adam adam 129419865335 Sep 28 04:55 /mnt/w/latest-all.downloaded_20230928.ttl.gz
$ zcat /mnt/w/latest-all.downloaded_20230928.ttl.gz | head -60 | tail -30
wikibase:Dump a schema:Dataset, owl:Ontology ;
    cc:license <http://creativecommons.org/publicdomain/zero/1.0/> ;
    schema:softwareVersion "1.0.0" ;
    schema:dateModified "2023-09-25T23:00:01Z"^^xsd:dateTime ;
    owl:imports <http://wikiba.se/ontology-1.0.owl> .

data:Q31 a schema:Dataset ;
    schema:about wd:Q31 ;
    schema:version "1978999922"^^xsd:integer ;
    schema:dateModified "2023-09-21T20:00:15Z"^^xsd:dateTime ;
    wikibase:statements "837"^^xsd:integer ;
    wikibase:sitelinks "348"^^xsd:integer ;
    wikibase:identifiers "185"^^xsd:integer .

wd:Q31 a wikibase:Item .

<https://it.wikivoyage.org/wiki/Belgio> a schema:Article ;
    schema:about wd:Q31 ;
    schema:inLanguage "it" ;
    schema:isPartOf <https://it.wikivoyage.org/> ;
    schema:name "Belgio"@it .

<https://it.wikivoyage.org/> wikibase:wikiGroup "wikivoyage" .

<https://zh.wikivoyage.org/wiki/%E6%AF%94%E5%88%A9%E6%97%B6> a schema:Article ;
    schema:about wd:Q31 ;
    schema:inLanguage "zh" ;
    schema:isPartOf <https://zh.wikivoyage.org/> ;

$ ./munge.sh -c 50000 -f /mnt/w/latest-all.downloaded_20230928.ttl.gz -d /mnt/w/munge_on_later_data_set
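While that runs, progress can be sanity-checked from the chunk files accumulating in the output directory (a sketch; the exact chunk naming doesn't matter for this):

# How many munged chunks have been written so far, and which are newest.
$ ls /mnt/w/munge_on_later_data_set | wc -l
$ ls -lt /mnt/w/munge_on_later_data_set | head -5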
Apparently, the 2023-09-18 .ttl imported okay into HDFS (although I haven't inspected its quality yet). So more exploration of https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/search/dags/import_ttl.py against the mvn-packaged dist/target munge.sh may be in order, to see whether the two differ in how they process the dump.
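As a starting point for that comparison, the DAG's munge-related settings can be pulled out directly (a sketch; the /-/raw/ URL form is assumed to mirror the /-/blob/ link above):

# Grab the DAG source and show the lines that mention munging, with context.
$ curl -s 'https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/raw/main/search/dags/import_ttl.py' \
    | grep -n -i -B1 -A3 'munge'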
Also, the data-reload.py cookbook passes -- --skolemize in its munge invocation, so it may be worth trying that option to see whether the error still surfaces; a sketch of such an invocation follows.
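I haven't confirmed the exact pass-through syntax, but assuming munge.sh forwards the trailing arguments to the underlying Munge tool the way the cookbook's invocation suggests, it might look like this (the output directory name is just illustrative):

# Hypothetical: skolemize blank nodes during munging, mirroring data-reload.py.
$ ./munge.sh -c 50000 -f /mnt/w/latest-all.downloaded_20230928.ttl.gz \
    -d /mnt/w/munge_skolemized -- --skolemize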
Another option is to use a Blazegraph .jnl database file to bootstrap a fuller local copy of the WDQS data without all of this munging. I haven't had luck downloading the .jnl from Cloudflare R2, but @bking has kindly attempted a download (which seems to be working) and is looking into what a compressed .jnl might look like (see T344905#9208233). For me, the roughly 1 TB file has variously made it to 100-200 GB before the connection fails, and resumed connections don't seem to work either. I'm considering whether I could scp the file down and have better luck.
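Resuming hasn't worked so far, but before falling back to scp it may be worth one more attempt with a client that retries aggressively and issues explicit range requests (a sketch; the URL is a placeholder for the actual R2 location of the .jnl):

# Hypothetical URL. wget -c resumes a partial file; --tries=0 retries indefinitely.
$ wget -c --tries=0 --retry-connrefused https://<r2-endpoint>/wikidata.jnl

# curl equivalent: -C - resumes from the current partial file size, with retries.
$ curl -C - -O --retry 20 https://<r2-endpoint>/wikidata.jnl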