
Fatal error munging RDF org.openrdf.rio.RDFParseException: Expected '.', found 's' [line 595492] when trying to get local wikidata copy based on blazegraph running
Closed, Declined · Public

Description

https://phabricator.wikimedia.org/T178211 reports a similar error, which was marked as invalid at the time.
See http://wiki.bitplan.com/index.php/WikiData_Import_2020-09-11 for a description of what I tried.
I tried to follow the steps in https://github.com/wikimedia/wikidata-query-rdf/blob/master/docs/getting-started.md
and when running:

nohup ./munge.sh -f data/latest-all.ttl.gz -d data/split -l en,de&

The error message below appears:

08:56:33.415 [main] ERROR org.wikidata.query.rdf.tool.Munge - Fatal error munging RDF
org.openrdf.rio.RDFParseException: Expected '.', found 's' [line 595492]
	at org.openrdf.rio.helpers.RDFParserHelper.reportFatalError(RDFParserHelper.java:440)
	at org.openrdf.rio.helpers.RDFParserBase.reportFatalError(RDFParserBase.java:685)
	at org.openrdf.rio.turtle.TurtleParser.reportFatalError(TurtleParser.java:1405)
	at org.openrdf.rio.turtle.TurtleParser.verifyCharacterOrFail(TurtleParser.java:1227)
	at org.openrdf.rio.turtle.TurtleParser.parseStatement(TurtleParser.java:261)
	at org.openrdf.rio.turtle.TurtleParser.parse(TurtleParser.java:214)
	at org.wikidata.query.rdf.tool.Munge.run(Munge.java:105)
	at org.wikidata.query.rdf.tool.Munge.main(Munge.java:59)
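Since the garbage shows up mid-file, a truncated or corrupted download is one possible cause worth ruling out first. A minimal sketch for checking the archive's integrity with only the Python standard library (the file path matches the command above; the helper name is my own):

```python
import gzip

def check_gzip_integrity(path: str) -> int:
    """Stream-decompress the whole archive; raises EOFError on a truncated
    file and an OSError/zlib.error on corrupt data. Returns the number of
    decompressed bytes read."""
    total = 0
    with gzip.open(path, "rb") as fh:
        while True:
            chunk = fh.read(1 << 20)  # 1 MiB at a time; avoids huge RAM use
            if not chunk:
                break
            total += len(chunk)
    return total

# check_gzip_integrity("data/latest-all.ttl.gz")
```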

Event Timeline

CBogen triaged this task as Medium priority. Sep 28 2020, 3:22 PM
CBogen raised the priority of this task from Medium to Needs Triage.
CBogen moved this task from Incoming to User support on the Wikidata-Query-Service board.
CBogen triaged this task as Medium priority. Sep 28 2020, 3:25 PM

For me it's a showstopper. I'd love to work on https://github.com/somnathrakshit/geograpy3/issues/15, and currently I have the following options for running the SPARQL query it needs:

  1. Wikidata Query Service - the query just barely avoids timing out, but SPARQLWrapper chokes on the result: https://github.com/RDFLib/sparqlwrapper/issues/163
  2. Local Wikidata copy 2020 based on Apache Jena - the query runs for multiple hours, and the last time I tried I didn't know whether the 10-hour proxy timeout would be sufficient
  3. Local Wikidata copy 2018 based on Blazegraph - that's the one I used. The query takes less than a minute and gives me 17,000 differences to the GeoNames dataset with the same IDs - I wonder how many of these are just due to the two-year gap in currency between the two datasets
  4. Local Wikidata copy this issue is about - based on Blazegraph - currently not working
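Regarding option 1: as a sketch of a possible workaround for the SPARQLWrapper issue, the public WDQS endpoint can also be queried with only the Python standard library (the function name and User-Agent string are my own; WDQS returns SPARQL JSON results when asked for them via the Accept header):

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"  # public endpoint from option 1

def wdqs_request(query: str) -> urllib.request.Request:
    """Build a GET request asking WDQS for JSON results -
    a plain-stdlib alternative when SPARQLWrapper misbehaves."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(url, headers={
        "Accept": "application/sparql-results+json",
        "User-Agent": "wdqs-debug-sketch/0.1 (local testing)",
    })

# Actually running it requires network access:
# with urllib.request.urlopen(wdqs_request(
#         "SELECT ?x WHERE { ?x wdt:P31 wd:Q5 } LIMIT 1")) as resp:
#     results = json.load(resp)
```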

I tried debugging line 595492 mentioned in the error message. See my Stack Overflow answer to "How to get a few lines from a .gz compressed file without uncompressing it":

gunzip -c latest-all.ttl.gz | awk -v from=595490 -v to=595495 'NR>=from { print NR,$0; if (NR>=to) exit 1}'
595490 	wdt:P1943 <http://commons.wikimedia.org/wiki/Special:FilePath/Indonesia%20Bangka%20Belitung%20Islands%20location%20map.svg> ;
595491 	wdt:P7867 wd:Q84077164 ;
595492 	p:P1036 s:Q1866-24430318-4817-45ba-P1036 s:Q3258818 ;
595493 	wd66-986-0d0ddac20ed29d01ba-P1036 s:Q3258818 ;
595494 	wd66-986-0d0ddac20ed29s://nl.wikiSger ;
595495 	w64"^^xsd:inter:P1://nl.wikiDeng":P7edR:P1:318-s4817-45"2--598196""Bangka Belitpwd:Q8s:qP10366-24B52B-3501-4903-999E-28CDCE6B3CA09d01baqP10366-24B52B-3501-4903-999E-28CDCE6B3CA09s://nl.wikiSger ;

I can't see where the problem is.
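The same line-window extraction as the gunzip/awk one-liner above can be sketched in pure Python (stdlib only; the helper name and the path data/latest-all.ttl.gz are assumptions matching the command above):

```python
import gzip
import itertools

def window(path: str, start: int, stop: int):
    """Yield (line_number, line) for lines start..stop of a gzipped text
    file, decompressing only as far as needed - the same idea as
    gunzip -c ... | awk, but stopping early once the window is printed."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for number, line in itertools.islice(enumerate(fh, 1), start - 1, stop):
            yield number, line.rstrip("\n")

# for n, text in window("data/latest-all.ttl.gz", 595490, 595495):
#     print(n, text)
```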

Thanks for your patience on this ticket. In the future we would like to make it easier to set up local versions of WDQS, which will address the specific issues in this ticket. I am closing this one for now.

Could you please point to the ticket/project where progress on making it easier to set up local versions of WDQS is tracked?