Broken output from Munger tool
Closed, Resolved, Public

Description

The Munge tool for dumps seems to output wrong data like this:

entity:Q17503118 rdfs:label "316th Division"@en , entity:Q17503118 ;
        schema:version 150541670 ;
        schema:dateModified "2014-08-12T12:06:10Z"^^xsd:dateTime .

As you can see, it claims the entity is a label for itself, which is wrong. It does not seem to happen always, only for some entities (e.g. it doesn't happen for Q1).

Event Timeline

Smalyshev assigned this task to Manybubbles.
Smalyshev raised the priority of this task from to High.
Smalyshev updated the task description.
Smalyshev subscribed.

Reproducible with input

and this command line:

java -cp target/wikidata-query-tools-0.0.1-SNAPSHOT-jar-with-dependencies.jar org.wikidata.query.rdf.tool.Munge --from in.ttl --to out --labelLanguage en --labelLanguage de --singleLabel en --singleLabel de --skipSiteLinks --chunkSize 100000

I think it happens because singleLabelModeWorkForDescription and singleLabelModeWorkForLabel both generate this:

return new StatementImpl(entityUriImpl, new URIImpl(RDFS.LABEL), entityUriImpl);

This is obviously wrong for the description, and I suspect it is also wrong for the label. So if a label or description is missing, this is what happens.
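The suspected mechanism can be modelled in a few lines. This is only a stand-alone sketch, not the actual Munge code: the class `SingleLabelBugSketch`, its `Triple` helper and the method names are hypothetical, triples are plain strings rather than Sesame statements, and it assumes for illustration that the entity has an English label but no description in the requested languages.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the munger's single-label handling; not the real Munge code.
class SingleLabelBugSketch {
    static final String RDFS_LABEL = "rdfs:label";
    static final String SCHEMA_DESCRIPTION = "schema:description";

    /** Minimal triple; the object is kept as a plain string for illustration. */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        @Override
        public String toString() {
            return subject + " " + predicate + " " + object + " .";
        }
    }

    /** Returns the first value found in the requested languages, else the suspect fallback. */
    static Triple singleValueOrFallback(String entity, String predicate,
            Map<String, String> valuesByLanguage, List<String> languages) {
        for (String lang : languages) {
            String value = valuesByLanguage.get(lang);
            if (value != null) {
                return new Triple(entity, predicate, "\"" + value + "\"@" + lang);
            }
        }
        // Suspected bug: labels and descriptions share one fallback,
        // which makes the entity an rdfs:label of itself.
        return new Triple(entity, RDFS_LABEL, entity);
    }

    static List<Triple> munge(String entity, Map<String, String> labels,
            Map<String, String> descriptions, List<String> languages) {
        List<Triple> out = new ArrayList<>();
        out.add(singleValueOrFallback(entity, RDFS_LABEL, labels, languages));
        out.add(singleValueOrFallback(entity, SCHEMA_DESCRIPTION, descriptions, languages));
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> labels = new HashMap<>();
        labels.put("en", "316th Division");
        Map<String, String> descriptions = new HashMap<>(); // assumed: no en or de description
        for (Triple t : munge("entity:Q17503118", labels, descriptions, Arrays.asList("en", "de"))) {
            System.out.println(t);
        }
    }
}
```

Under these assumptions the model emits two `rdfs:label` values for the entity, the real English label and the entity itself, which matches the shape of the broken output in the description.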

I also note that the German label and description are dropped, even though skos:altLabel is preserved for both en and de. Looks fishy.

It's how I designed single label mode, but I now think it's stupid. The point
of single label was that you could always get a single label for a thing
and it's in one of the languages you ask for.

Instead I think it shouldn't ever add a label or description if there isn't
one in the language.
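A minimal sketch of that proposed behavior, under stated assumptions: the class `SingleLabelFixSketch` and the method `singleLabel` are hypothetical names, labels are modelled as a plain language-to-text map, and the returned literal is not escaped. It is not the change that was actually committed. The first requested language that has a label wins, and when none has one, nothing is emitted.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of single-label selection without an invented fallback.
class SingleLabelFixSketch {
    /**
     * Returns the label for the first requested language that has one, as a
     * language-tagged literal, or Optional.empty() when none of them does.
     */
    static Optional<String> singleLabel(Map<String, String> labelsByLanguage,
            List<String> languages) {
        for (String lang : languages) {
            String value = labelsByLanguage.get(lang);
            if (value != null) {
                return Optional.of("\"" + value + "\"@" + lang);
            }
        }
        return Optional.empty(); // no statement is added for this entity
    }

    public static void main(String[] args) {
        Map<String, String> labels = new HashMap<>();
        labels.put("fr", "exemple"); // made-up label, only in a language not asked for
        Optional<String> picked = singleLabel(labels, Arrays.asList("en", "de"));
        System.out.println(picked.isPresent() ? picked.get() : "no label emitted");
    }
}
```

The same selection would apply to descriptions, so a missing description could no longer surface as an extra label.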

I think in this case we shouldn't mess with the data. Rather, we'd have a function like bestLabel(item, languages), e.g. bestLabel(wd:Q123, 'en', 'de', 'ru', 'es'), which would try to find a label in any of those languages and, if there is none, just return something like 'Q123'. The thing is, not all queries even need labels... and for those that do, we cannot predict what people would actually want there: simple lookup, hierarchy lookup, which languages, etc.

I think throwing out label data that we don't want is ok, but adding is not good as it may be confused with actual data.
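The bestLabel idea can be sketched outside the query engine. This is an illustration only: no such function exists in the tool, the extra `labelsByLanguage` parameter is a hypothetical stand-in for looking labels up in the store, and the class name `BestLabelSketch` is made up. It tries the languages in the order given and falls back to the entity ID, leaving the stored data untouched.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical query-time helper; the stored data is never modified.
class BestLabelSketch {
    /** Returns the first label found in the given language order, else the entity ID. */
    static String bestLabel(String entityId, Map<String, String> labelsByLanguage,
            String... languages) {
        for (String lang : languages) {
            String label = labelsByLanguage.get(lang);
            if (label != null) {
                return label;
            }
        }
        return entityId; // e.g. "Q123" when no requested language has a label
    }

    public static void main(String[] args) {
        Map<String, String> labels = new HashMap<>();
        labels.put("de", "Beispiel"); // made-up label for illustration
        System.out.println(bestLabel("Q123", labels, "en", "de", "ru", "es"));
        System.out.println(bestLabel("Q123", new HashMap<String, String>(), "en", "de"));
    }
}
```

Because the fallback happens at lookup time, the ID is never written into the data as if it were a real label.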

> I think in this case we shouldn't mess with the data. Rather, we'd have a function like bestLabel(item, languages), e.g. bestLabel(wd:Q123, 'en', 'de', 'ru', 'es'), which would try to find a label in any of those languages and, if there is none, just return something like 'Q123'. The thing is, not all queries even need labels... and for those that do, we cannot predict what people would actually want there: simple lookup, hierarchy lookup, which languages, etc.

That'd be nice, but the reality right now is that bestLabel would be super slow without some work in Blazegraph. The singleLabel option is optional: you could just not specify it and we'd leave the data alone.

> I think throwing out label data that we don't want is ok, but adding is not good as it may be confused with actual data.

Yeah. +1. I'll fix it in a few minutes.

Smalyshev moved this task from Incoming to Done on the Wikidata-Query-Service board.