Broken output from Munger tool
Closed, Resolved · Public

Assigned To

Authored By

	Smalyshev
	Apr 10 2015, 11:13 PM

Description

The Munge tool for dumps seems to output wrong data like this:

entity:Q17503118 rdfs:label "316th Division"@en , entity:Q17503118 ;
        schema:version 150541670 ;
        schema:dateModified "2014-08-12T12:06:10Z"^^xsd:dateTime .

As you can see, it claims the entity is a label for itself, which is wrong. It does not always happen, only for some entities (e.g. it doesn't happen for Q1).

Event Timeline

Smalyshev created this task. Apr 10 2015, 11:13 PM

Smalyshev assigned this task to Manybubbles.

Smalyshev raised the priority of this task from to High.

Smalyshev updated the task description.

Smalyshev added a project: Wikidata-Query-Service.

Smalyshev subscribed.

Restricted Application added a subscriber: Aklapper. Apr 10 2015, 11:13 PM

Smalyshev set Security to None. Apr 10 2015, 11:14 PM

Smalyshev added a project: Discovery-ARCHIVED.

Smalyshev added a subscriber: Manybubbles.

Reproducible with input

in.ttl (1 KB)

and this command line:

java -cp target/wikidata-query-tools-0.0.1-SNAPSHOT-jar-with-dependencies.jar org.wikidata.query.rdf.tool.Munge --from in.ttl --to out --labelLanguage en --labelLanguage de --singleLabel en --singleLabel de --skipSiteLinks --chunkSize 100000

I think it happens because singleLabelModeWorkForDescription and singleLabelModeWorkForLabel both generate this:

return new StatementImpl(entityUriImpl, new URIImpl(RDFS.LABEL), entityUriImpl);

This is obviously wrong for the description, and I suspect it is also wrong for the label. So if the label or description is missing, this is what happens.

Also, I note that the German label and description are dropped, even though skos:altLabel is preserved for both en and de. Looks fishy.

It's how I designed single label mode, but I now think it's stupid. The point
of single label was that you could always get a single label for a thing,
and it's in one of the languages you ask for.

Instead, I think it should never add a label or description if there isn't
one in the requested language.
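
To make the proposed behavior concrete, here is a minimal sketch. It is not the actual Munger code: it assumes labels are modeled as a plain language-to-text map rather than RDF statements, and the names SingleLabelSketch and pickLabelLanguage are hypothetical. It picks the first requested language that has a label and reports nothing when none does, so the caller can simply omit the rdfs:label triple.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration only; the real Munger works on RDF statements.
final class SingleLabelSketch {
    private SingleLabelSketch() {}

    /**
     * Returns the first preferred language that has a label.
     * Returns Optional.empty() when none of them has one, so the caller
     * emits no label at all instead of inventing a placeholder.
     */
    static Optional<String> pickLabelLanguage(
            Map<String, String> labelsByLanguage, List<String> preferredLanguages) {
        for (String language : preferredLanguages) {
            if (labelsByLanguage.containsKey(language)) {
                return Optional.of(language);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, String> labels = new HashMap<>();
        labels.put("en", "316th Division");

        // English is the first requested language with a label.
        System.out.println(pickLabelLanguage(labels, Arrays.asList("de", "en"))); // Optional[en]
        // No German label exists: nothing is added.
        System.out.println(pickLabelLanguage(labels, Arrays.asList("de"))); // Optional.empty
    }
}
```

With this shape, labels in other languages are dropped, but no triple is ever produced that was not in the input.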

I think in this case we shouldn't mess with the data. Rather, we'd have something like a function bestLabel(item, languages), e.g. bestLabel(wd:Q123, 'en', 'de', 'ru', 'es'), which would try to find a label in any of the given languages and, failing that, just return something like 'Q123'. The thing is, not all queries even need labels, and for those that do, we cannot predict what people would actually want there: simple lookup, hierarchy lookup, which languages, etc.
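
As a rough sketch of the bestLabel idea, assuming the labels of an item are already available as a language-to-text map (in the real service this would have to be a query-time function, which does not exist here; BestLabelSketch and its signature are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the proposed lookup; not an existing API.
final class BestLabelSketch {
    private BestLabelSketch() {}

    /**
     * Returns the label in the first listed language that has one,
     * falling back to the entity ID (e.g. "Q123") when none matches.
     */
    static String bestLabel(
            String entityId, Map<String, String> labelsByLanguage, String... languages) {
        for (String language : languages) {
            String label = labelsByLanguage.get(language);
            if (label != null) {
                return label;
            }
        }
        return entityId;
    }

    public static void main(String[] args) {
        Map<String, String> labels = new HashMap<>();
        labels.put("en", "316th Division");

        System.out.println(bestLabel("Q17503118", labels, "de", "en")); // 316th Division
        System.out.println(bestLabel("Q17503118", labels, "ru", "es")); // Q17503118
    }
}
```

The fallback happens only in the returned value, so the stored data is left untouched.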

I think throwing out label data that we don't want is ok, but adding is not good as it may be confused with actual data.

In T95779#1201191, @Smalyshev wrote:

I think in this case we shouldn't mess with the data. Rather, we'd have something like a function bestLabel(item, languages), e.g. bestLabel(wd:Q123, 'en', 'de', 'ru', 'es'), which would try to find a label in any of the given languages and, failing that, just return something like 'Q123'. The thing is, not all queries even need labels, and for those that do, we cannot predict what people would actually want there: simple lookup, hierarchy lookup, which languages, etc.

That'd be nice, but the reality right now is that bestLabel would be super slow without some work in Blazegraph. The singleLabel option is optional: you could just not specify it and we'd leave the data alone.

I think throwing out label data that we don't want is ok, but adding is not good as it may be confused with actual data.

Yeah. +1. I'll fix it in a few minutes.

Manybubbles moved this task from In progress to Done on the Discovery-ARCHIVED board. Apr 13 2015, 5:00 PM

Smalyshev closed this task as Resolved. Apr 19 2015, 11:56 PM

Smalyshev moved this task from Incoming to Done on the Wikidata-Query-Service board.
