Unable to process a particular wikibase dump using munge.sh (localised namespace name)
Open, MediumPublic

Description

Symptoms:

  • The script uses all CPU for hours, without producing any output.

Steps to reproduce:

  1. Install wdqs using wikibase-docker (version 0.3.10)
  2. docker-compose exec wdqs mkdir -p data/split
  3. time docker-compose exec wdqs curl -L https://nimiarkisto.fi/dumps/nimiarkisto.fi-CC-BY-4.0_2020-09-09.rdf.bz2 -o data/dump.rdf.bz2
  4. time docker-compose exec wdqs ./munge.sh -c 50000 -f data/dump.rdf.bz2 -d data/split -l en,fi,sv -s

I have checked that this is not just slow. With the Wikidata Lexemes dump it does write output to the log and to the split files. With the Nimiarkisto dump I only get:

root@nimiarkisto-qs:~/nimiarkisto-qs# time docker-compose exec wdqs ./munge.sh -c 5000 -f data/dump.rdf -d data/split -l en,fi,sv -s
#logback.classic pattern: %d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
10:03:13.441 [main] INFO  org.wikidata.query.rdf.tool.Munge - Switching to data/split/wikidump-000000001.ttl.gz
^C

And the file data/split/wikidump-000000001.ttl.gz is empty.

Is it possible to enable more verbose logging to debug this further?
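
(A minimal sketch of one way to try this, assuming the Munge tool reads a standard logback configuration, which the "#logback.classic pattern" line above suggests; whether munge.sh exposes a hook for extra JVM options varies by version, so editing the java invocation inside the script may be needed:)

# Write a DEBUG-level logback configuration (path is illustrative)
cat > /tmp/logback-debug.xml <<'EOF'
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <!-- DEBUG root level makes the munger log far more of what it is doing -->
  <root level="DEBUG">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>
EOF
# Then start the JVM with the standard logback property, e.g.:
#   java -Dlogback.configurationFile=/tmp/logback-debug.xml ... org.wikidata.query.rdf.tool.Munge ...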

Event Timeline


I made a copy of the Wikibase wiki, updated it to 1.33 (yay for 22 hours spent on database schema updates), made a new RDF dump (another 10 hours), and spent half a day in a debugger trying to understand the Munge Java code. It is very complicated, so I did not learn much, other than that parsing seems to work but somehow the writer never seems to be called to write out the processed entities. I am feeling very frustrated now, as this is blocking a project I am hoping to complete this year.

Oh, and I also tried version 0.3.40 of the wdqs image, but it made no difference. Why is that not used by default, by the way?

Many thanks to @dcausse, who identified two issues:

Localised namespace in wdata prefix:

@prefix wdata: <https://nimiarkisto.fi/wiki/Toiminnot:EntityData/> .

I worked around this with sed 's/Toiminnot:EntityData/Special:EntityData/g'. This should not be necessary and should be fixed.
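
A minimal sketch of that workaround applied to the compressed dump before munging (file names are illustrative; this assumes munge.sh handles .gz input as it does for the standard Wikidata .ttl.gz dumps):

# Decompress, rewrite the localized namespace, recompress, then munge
bzcat data/dump.rdf.bz2 \
  | sed 's/Toiminnot:EntityData/Special:EntityData/g' \
  | gzip > data/dump-fixed.rdf.gz
./munge.sh -c 50000 -f data/dump-fixed.rdf.gz -d data/split -l en,fi,sv -s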

The other issue was the use of https, which was fixed by passing -- -U https://nimiarkisto.fi to munge.sh.
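
Combined with the reproduction command from the description, the full invocation would then look something like:

./munge.sh -c 50000 -f data/dump.rdf.bz2 -d data/split -l en,fi,sv -s -- -U https://nimiarkisto.fi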

Nikerabbit renamed this task from Unable to process a particular wikibase dump using munge.sh to Unable to process a particular wikibase dump using munge.sh (localised namespace name). Oct 2 2020, 10:20 AM
Gehel triaged this task as Medium priority. Oct 12 2020, 3:28 PM
Gehel moved this task from Incoming to Small Tasks on the Wikidata-Query-Service board.

How I worked around this in the end:

  1. Set $wgWBRepoSettings['conceptBaseUri'] = 'http://nimiarkisto.fi/entity/';
  2. Explicitly set the conceptUri to http using https://github.com/wmde/wikibase-docker/pull/140
  3. Unlocalise the special page name using this snippet (for Finnish; it needs adaptation for other languages); a quick way to verify the regenerated dump follows the snippet:
// Rewrite links to the localized alias of Special:EntityData so that URLs use
// the canonical English name ('Toiminnot' is the Finnish localisation of the
// 'Special' namespace).
$wgHooks['GetLocalURL'][] = function ( &$title, &$url, $query ) {
	if ( !$title->isExternal() && $query === '' && $title->getPrefixedText() === 'Toiminnot:EntityData' ) {
		$url = str_replace( '/wiki/Toiminnot:', '/wiki/Special:', $url );
	}
};
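
To sanity-check a regenerated dump before munging, one can inspect the prefix declaration (bzgrep assumed available; adjust for the dump's compression):

# The wdata prefix should now use Special:EntityData, not Toiminnot:EntityData
bzgrep -m 1 '@prefix wdata:' data/dump.rdf.bz2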

Cross-linking https://github.com/wmde/wikibase-docker/pull/140, which makes it actually possible to use http-protocol concept URIs in wikibase-docker.

@dcausse is it realistic to expect this bug to be fixed by Wikimedia in the near future? Several institutions that we are helping with Wikibase adoption are running into it.

I see two ways to fix this:

  • wikibase should always use Special:EntityData and not the localized page name for its RDF output (similar to the workaround suggested)
  • wdqs accepts new options for the munger/updater/loader to indicate the localized name it has to look for (sketched below)

I'm fine either way; the second option would require wdqs admins to configure an additional option.
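
For illustration only, the second option could look roughly like this; the --entity-data-page flag is hypothetical and does not exist in munge.sh:

# Hypothetical option naming the localized alias of Special:EntityData
# that the munger should accept (NOT currently implemented):
./munge.sh -c 50000 -f data/dump.rdf.bz2 -d data/split -l en,fi,sv -s \
  -- --entity-data-page Toiminnot:EntityData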

@Addshore do you have an opinion on this?

It would be cool if the munger's validation behaviour were configurable; that could perhaps remedy some weird behaviour overall, see T274354

@dcausse: will anything bad happen if the URL comparison that is currently failing is removed? And is that applicable to people running their own Wikibase?

In this case, Special:EntityData in the dump is a technical identifier and should not be translated. We should have a way to create dumps in a language agnostic way.

@Addshore: would it be possible to fix this on the Wikibase side? Having an option to standardize dumps, independent of the language?

Addshore added a subscriber: Samantha_Alipio_WMDE.

> wikibase should always use Special:EntityData and not the localized page name for its RDF output (similar to the workaround suggested)

> In this case, Special:EntityData in the dump is a technical identifier and should not be translated. We should have a way to create dumps in a language agnostic way.

Seems to make sense to me, and sounds like the easiest solution too!
I'll check this with @Samantha_Alipio_WMDE

Change 1010885 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/extensions/Wikibase@master] RDF: use unlocalized name of Special:EntityData in data URIs

https://gerrit.wikimedia.org/r/1010885

Change #1010885 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] RDF: use unlocalized name of Special:EntityData in data URIs

https://gerrit.wikimedia.org/r/1010885