dumpRDF.php generates a hash-based conceptUri
Closed, Resolved · Public · 5 Estimated Story Points

Description

Motivation
A user running their own Wikibase instance may want to dump RDF from it, for example to export the data to another system or to back it up.

Problem
Running the RDF export command described in the repo

docker-compose exec wikibase php ./extensions/Wikibase/repo/maintenance/dumpRdf.php

yields conceptUris in the ttl output that are based on the Docker container's hash ID. See the output in the original description for an example.

Suggested Solution

  • Allow configuring the domain for conceptURIs:

https://phabricator.wikimedia.org/T227643#5320426

The solution is to set $wgServer to the proper value in LocalSettings.php.
To do that, add an entry such as ${DOLLAR}wgServer = "${SERVER}"; to the LocalSettings.php.template file as described here,
and add SERVER as an environment variable in the compose file.

  • Check that changes do not break WDQS updater internally
NOTE: The suggested solution above fixed the problem for the author. We want to include this in the image itself so that it is fixed for all future users.
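
As a rough illustration of what the template entry suggested above is meant to expand to, the following sketch mimics the substitution step with Python's string.Template. It assumes the image fills in LocalSettings.php.template by replacing ${NAME} placeholders with environment values, with DOLLAR set to a literal $. The server URL is a made-up example, not a value from this task.

```python
from string import Template

# The entry suggested above for LocalSettings.php.template.
TEMPLATE_LINE = '${DOLLAR}wgServer = "${SERVER}";'


def render_template_line(line: str, env: dict) -> str:
    """Replace ${NAME} placeholders with values from env (an envsubst-like step)."""
    return Template(line).substitute(env)


# Assumed values: DOLLAR is a literal "$", SERVER is a hypothetical public URL.
example_env = {"DOLLAR": "$", "SERVER": "http://wikibase.example.org"}
rendered = render_template_line(TEMPLATE_LINE, example_env)
print(rendered)  # $wgServer = "http://wikibase.example.org";
```

If SERVER is missing from the environment, substitute() raises KeyError, which makes a forgotten compose variable visible early in this sketch.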

Original Description

Running the RDF export command described in the repo

docker-compose exec wikibase php ./extensions/Wikibase/repo/maintenance/dumpRdf.php

I get a hash-based conceptUri in the ttl that is based on the docker container hash ID. e.g.

Dumping entities of type item, property
Dumping shard 0/1
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix wikibase: <http://wikiba.se/ontology-beta#> .
@prefix wds: <http://80a2076ec8c0/entity/statement/> .
@prefix wdata: <http://80a2076ec8c0/wiki/Special:EntityData/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix schema: <http://schema.org/> .
@prefix cc: <http://creativecommons.org/ns#> .
@prefix geo: <http://www.opengis.net/ont/geosparql#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix wdref: <http://80a2076ec8c0/reference/> .
@prefix wdv: <http://80a2076ec8c0/value/> .
@prefix wd: <http://80a2076ec8c0/entity/> .
@prefix wdt: <http://80a2076ec8c0/prop/direct/> .
@prefix wdtn: <http://80a2076ec8c0/prop/direct-normalized/> .
@prefix p: <http://80a2076ec8c0/prop/> .
@prefix ps: <http://80a2076ec8c0/prop/statement/> .
@prefix psv: <http://80a2076ec8c0/prop/statement/value/> .
@prefix psn: <http://80a2076ec8c0/prop/statement/value-normalized/> .
@prefix pq: <http://80a2076ec8c0/prop/qualifier/> .
@prefix pqv: <http://80a2076ec8c0/prop/qualifier/value/> .
@prefix pqn: <http://80a2076ec8c0/prop/qualifier/value-normalized/> .
@prefix pr: <http://80a2076ec8c0/prop/reference/> .
@prefix prv: <http://80a2076ec8c0/prop/reference/value/> .
@prefix prn: <http://80a2076ec8c0/prop/reference/value-normalized/> .
@prefix wdno: <http://80a2076ec8c0/prop/novalue/> .

...

I tried to find some parameter to indicate my subdomain in the dump script but couldn't. Any help on how to set this up would be greatly appreciated.
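
One way to spot this problem in a dump is to look at the hosts used in the @prefix header. The sketch below is only an illustration: the function names are hypothetical, and the 12-character hexadecimal test is a heuristic based on the default Docker container hostname, not something Wikibase provides.

```python
import re
from urllib.parse import urlparse

PREFIX_RE = re.compile(r"^@prefix\s+(\w+):\s+<([^>]+)>\s*\.\s*$")


def prefix_hosts(ttl_text: str) -> dict:
    """Map each @prefix name to the host part of its namespace URI."""
    hosts = {}
    for line in ttl_text.splitlines():
        match = PREFIX_RE.match(line.strip())
        if match:
            name, uri = match.groups()
            hosts[name] = urlparse(uri).netloc
    return hosts


def looks_like_container_id(host: str) -> bool:
    """Heuristic: a 12-character hex string, the default Docker hostname form."""
    return re.fullmatch(r"[0-9a-f]{12}", host) is not None


# Three header lines taken from the dump output above.
sample = """\
@prefix wikibase: <http://wikiba.se/ontology-beta#> .
@prefix wd: <http://80a2076ec8c0/entity/> .
@prefix wdt: <http://80a2076ec8c0/prop/direct/> .
"""

suspicious = {name: host for name, host in prefix_hosts(sample).items()
              if looks_like_container_id(host)}
print(suspicious)
```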

Event Timeline

Restricted Application added a subscriber: Aklapper.

To clarify, the QueryService syncs data correctly with the subdomain I set in the docker-compose file as a namespace.

It is only the dumpRdf.php script that is not using the correct namespace.

Jimkont claimed this task.

The solution is to set $wgServer to the proper value in LocalSettings.php.

To do that, add an entry such as ${DOLLAR}wgServer = "${SERVER}"; to the LocalSettings.php.template file as described here,
and add SERVER as an environment variable in the compose file.

Addshore added a project: Wikidata-Campsite.
Addshore subscribed.

Reopening, as we will want to add this to the LocalSettings file that comes as part of the image :)

alaa_wmde updated the task description.

The solution indeed breaks WDQS updater:

wdqs-updater_1     | 17:08:57.869 [main] INFO  o.w.q.r.t.change.RecentChangesPoller - Got 1 changes, from Q2@8@20190904170851|7 to Q2@8@20190904170851|7
wdqs-updater_1     | 17:08:58.060 [update 1] WARN  org.wikidata.query.rdf.tool.Updater - Contained error syncing.  Giving up on Q2
wdqs-updater_1     | org.wikidata.query.rdf.tool.rdf.Munger$BadSubjectException: Unrecognized subjects:  [http://localhost:8181/entity/Q2, http://localhost:8181/wiki/Special:EntityData/Q2, http://localhost:8181/entity/statement/Q2-be933bd3-4253-89ed-e5d5-dc0a96b5732d].  Expected only sitelinks and subjects starting with http://wikibase.svc/wiki/Special:EntityData/ and http://wikibase.svc/entity/
wdqs-updater_1     | 	at org.wikidata.query.rdf.tool.rdf.Munger$MungeOperation.finishCommon(Munger.java:941)
wdqs-updater_1     | 	at org.wikidata.query.rdf.tool.rdf.Munger$MungeOperation.munge(Munger.java:493)
wdqs-updater_1     | 	at org.wikidata.query.rdf.tool.rdf.Munger.munge(Munger.java:148)
wdqs-updater_1     | 	at org.wikidata.query.rdf.tool.rdf.Munger.mungeWithValues(Munger.java:182)
wdqs-updater_1     | 	at org.wikidata.query.rdf.tool.Updater.handleChange(Updater.java:371)
wdqs-updater_1     | 	at org.wikidata.query.rdf.tool.Updater.lambda$handleChanges$0(Updater.java:236)
wdqs-updater_1     | 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
wdqs-updater_1     | 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
wdqs-updater_1     | 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
wdqs-updater_1     | 	at java.lang.Thread.run(Thread.java:748)
wdqs-updater_1     | 17:08:58.111 [main] INFO  org.wikidata.query.rdf.tool.Updater - Polled up to 2019-09-04T17:08:51Z at (0.0, 0.0, 0.0) updates per second and (0.0, 3.9, 34014.6) milliseconds per second
wdqs-updater_1     | 17:08:58.128 [main] INFO  o.w.q.r.t.change.RecentChangesPoller - Got no real changes
Ladsgroup subscribed.

Unfortunately, this is bigger than it looks, because the proposed solution really breaks the WDQS updater, rendering WDQS in Docker completely useless. I am putting it back for re-estimation and possible solutions.
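
The failure in the log above comes down to a prefix mismatch between the subjects MediaWiki emits and the host the updater expects. The following is a simplified illustration of that comparison, not the actual Munger code; the function name is hypothetical and sitelinks are ignored.

```python
def unrecognized_subjects(subjects, expected_base):
    """Return subjects that do not start with the prefixes the updater expects."""
    base = expected_base.rstrip("/")
    allowed = (f"{base}/wiki/Special:EntityData/", f"{base}/entity/")
    return [s for s in subjects if not s.startswith(allowed)]


# Subjects copied from the BadSubjectException message above.
subjects_from_log = [
    "http://localhost:8181/entity/Q2",
    "http://localhost:8181/wiki/Special:EntityData/Q2",
    "http://localhost:8181/entity/statement/Q2-be933bd3-4253-89ed-e5d5-dc0a96b5732d",
]

# With the updater expecting wikibase.svc, all three subjects are rejected.
print(unrecognized_subjects(subjects_from_log, "http://wikibase.svc"))
# With the updater expecting the host that MediaWiki actually uses, none are.
print(unrecognized_subjects(subjects_from_log, "http://localhost:8181"))
```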

What config was that updater running with?
Why is it getting subjects that reference localhost, like http://localhost:8181/entity/Q2?

For example, on the Wikibase registry, if the wikibase service is accessible on the Docker network via its main domain (if it is not already, see the snippet below):

networks:
  default:
    aliases:
     - wikibase.svc
     - wikibase-registry.wmflabs.org

And the updater accesses the site using the full domain:

environment:
  WIKIBASE_HOST: wikibase-registry.wmflabs.org

Updates should work just fine.

  • MediaWiki will use this domain as the main site domain.
  • The generated RDF will use that domain.
  • The updater will be expecting that domain.
  • Blazegraph itself doesn't actually care what data you feed it in the end.

Addshore moved this task from Unsorted 💣 to Watching 👀 on the User-Addshore board.

So, this issue is covered in https://addshore.com/2019/11/changing-the-concept-uri-of-an-existing-wikibase-with-data/
See the section "Dumping RDF from Wikibase".

I don't think there is anything to change in the docker images here?
Unless we change wgServer to be settable via an env var, but right now the preferred way to do that would be to alter your LocalSettings.

Going to remove the campsite for now, but will keep an eye here for thoughts.

Addshore moved this task from ready to go to monitoring on the Wikidata board.
Addshore moved this task from Backlog to Misc on the Wikibase-Docker-2017+ board.

Thanks @Addshore

I read the post, very good documentation.

The --server option, as in dumpRdf.php --server http://somFancyNewLocation.foo --output /tmp/rdfOutput, seems like the best approach, but when I was looking into the issue it was not part of the Wikibase API, which is why I resorted to the wgServer option.
I still cannot find it in https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/master/repo/maintenance/dumpRdf.php. Is it just undocumented but working?

Addshore claimed this task.

That's because it is a MediaWiki setting :)

https://m.mediawiki.org/wiki/Manual:$wgServer
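
Putting the pieces of this thread together, a dump with an explicit server could be invoked roughly as sketched below. This only builds and prints the command line. The helper name and the example URL are hypothetical; the script path and the --server and --output options are the ones quoted earlier in the thread.

```python
import shlex

DUMP_SCRIPT = "./extensions/Wikibase/repo/maintenance/dumpRdf.php"


def dump_rdf_command(server, output):
    """Build the docker-compose invocation as an argument list."""
    return ["docker-compose", "exec", "wikibase", "php", DUMP_SCRIPT,
            "--server", server, "--output", output]


# Hypothetical server URL; the output path is inside the wikibase container.
cmd = dump_rdf_command("http://wikibase.example.org", "/tmp/rdfOutput")
print(shlex.join(cmd))
```

On a host where the compose stack is running, the list could be passed to subprocess.run(cmd, check=True) instead of being printed.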