
Include information about Wikidata dumps in Wikidata query service
Closed, ResolvedPublic

Description

The machine-readable description of Wikidata dumps in DCAT-AP is only provided as RDF/XML and is poorly documented, which limits its usefulness. Please automatically import the RDF file into the Wikidata Query Service so we can get a list of current dumps, for instance with this SPARQL query:

PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct: <http://purl.org/dc/terms/>

# List download URL, publication date, and byte size of every JSON dump
# registered in the Wikidata DCAT-AP catalog.
SELECT ?url ?date ?size WHERE {
  <https://www.wikidata.org/about#catalog> dcat:dataset ?dump .
  ?dump dcat:distribution [
    dct:format "application/json" ;
    dcat:downloadURL ?url ;
    dct:issued ?date ;
    dcat:byteSize ?size
  ] .
}

The only open question is whether to keep information about dumps that have been removed from https://dumps.wikimedia.org/wikidatawiki/entities/. I don't think so, but DCAT information from other dump hosts such as the Internet Archive (see their list of Wikidata dumps) should be included as well.

Event Timeline

nichtich created this task. Oct 25 2017, 9:33 AM
Restricted Application added projects: Discovery, Internet-Archive. Oct 25 2017, 9:33 AM
Restricted Application added a subscriber: Aklapper.
nichtich updated the task description. Oct 25 2017, 9:33 AM

There's a bit of a problem here because technically the WDQS dataset is not the same as any dump. It is live-updated, unlike the dumps, which are produced once a week, and the WDQS dataset could have been loaded from a dump months ago and live-updated since then, so WDQS has no way of knowing whether any dumps were produced after that, or where they can be found. We could just import https://dumps.wikimedia.org/wikidatawiki/entities/dcatap.rdf, but I am not sure how useful that would be, or what the difference is between that and just having the file for download. Is it just that we could query it over SPARQL if we load it?
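For what it's worth, the import step itself would be mechanically simple. A minimal sketch in SPARQL 1.1 Update, assuming a writable endpoint that is allowed to fetch the remote RDF/XML; the target graph IRI below is a hypothetical placeholder, not an agreed convention:

# Sketch only: assumes a writable SPARQL 1.1 Update endpoint that may
# fetch remote RDF/XML. The graph IRI is a hypothetical placeholder.
# Drop any previous copy, then re-load the catalog into a named graph.
DROP SILENT GRAPH <http://example.org/graph/dcatap> ;
LOAD <https://dumps.wikimedia.org/wikidatawiki/entities/dcatap.rdf>
  INTO GRAPH <http://example.org/graph/dcatap>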

Yes, being able to query the information from dcatap would increase its usability a lot, because WDQS is integrated into the Wikidata tool ecosystem. Having to download, parse, and evaluate the RDF file on your own requires RDF tooling. There is no statement that the WDQS content and the described dumps are from the same date, so I don't see the problem. I think dcatap could be added and updated as a named graph. Maybe this is related to Wikistats; I would also welcome a dedicated SPARQL endpoint with information about dumps and statistics. Such an endpoint could be included into WDQS via federated queries, but I don't want to open a can of worms if there is a simple solution.
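To illustrate the federated option: a minimal sketch, where the hypothetical endpoint https://dump-stats.example.org/sparql stands in for whatever service would actually host the dump metadata:

PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct: <http://purl.org/dc/terms/>

# Hypothetical: <https://dump-stats.example.org/sparql> is a placeholder
# for a dedicated dump/statistics endpoint reached from WDQS via SERVICE.
SELECT ?url ?date WHERE {
  SERVICE <https://dump-stats.example.org/sparql> {
    ?dump dcat:distribution [
      dcat:downloadURL ?url ;
      dct:issued ?date
    ] .
  }
}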

Smalyshev triaged this task as Normal priority. Nov 4 2017, 12:57 AM
Smalyshev moved this task from Backlog to Next on the User-Smalyshev board. Dec 5 2017, 7:58 PM
Smalyshev moved this task from Next to Waiting/Blocked on the User-Smalyshev board. Dec 7 2017, 8:09 PM

Change 399954 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/puppet@production] [WIP] Add loading DCAT-AP data into dcatap namespace on WDQS

https://gerrit.wikimedia.org/r/399954

Smalyshev moved this task from Waiting/Blocked to In review on the User-Smalyshev board.

Change 399954 merged by Gehel:
[operations/puppet@production] Add loading DCAT-AP data into dcatap namespace on WDQS

https://gerrit.wikimedia.org/r/399954

Smalyshev updated the task description. Jan 5 2018, 6:44 PM
Smalyshev updated the task description. Jan 5 2018, 6:51 PM
Smalyshev closed this task as Resolved. Jan 11 2018, 8:19 PM

The data is now available in the dcatap namespace on WDQS. See: https://www.mediawiki.org/wiki/Wikidata_query_service/User_Manual#DCAT-AP
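For reference, a query along the lines of the one in the description should now work against that namespace. A sketch, assuming the endpoint URL is the one given in the linked manual (e.g. https://query.wikidata.org/bigdata/namespace/dcatap/sparql; treat that URL as an assumption and check the manual):

PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct: <http://purl.org/dc/terms/>

# Run against the dcatap namespace endpoint (see the linked manual for
# the exact URL). Lists the JSON dump files, newest first.
SELECT ?url ?date ?size WHERE {
  <https://www.wikidata.org/about#catalog> dcat:dataset ?dump .
  ?dump dcat:distribution [
    dct:format "application/json" ;
    dcat:downloadURL ?url ;
    dct:issued ?date ;
    dcat:byteSize ?size
  ] .
}
ORDER BY DESC(?date)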