
Create a workflow to run Scholia based on dumps
Closed, Declined · Public

Description

Scholia currently runs on the generic Wikidata SPARQL endpoint (WDQS), which causes some of its queries to fail, typically by hitting the endpoint's timeout.
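
For illustration, here is a minimal sketch of that failure mode, assuming the public WDQS endpoint and its roughly 60-second server-side timeout; the query is an illustrative Scholia-style aggregation, not one taken from Scholia's codebase:

```
# Sketch of the failure mode: a heavy Scholia-style aggregation against
# the public WDQS endpoint. WDQS enforces a server-side timeout of about
# 60 seconds; full-graph aggregations like this one tend to exceed it.
import requests

WDQS = "https://query.wikidata.org/sparql"

# Count scholarly articles (Q13442814) per venue (P1433) over the whole graph.
QUERY = """
SELECT ?venue (COUNT(?article) AS ?articles) WHERE {
  ?article wdt:P31 wd:Q13442814 ;
           wdt:P1433 ?venue .
}
GROUP BY ?venue
ORDER BY DESC(?articles)
"""

try:
    r = requests.get(
        WDQS,
        params={"query": QUERY},
        headers={"Accept": "application/sparql-results+json"},
        timeout=70,  # client-side cap, slightly above the server timeout
    )
    r.raise_for_status()
    print(len(r.json()["results"]["bindings"]), "rows")
except requests.exceptions.RequestException as exc:
    # Typically an HTTP error whose body reports a server-side timeout.
    print("query failed:", exc)
```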

It would thus be useful to

  • set up a Scholia instance that can work with Wikidata dumps
  • automate that setup to
    • use the latest dump by default
    • allow users to select a specific dump to base visualizations on (see the dump-selection sketch after this list)
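
A minimal sketch of such dump selection, assuming the standard Wikimedia dumps layout (dated directories under https://dumps.wikimedia.org/wikidatawiki/entities/ plus "latest-*" symlinks); the function name is hypothetical:

```
# Sketch of dump selection, assuming the standard Wikimedia dumps layout:
# dated directories like .../entities/20230501/wikidata-20230501-all.ttl.bz2
# plus "latest-all.ttl.bz2" symlinks at the top level.
BASE = "https://dumps.wikimedia.org/wikidatawiki/entities"

def dump_url(date: str | None = None, flavor: str = "all", fmt: str = "ttl.bz2") -> str:
    """Return the URL of a Wikidata entity dump.

    date -- a dump date such as "20230501", or None for the latest dump.
    """
    if date is None:
        return f"{BASE}/latest-{flavor}.{fmt}"
    return f"{BASE}/{date}/wikidata-{date}-{flavor}.{fmt}"

print(dump_url())            # latest dump (the default)
print(dump_url("20230501"))  # a user-selected dump
```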

Event Timeline

Tonight I'm bringing up the topic of getting a proper Wikidata mirror running at the Search Platform Office Hours (https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours) - maybe you can attend to give background on what you have in mind. I can already offer our Wikidata machine at RWTH Aachen, where we run our "Get your own copy of Wikidata" experiments, as an endpoint.

It is unclear to me what "running from dumps" means. WDQS already runs from dumps, in the sense that the dumps are loaded into the Blazegraph instance backing WDQS and then updated on the fly. Loading a dump currently takes multiple weeks when we are lucky, and multiple months in case of repeated errors (see T323096 or T263110), so just loading from dumps is prohibitively expensive in the current technical context. It is also unclear how this would help queries not fail.

An approach that might be more successful is to provide an endpoint that contains only the subset of the data that is relevant to Scholia. Working with a smaller graph makes it more likely that queries complete in a timely manner. This is an approach we want to investigate as part of T335067.
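
For concreteness, here is a naive two-pass sketch of such subsetting over a truthy N-Triples dump. The scholarly-article seed (P31 = Q13442814) is an assumption about what "relevant for Scholia" means; a real subset would also need venues, authors, and other referenced entities:

```
# Naive two-pass subsetting over a truthy N-Triples dump
# (e.g. latest-truthy.nt.bz2). Pass 1 collects items typed as scholarly
# articles (P31 = Q13442814); pass 2 keeps every triple whose subject is
# in that set. Both passes stream the compressed dump, so this is slow
# but needs little memory beyond the subject set.
import bz2

TYPE_TRIPLE = ("<http://www.wikidata.org/prop/direct/P31> "
               "<http://www.wikidata.org/entity/Q13442814>")

def subset(dump_path: str, out_path: str) -> None:
    keep: set[str] = set()
    with bz2.open(dump_path, "rt", encoding="utf-8") as f:
        for line in f:                        # pass 1: find scholarly articles
            if TYPE_TRIPLE in line:
                keep.add(line.split(" ", 1)[0])   # subject IRI
    with bz2.open(dump_path, "rt", encoding="utf-8") as f, \
         open(out_path, "w", encoding="utf-8") as out:
        for line in f:                        # pass 2: keep their triples
            if line.split(" ", 1)[0] in keep:
                out.write(line)

# subset("latest-truthy.nt.bz2", "scholia-subset.nt")
```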

I'm closing this for now. Feel free to reopen if you think I'm missing the point.

Please reopen - it's linked to the "Get your own copy of Wikidata" Wikimedia-Hackathon task and might need renaming and a statement of the concrete goals. E.g. for the subsetting, this is linked to https://github.com/ad-freiburg/qlever/issues/859.
So having different (and not time-limited) endpoints that support a subset of Wikidata is actually one of the issues. See also https://cr.bitplan.com/index.php/List_of_Queries, where a similar set of queries is needed (but the timeout issue is not as big).
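
Once such a subset endpoint exists (e.g. a QLever instance loaded with the subset), Scholia-style queries could be pointed at it by swapping the endpoint URL. Below is a sketch assuming a local instance that speaks the standard SPARQL protocol; the host, port, and function name are assumptions, and note that, unlike WDQS, a plain endpoint will not inject the wd:/wdt: prefixes, so queries must declare them:

```
# Sketch: run a query against a hypothetical local subset endpoint
# (e.g. QLever) instead of WDQS. Only the URL changes; the query must
# declare its own PREFIXes, since WDQS injects them but a generic
# SPARQL endpoint does not.
import requests

SUBSET_ENDPOINT = "http://localhost:7001"  # hypothetical local instance

def run(query: str, endpoint: str = SUBSET_ENDPOINT) -> list[dict]:
    r = requests.post(
        endpoint,
        data={"query": query},  # SPARQL protocol: form-encoded query
        headers={"Accept": "application/sparql-results+json"},
        timeout=300,
    )
    r.raise_for_status()
    return r.json()["results"]["bindings"]
```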