
Create a workflow to run Scholia based on dumps
Closed, Declined · Public

Description

Scholia currently runs on the generic Wikidata SPARQL endpoint (WDQS), which causes some of its queries to fail, typically by hitting the endpoint's timeout.
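
For illustration, here is a minimal sketch of that failure mode, assuming the public WDQS endpoint and its roughly 60-second server-side timeout; the query is an illustrative Scholia-style aggregation, not one taken from Scholia's codebase:

```
# Sketch of the failure mode: a heavy Scholia-style aggregation against
# the public WDQS endpoint. WDQS enforces a server-side timeout of about
# 60 seconds; full-graph aggregations like this one tend to exceed it.
import requests

WDQS = "https://query.wikidata.org/sparql"

# Count scholarly articles (Q13442814) per venue (P1433) over the whole graph.
QUERY = """
SELECT ?venue (COUNT(?article) AS ?articles) WHERE {
  ?article wdt:P31 wd:Q13442814 ;
           wdt:P1433 ?venue .
}
GROUP BY ?venue
ORDER BY DESC(?articles)
"""

try:
    r = requests.get(
        WDQS,
        params={"query": QUERY},
        headers={"Accept": "application/sparql-results+json"},
        timeout=70,  # client-side cap, slightly above the server timeout
    )
    r.raise_for_status()
    print(len(r.json()["results"]["bindings"]), "rows")
except requests.exceptions.RequestException as exc:
    # Typically an HTTP error whose body reports a server-side timeout.
    print("query failed:", exc)
```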

It would thus be useful to

  • set up a Scholia instance that can work with Wikidata dumps
  • automate that setup to
    • use the latest dump by default
    • allow users to select a specific dump to base visualizations on (see the dump-selection sketch after this list)
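
A minimal sketch of such dump selection, assuming the standard Wikimedia dumps layout (dated directories under https://dumps.wikimedia.org/wikidatawiki/entities/ plus "latest-*" symlinks); the function name is hypothetical:

```
# Sketch of dump selection, assuming the standard Wikimedia dumps layout:
# dated directories like .../entities/20230501/wikidata-20230501-all.ttl.bz2
# plus "latest-all.ttl.bz2" symlinks at the top level.
BASE = "https://dumps.wikimedia.org/wikidatawiki/entities"

def dump_url(date: str | None = None, flavor: str = "all", fmt: str = "ttl.bz2") -> str:
    """Return the URL of a Wikidata entity dump.

    date -- a dump date such as "20230501", or None for the latest dump.
    """
    if date is None:
        return f"{BASE}/latest-{flavor}.{fmt}"
    return f"{BASE}/{date}/wikidata-{date}-{flavor}.{fmt}"

print(dump_url())            # latest dump (the default)
print(dump_url("20230501"))  # a user-selected dump
```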

Event Timeline

Tonight I'm bringing up the topic of getting a proper Wikidata mirror running at the Search Platform Office Hours (https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours) - maybe you can attend to give background on what you have in mind. I can already offer our Wikidata machine at RWTH Aachen, where we run our "Get your own copy of Wikidata" experiments, as an endpoint.

It is unclear to me what "running from dumps" means. WDQS already runs from dumps, in the sense that the dumps are loaded into the Blazegraph instance backing WDQS and then updated on the fly. Loading a dump currently takes multiple weeks when we are lucky, and multiple months in case of repeated errors (see T323096 or T263110), so just loading from dumps is prohibitively expensive in the current technical context. It is also unclear how this would help queries not fail.

An approach that might be more successful is to provide an endpoint that contains only the subset of the data that is relevant to Scholia. Working with a smaller graph makes it more likely that queries complete in a timely manner. This is an approach we want to investigate as part of T335067.
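
For concreteness, here is a naive two-pass sketch of such subsetting over a truthy N-Triples dump. The scholarly-article seed (P31 = Q13442814) is an assumption about what "relevant for Scholia" means; a real subset would also need venues, authors, and other referenced entities:

```
# Naive two-pass subsetting over a truthy N-Triples dump
# (e.g. latest-truthy.nt.bz2). Pass 1 collects items typed as scholarly
# articles (P31 = Q13442814); pass 2 keeps every triple whose subject is
# in that set. Both passes stream the compressed dump, so this is slow
# but needs little memory beyond the subject set.
import bz2

TYPE_TRIPLE = ("<http://www.wikidata.org/prop/direct/P31> "
               "<http://www.wikidata.org/entity/Q13442814>")

def subset(dump_path: str, out_path: str) -> None:
    keep: set[str] = set()
    with bz2.open(dump_path, "rt", encoding="utf-8") as f:
        for line in f:                        # pass 1: find scholarly articles
            if TYPE_TRIPLE in line:
                keep.add(line.split(" ", 1)[0])   # subject IRI
    with bz2.open(dump_path, "rt", encoding="utf-8") as f, \
         open(out_path, "w", encoding="utf-8") as out:
        for line in f:                        # pass 2: keep their triples
            if line.split(" ", 1)[0] in keep:
                out.write(line)

# subset("latest-truthy.nt.bz2", "scholia-subset.nt")
```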

I'm closing this for now. Feel free to reopen if you think I'm missing the point.

Please reopen - it's linked to the "Get your own copy of Wikidata" Wikimedia-Hackathon task and might need renaming and a statement of the concrete goals. E.g. for the subsetting, this is linked to https://github.com/ad-freiburg/qlever/issues/859.
So having different (and not time-limited) endpoints that support a subset of Wikidata is actually one of the issues. See also https://cr.bitplan.com/index.php/List_of_Queries, where a similar set of queries is needed (but the timeout issue is not as big).
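
Once such a subset endpoint exists (e.g. a QLever instance loaded with the subset), Scholia-style queries could be pointed at it by swapping the endpoint URL. Below is a sketch assuming a local instance that speaks the standard SPARQL protocol; the host, port, and function name are assumptions, and note that, unlike WDQS, a plain endpoint will not inject the wd:/wdt: prefixes, so queries must declare them:

```
# Sketch: run a query against a hypothetical local subset endpoint
# (e.g. QLever) instead of WDQS. Only the URL changes; the query must
# declare its own PREFIXes, since WDQS injects them but a generic
# SPARQL endpoint does not.
import requests

SUBSET_ENDPOINT = "http://localhost:7001"  # hypothetical local instance

def run(query: str, endpoint: str = SUBSET_ENDPOINT) -> list[dict]:
    r = requests.post(
        endpoint,
        data={"query": query},  # SPARQL protocol: form-encoded query
        headers={"Accept": "application/sparql-results+json"},
        timeout=300,
    )
    r.raise_for_status()
    return r.json()["results"]["bindings"]
```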