[Objective Fiscal 19-20/Q2] (9) Provide a Proof of Concept SPARQL endpoint in support of SDoC project (stretch)
Open, Normal, Public

Description

As we add structured data on Commons, the next natural step would be to create a query service, akin to query.wikidata.org, to query this data. This requires figuring out:

  • Will this be a separate service, or will the data reside inside the same service?
  • If separate, how will it be hosted? What would the UI be?
  • RDF dumps for Commons data
  • Does RDF properly support all the kinds of data we need for Commons?
  • Checking that WDQS tools can be easily changed to work with a non-Wikidata store (assumptions in the past prevented this and may still be present, such as assuming that item IDs start with Q, which may not be true for Commons)
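
To illustrate the last point, here is a minimal Python sketch of ID parsing that does not hard-code the Q prefix. The prefix table, the `EntityId` type, and the `parse_entity_id` helper are hypothetical names for illustration, not part of any WDQS tool, and the `M` prefix for Commons media entities is an assumption here.

```python
import re
from typing import NamedTuple

# Hypothetical prefix table. "Q" and "P" are the Wikidata item and property
# prefixes; "M" for Commons media entities is an assumption of this sketch.
ENTITY_PREFIXES = {"Q": "item", "P": "property", "M": "mediainfo"}


class EntityId(NamedTuple):
    kind: str
    number: int


# One uppercase letter followed by a positive integer without leading zeros.
_ID_PATTERN = re.compile(r"([A-Z])([1-9][0-9]*)")


def parse_entity_id(raw: str) -> EntityId:
    """Parse an entity ID without assuming that it starts with Q."""
    match = _ID_PATTERN.fullmatch(raw.strip().upper())
    if match is None or match.group(1) not in ENTITY_PREFIXES:
        raise ValueError(f"unrecognised entity ID: {raw!r}")
    return EntityId(ENTITY_PREFIXES[match.group(1)], int(match.group(2)))
```

Keeping the prefixes in a table means that a tool only needs a new table entry, not new parsing logic, when another entity type appears.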

We will probably need to create specific subtasks for this as it becomes more clear.

Event Timeline

Restricted Application added subscribers: Poyekhali, Steinsplitter, Aklapper. Jul 29 2016, 7:52 AM
Smalyshev triaged this task as Normal priority. Jul 29 2016, 7:57 AM

Things to consider from the user perspective:

  • People will want to query the mediainfo data and the data currently in Wikidata together. This should be easy. My assumption is that it is harder if we split it into two endpoints but I might be wrong. A possible query: "Give me all featured/high-quality images of paintings in the Louvre by artists from the Netherlands"
  • We need to consider two different ways of accessing the data. The first one is search (easy, just type 1 or 2 words) and the second one is querying (complex, powerful, SPARQL directly?). Both will need to work on the structured data in mediainfo.

Technically, it is possible to keep both datasets in the same DB. We will have to think about the following, though:

  1. Namespacing:
    1. Will it have the same ontology? Probably yes.
    2. Will it have the same prefixes? Probably not. We will have to account for that.
  2. Updates - we'll need a separate updater
  3. Dump reload - how can we reload the dump for Commons without wiping the data for Wikidata?

There are a number of ways to have separation in Blazegraph:

  • Namespaces - these seem to be completely disjoint. Cross-querying may be possible via federation, but may be tricky.
  • Graphs - these live in the same namespace and can be cross-queried, but will probably require some changes so that old queries still work, and will probably also require running Blazegraph in quads mode, which may have some restrictions.
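
As a rough sketch of how the two options would differ for whoever writes the query, the Python below builds the same cross-dataset query in both shapes: federation via a SPARQL 1.1 `SERVICE` clause, and named graphs via `GRAPH` clauses. The graph IRIs and the helper names are hypothetical; nothing here reflects an actual Blazegraph configuration.

```python
# Public WDQS SPARQL endpoint, used as the federation target.
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

# Hypothetical graph IRIs, placeholders for whatever a quads-mode setup would use.
COMMONS_GRAPH = "http://example.org/graph/commons"
WIKIDATA_GRAPH = "http://example.org/graph/wikidata"


def federated_query(commons_pattern: str, wikidata_pattern: str) -> str:
    """Separate stores: the Wikidata part is delegated through SERVICE."""
    return (
        "SELECT * WHERE {\n"
        f"  {commons_pattern}\n"
        f"  SERVICE <{WIKIDATA_ENDPOINT}> {{\n"
        f"    {wikidata_pattern}\n"
        "  }\n"
        "}"
    )


def named_graph_query(commons_pattern: str, wikidata_pattern: str) -> str:
    """One store in quads mode: each dataset lives in its own named graph."""
    return (
        "SELECT * WHERE {\n"
        f"  GRAPH <{COMMONS_GRAPH}> {{ {commons_pattern} }}\n"
        f"  GRAPH <{WIKIDATA_GRAPH}> {{ {wikidata_pattern} }}\n"
        "}"
    )


# Example patterns with full IRIs, so that no PREFIX declarations are needed.
commons = "?file <http://www.wikidata.org/prop/direct/P180> ?item ."
wikidata = "?item <http://www.wikidata.org/prop/direct/P31> ?class ."
print(federated_query(commons, wikidata))
print(named_graph_query(commons, wikidata))
```

In both shapes an existing Wikidata-only query would need to be wrapped or rewritten, which is the "old queries still work" concern mentioned above.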

Tgr added a subscriber: Tgr. Jul 29 2016, 10:06 PM

Worth considering that all wikis will probably have local mediainfo eventually, as we want to be able to add metadata to images the same way everywhere, and most images cannot be moved to Commons.

SandraF_WMF moved this task from Backlog to Stories on the SDC General board. Aug 31 2017, 7:15 PM
Lydia_Pintscher moved this task from incoming to monitoring on the Wikidata board. Dec 18 2017, 3:26 PM
Jarekt added a subscriber: Jarekt. Jan 11 2019, 6:03 PM

Currently most queries on Commons use SQL, through tools like https://quarry.wmflabs.org. I assume this ticket is for SPARQL queries. A very useful feature would be a service that combines the two, for example to find all the files that use some template but do not have some property set. We could use that to populate new properties based on the current template structure.

Jdforrester-WMF renamed this task from [Story] Query service for structured data on commons to [Story] Provide a SPARQL query service for structured data on Commons. Jan 12 2019, 8:04 PM
Tpt added a subscriber: Tpt. Jan 28 2019, 2:40 PM
MB-one added a subscriber: MB-one. Apr 24 2019, 10:31 AM
Jheald added a subscriber: Jheald. May 1 2019, 11:20 AM
Abbe98 added a subscriber: Abbe98. May 14 2019, 11:36 AM
Ainali added a subscriber: Ainali. Aug 4 2019, 9:27 PM
Husky added a subscriber: Husky. Aug 16 2019, 11:16 PM
Mmarx added a subscriber: Mmarx. Aug 27 2019, 11:03 AM

Here is my use case, just to illustrate the usefulness of having a SPARQL endpoint.

I want to find all Wikidata items that:

  • Are used as a "depicts" value by at least one Wikimedia Commons file.
  • Do NOT have a P18 (image) property yet.

The query would list each Wikidata item, and coalesce the URLs of each file.

That would allow me to manually add relevant P18 values to each of these items, the end goal being to make this depiction selection screen more user-friendly.
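
A minimal sketch of what such a query could look like, wrapped in a small Python helper. It assumes a single store that holds both the Commons and the Wikidata triples, that files carry "depicts" as `wdt:P180`, and that file URLs are exposed through `schema:contentUrl`; those modelling choices and the helper name are assumptions, not a description of an existing endpoint.

```python
# SPARQL template; doubled braces are literal braces for str.format().
MISSING_IMAGE_QUERY_TEMPLATE = """\
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX schema: <http://schema.org/>

SELECT ?item (GROUP_CONCAT(DISTINCT STR(?url); separator=" ") AS ?fileUrls)
WHERE {{
  # Assumption: Commons files link to the depicted Wikidata item via P180.
  ?file wdt:P180 ?item ;
        schema:contentUrl ?url .
  # Keep only items that have no P18 (image) statement yet.
  FILTER NOT EXISTS {{ ?item wdt:P18 ?image }}
}}
GROUP BY ?item
LIMIT {limit}
"""


def build_missing_image_query(limit: int = 100) -> str:
    """Return the query text, capped at `limit` result rows."""
    if isinstance(limit, bool) or not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    return MISSING_IMAGE_QUERY_TEMPLATE.format(limit=limit)


print(build_missing_image_query(10))
```

`GROUP_CONCAT` does the coalescing: each item appears once, with the URLs of all files that depict it joined into one space-separated string. If the two datasets ended up in separate services, the `FILTER NOT EXISTS` part would have to be evaluated against Wikidata through federation instead.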

Gehel renamed this task from [Story] Provide a SPARQL query service for structured data on Commons to [Objective Q2] Provide a Proof of Concept SPARQL endpoint in support of SDoC project (stretch). Fri, Oct 18, 12:46 PM
Gehel moved this task from in progress to Epics on the Discovery-Search (Current work) board.
Gehel renamed this task from [Objective Q2] Provide a Proof of Concept SPARQL endpoint in support of SDoC project (stretch) to [Objective Q2] (9) Provide a Proof of Concept SPARQL endpoint in support of SDoC project (stretch). Fri, Nov 8, 3:57 PM
Gehel renamed this task from [Objective Q2] (9) Provide a Proof of Concept SPARQL endpoint in support of SDoC project (stretch) to [Objective Fiscal 19-20/Q2] (9) Provide a Proof of Concept SPARQL endpoint in support of SDoC project (stretch). Fri, Nov 8, 3:59 PM
Husky added a comment. Mon, Nov 11, 9:48 AM

Hey, as a proof of concept, I've adapted my VizQuery tool to use the beta endpoint.

You can find the experimental VizQuery version over here. As an example, here is a query that finds all images that have a depicts (P180) -> cat (Q146) statement.
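
For readers without access to the linked example, a query of that shape might look roughly like the sketch below. This is not the query VizQuery generates; the helper name and the result limit are made up for illustration, and it assumes the endpoint exposes depicts statements as `wdt:P180`.

```python
import re

# The depicts value is a Wikidata item, so a Q-prefixed ID is expected here.
_ITEM_ID = re.compile(r"Q[1-9][0-9]*")


def build_depicts_query(item_id: str, limit: int = 50) -> str:
    """Build a SPARQL query for files with a depicts (P180) -> item statement."""
    if _ITEM_ID.fullmatch(item_id) is None:
        raise ValueError(f"not a Wikidata item ID: {item_id!r}")
    return (
        "PREFIX wd: <http://www.wikidata.org/entity/>\n"
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        "SELECT ?file WHERE {\n"
        f"  ?file wdt:P180 wd:{item_id} .\n"
        "}\n"
        f"LIMIT {limit}"
    )


# Q146 is the item for "cat" mentioned above.
print(build_depicts_query("Q146"))
```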