[Objective Fiscal 19-20/Q4] (9) Provide a Proof of Concept SPARQL endpoint in support of SDoC project
Closed, Duplicate (Public)

Description

As we add structured data on Commons, the next natural step would be to create a query service, akin to query.wikidata.org, to query this data. This requires figuring out:

  • Will this be a separate service, or will the data reside inside the same service?
  • If separate, how will it be hosted? What would the UI be?
  • RDF dumps for Commons data
  • Does RDF properly support all the kinds of data we need for Commons?
  • Checking that WDQS tools can be easily changed to work with a non-Wikidata store (assumptions in the past prevented this and may still apply, such as assuming item IDs start with Q, which may not be true for Commons)
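
To illustrate the last point, here is a minimal sketch of entity ID parsing that does not hard-code the Q prefix. The prefix table, including "M" for Commons media entities, and the function name `parse_entity_id` are assumptions made for illustration, not taken from the WDQS code.

```python
import re

# Assumed prefix table; "M" for Commons media entities is an assumption here.
ENTITY_PREFIXES = {"Q": "item", "P": "property", "M": "mediainfo"}

_ENTITY_ID = re.compile(r"([A-Z])([1-9][0-9]*)")


def parse_entity_id(entity_id, prefixes=ENTITY_PREFIXES):
    """Split an entity ID into (entity type, numeric part).

    The set of accepted prefixes is a parameter, so a non-Wikidata
    store can pass its own table instead of relying on "Q".
    """
    match = _ENTITY_ID.fullmatch(entity_id)
    if match is None or match.group(1) not in prefixes:
        raise ValueError(f"unrecognised entity ID: {entity_id!r}")
    letter, number = match.groups()
    return prefixes[letter], int(number)
```

A tool written this way would accept `parse_entity_id("M123")` as readily as `parse_entity_id("Q146")`, and would reject IDs whose prefix is not in the table.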

We will probably need to create specific subtasks for this as the picture becomes clearer.

Event Timeline

Smalyshev triaged this task as Medium priority. Jul 29 2016, 7:57 AM

Things to consider from the user perspective:

  • People will want to query the mediainfo data and the data currently in Wikidata together. This should be easy. My assumption is that it is harder if we split it into two endpoints but I might be wrong. A possible query: "Give me all featured/high-quality images of paintings in the Louvre by artists from the Netherlands"
  • We need to consider two different ways of accessing the data. The first one is search (easy, just type 1 or 2 words) and the second one is querying (complex, powerful, SPARQL directly?). Both will need to work on the structured data in mediainfo.

Technically, it is possible to keep both datasets in the same DB. However, we will have to think about:

  1. Namespacing:
    1. Will it have the same ontology? Probably yes.
    2. Will it have the same prefixes? Probably not. We will have to account for that.
  2. Updates: we'll need a separate updater.
  3. Dump reload: how can we reload the dump for Commons without wiping the data for Wikidata?

There are a number of ways to have separation in Blazegraph:

  • Namespaces - those seem to be completely disjoint. Cross-querying may be possible via federation, but may be tricky.
  • Graphs - these live in the same namespace, and can be cross-queried, but probably will require some changes so old queries still work, and also probably running Blazegraph in quads mode which may have some restrictions.

Worth considering that all wikis will probably have local mediainfo eventually, as we want to be able to add metadata to images the same way everywhere, and most images cannot be moved to Commons.

Currently most queries on Commons use SQL, through tools like https://quarry.wmflabs.org. I assume this ticket is for SPARQL queries. A very useful feature would be a service that combines the two. For example, it could find all the files that use some template but do not have some property set. We could use that to populate new properties based on the current template structure.

Jdforrester-WMF renamed this task from [Story] Query service for structured data on commons to [Story] Provide a SPARQL query service for structured data on Commons. Jan 12 2019, 8:04 PM

Here is my use case, just to illustrate the usefulness of having a SPARQL endpoint.

I want to find all Wikidata items that:

  • Are used as a "depicts" value by at least one Wikimedia Commons file.
  • Do NOT have a P18 (image) property yet.

The query would list each Wikidata item, and coalesce the URLs of each file.

That would allow me to manually add relevant P18 values to each of these items, the end goal being to make this depiction selection screen more user-friendly.
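
For illustration, the use case above could be expressed roughly as follows. This is only a sketch: it assumes the Commons statements and the Wikidata statements are queryable from a single store and that both use the `wdt:` direct-property prefix, neither of which this task has settled. The helper name `build_missing_image_query` is made up for this example.

```python
# The wdt: prefix as used on Wikidata; reusing it for Commons is an assumption.
WDT = "http://www.wikidata.org/prop/direct/"


def build_missing_image_query(depicts="P180", image="P18", limit=100):
    """Build a SPARQL query listing items that are depicted by at least
    one file but have no image statement, with the file URIs coalesced."""
    for pid in (depicts, image):
        if not (pid.startswith("P") and pid[1:].isdigit()):
            raise ValueError(f"not a property ID: {pid!r}")
    return "\n".join([
        f"PREFIX wdt: <{WDT}>",
        'SELECT ?item (GROUP_CONCAT(STR(?file); SEPARATOR=" ") AS ?files)',
        "WHERE {",
        f"  ?file wdt:{depicts} ?item .",
        f"  FILTER NOT EXISTS {{ ?item wdt:{image} ?image . }}",
        "}",
        "GROUP BY ?item",
        f"LIMIT {int(limit)}",
    ])


if __name__ == "__main__":
    print(build_missing_image_query())
```

If the two datasets end up behind separate endpoints, the `FILTER NOT EXISTS` part would presumably have to move into a federated `SERVICE` clause instead.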

Gehel renamed this task from [Story] Provide a SPARQL query service for structured data on Commons to [Objective Q2] Provide a Proof of Concept SPARQL endpoint in support of SDoC project (stretch). Oct 18 2019, 12:46 PM
Gehel moved this task from Incoming to Epics on the Discovery-Search (Current work) board.
Gehel renamed this task from [Objective Q2] Provide a Proof of Concept SPARQL endpoint in support of SDoC project (stretch) to [Objective Q2] (9) Provide a Proof of Concept SPARQL endpoint in support of SDoC project (stretch). Nov 8 2019, 3:57 PM
Gehel renamed this task from [Objective Q2] (9) Provide a Proof of Concept SPARQL endpoint in support of SDoC project (stretch) to [Objective Fiscal 19-20/Q2] (9) Provide a Proof of Concept SPARQL endpoint in support of SDoC project (stretch). Nov 8 2019, 3:59 PM

Hey, as a proof of concept, I've adapted my VizQuery tool to use the beta endpoint.

You can find the experimental VizQuery version over here. As an example, here is a query that finds all images that have a depicts (P180) -> cat (Q146) statement.

Hey, as a proof of concept, i've adapted my VizQuery tool to use the beta endpoint.

Husky, I have never heard of the beta endpoint. Is there some documentation for it, or do you know who is running it?

@Jarekt afaik this is only a proof of concept. However, the data does seem to have been updated (previously it only had the data from April or so). I don't know who is currently maintaining the beta, but maybe @Lea_Lacroix_WMDE or @Lydia_Pintscher knows.

Husky, I have never heard of beta endpoint. Is there some documentation for it or who is running it?

This endpoint is a complete work in progress, experimental and quite broken at the moment. And that's why it isn't documented yet. The goal is to expose a SPARQL endpoint similar to WDQS, but with the structured data from Commons.

Feel free to play around with it, but expect it to crash randomly, to not be updated and to have various issues. At this point, the last data import was a few months ago. We will provide more regular updates to the data when we have dumps available. Ideally at some point we will have streaming updates and hopefully we will deploy a real production service in the future.

The Search Platform team is in charge of this part of the project.

Yes @Gehel, thanks for the reply. One use case I can think of now is finding Wikidata item IDs that are used in depicts and other statements and that point to redirects. On Wikidata there are some bots that replace redirected IDs with the new IDs, but on SDC there is no way at the moment to run a query to find such IDs.
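
As a rough illustration of that check, here is what it amounts to once the statements and the redirect list are both available. It works on plain in-memory data rather than on any real endpoint; the function name `find_redirected_values` and the tuple layout are hypothetical.

```python
def find_redirected_values(statements, redirects):
    """Return statements whose value is a redirected item ID.

    statements: iterable of (file_id, property_id, value_id) tuples.
    redirects:  dict mapping an old (redirected) item ID to its target.
    Each hit is (file_id, property_id, old_value, new_value), which is
    what a bot would need in order to rewrite the statement.
    """
    hits = []
    for file_id, property_id, value_id in statements:
        target = redirects.get(value_id)
        if target is not None:
            hits.append((file_id, property_id, value_id, target))
    return hits
```

All IDs passed in would come from a dump or a query result. A SPARQL endpoint would let the same join run on the server side instead.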

Gehel renamed this task from [Objective Fiscal 19-20/Q2] (9) Provide a Proof of Concept SPARQL endpoint in support of SDoC project (stretch) to [Objective Fiscal 19-20/Q4] (9) Provide a Proof of Concept SPARQL endpoint in support of SDoC project. Apr 30 2020, 7:58 AM