[Objective Fiscal 19-20/Q4] (9) Provide a Proof of Concept SPARQL endpoint in support of SDoC project
Closed, Duplicate · Public

Assigned To

None

Authored By

	Smalyshev
	Jul 29 2016, 7:52 AM

Description

As we add structured data on Commons, the next natural step would be to create a query service, akin to query.wikidata.org, to query this data. This requires figuring out:

Will this be a separate service, or will the data reside inside the same service?
If separate, how will it be hosted? What would the UI be?
RDF dumps for Commons data
Does RDF properly support all the kinds of data we need for Commons?
Checking that WDQS tools can be easily changed to work with a non-Wikidata store (assumptions in the past prevented this and may still do so, such as assuming that item IDs start with Q, which may not be true for Commons)
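
As an illustration of the last point, here is a minimal sketch of ID handling that does not hard-code the Q prefix. It assumes that MediaInfo entities on Commons use IDs of the form M followed by a number, alongside the Wikidata Q (item) and P (property) forms. The function name and the type labels are hypothetical and are not taken from the WDQS codebase.

```python
import re
from typing import Tuple

# Assumed prefix letters: Q (item) and P (property) as on Wikidata,
# M for MediaInfo entities on Commons. Extend the table for other stores.
ENTITY_TYPES = {"Q": "item", "P": "property", "M": "mediainfo"}
ENTITY_ID_RE = re.compile(r"^([A-Z])([1-9][0-9]*)$")


def parse_entity_id(entity_id: str) -> Tuple[str, int]:
    """Split an entity ID into (entity type, numeric part).

    Raises ValueError for anything that is not a known prefix letter
    followed by a positive integer without leading zeros.
    """
    match = ENTITY_ID_RE.match(entity_id)
    if match is None or match.group(1) not in ENTITY_TYPES:
        raise ValueError(f"unrecognised entity ID: {entity_id!r}")
    return ENTITY_TYPES[match.group(1)], int(match.group(2))
```

Keeping the prefix table in one place means a tool that currently assumes Q only needs one lookup changed, rather than a string check at every call site.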

We will probably need to create specific subtasks for this as the requirements become clearer.

Related Objects

Status	Subtype	Assigned	Task
Declined		dchen	T118706 Conduct heuristic evaluation of image upload and insert flow in VisualEditor
Open		None	T115858 Design improvements for mw.ForeignStructuredUpload.BookletLayout
Open		None	T115865 Insert image in content immediately after it's uploaded, skipping the "General settings" step
Duplicate		None	T115864 Figure out if the description of the image can be used as the caption on-wiki
Open	Feature	None	T53032 When inserting an image, set its caption by default to be the Commons image description
Open	Feature	None	T39534 Wikimedia Commons should support searching by color
Duplicate		None	T39535 Wikimedia Commons should support filtering by color
Resolved		None	T19503 Provide metadata support on Wikimedia Commons
Resolved		None	T51662 VisualEditor: Use Multimedia/Wikidata's proposed rich structured meta-data in the image insertion dialog
Resolved		None	T68108 [Epic] Store media information for files on Wikimedia Commons as structured data
Duplicate		None	T141602 [Objective Fiscal 19-20/Q4] (9) Provide a Proof of Concept SPARQL endpoint in support of SDoC project
Duplicate		Cparle	T194401 Investigate storing commons data in BlazeGraph
Resolved		ArielGlenn	T221917 Create RDF dump of structured data on Commons
Resolved		Smalyshev	T221916 Create RDF export for structured data stored for files
Resolved		Smalyshev	T222299 dumpRdf does not work with MediaInfo entities
Resolved		Smalyshev	T222302 Too many RDF label forms for MediaInfo
Resolved		Smalyshev	T222306 RDF export generates wrong IDs for federated entities
Resolved		Gehel	T222321 Make /entity/ alias work for Commons
Resolved		Smalyshev	T222995 Decide which prefixes to use for MediaInfo RDF
Resolved		None	T230840 Set up proper prefix configuration for RDF export on Commons
Resolved		Cparle	T230856 RDF dump performance for SDC
Resolved		Cparle	T222497 dumpRDF for MediaInfo entities loads each page individually
Resolved		ArielGlenn	T239905 dumpRdf for mediainfo entities loads data from db more often than it needs to
Resolved		None	T221921 Provision search endpoint for SDC. Requirements from Product Team.
Resolved		Smalyshev	T229608 Support SDC URIs in WDQS URI schemes
Resolved		Cparle	T230862 Create a way to filter only WB-related changes from Commons recentchanges
Resolved		• Mathew.onipe	T232297 Refactor Puppet WDQS module to make it usable for wdqs and cqs
Resolved		Gehel	T251488 [epic] Create minimal SPARQL Endpoint for Commons on WMCS
Resolved		EBernhardson	T237089 Create CQS puppet configs by applying query_service module
Resolved		EBernhardson	T251489 Validate that we have enough resources on WMCS for a SPARQL Endpoint for Commons
Resolved		Gehel	T251490 Load data into the SPARQL Endpoint for Commons
Resolved		Gehel	T251496 Validate and fix TTL dumps of SDoC
Resolved		Gehel	T251497 Adapt munging process for SDoC
Resolved		• Zbyszko	T243292 Fix the munger to support commons RDF dump
Resolved		• Zbyszko	T251491 Enable federation between SPARQL Endpoint for Commons and WDQS
Resolved		• Zbyszko	T251498 Access restriction for SPARQL Endpoint for Commons
Resolved		• Zbyszko	T251499 Minimal authentication for SPARQL Endpoint for Commons
Resolved		• Zbyszko	T251500 oAuth authentication for SPARQL Endpoint for Commons
Resolved		Gehel	T251514 UI for SPARQL Endpoint for Commons
Resolved		• Zbyszko	T258625 Querying WCQS should allow me to use prefixes for MediaInfo items
Open		None	T258627 Autocompletion for MediaInfo items on WMCS
Resolved		• Zbyszko	T251515 Automate data reload for SPARQL Endpoint for Commons
Resolved		bd808	T257336 Request increased quota for wikidata-query Cloud VPS project
Resolved		Gehel	T258489 Create a Examples Page for WCQS
Duplicate		None	T259884 Maintenance page during data reloads
Resolved		Gehel	T259543 Notify users that wcqs is reloading its data
Resolved		• Zbyszko	T262828 Near zero downtime Data reload for WCQS

Event Timeline

Smalyshev created this task.Jul 29 2016, 7:52 AM

Restricted Application added subscribers: Poyekhali, Steinsplitter, Aklapper.Jul 29 2016, 7:52 AM

Smalyshev added a parent task: T68108: [Epic] Store media information for files on Wikimedia Commons as structured data.Jul 29 2016, 7:53 AM

Smalyshev triaged this task as Medium priority.Jul 29 2016, 7:57 AM

Things to consider from the user perspective:

People will want to query the mediainfo data and the data currently in Wikidata together. This should be easy. My assumption is that it is harder if we split it into two endpoints but I might be wrong. A possible query: "Give me all featured/high-quality images of paintings in the Louvre by artists from the Netherlands"
We need to consider two different ways of accessing the data. The first one is search (easy, just type 1 or 2 words) and the second one is querying (complex, powerful, SPARQL directly?). Both will need to work on the structured data in mediainfo.

Danny_B added a project: Story.Jul 29 2016, 10:24 AM

The search feature is covered by T76011: commons file search that includes structured data.

Technically, it is possible to keep both datasets in the same DB. However, we will have to think about:

Namespacing:
1. Will it have the same ontology? Probably yes.
2. Will it have the same prefixes? Probably not. We will have to account for that.
Updates - we'll need a separate updater
Dump reload - how can we reload the dump for Commons without wiping the data for Wikidata?

There are a number of ways to have separation in Blazegraph:

Namespaces - those seem to be completely disjoint. Cross-querying may be possible via federation, but may be tricky.
Graphs - these live in the same namespace, and can be cross-queried, but probably will require some changes so old queries still work, and also probably running Blazegraph in quads mode which may have some restrictions.

Worth considering that all wikis will probably have local mediainfo eventually, as we want to be able to add metadata to images the same way everywhere, and most images cannot be moved to Commons.

• SandraF_WMF moved this task from Backlog to Stories on the SDC General board.Aug 31 2017, 7:15 PM

• Ramsey-WMF subscribed.Aug 31 2017, 11:40 PM

Lydia_Pintscher moved this task from incoming to monitoring on the Wikidata board.Dec 18 2017, 3:26 PM

Jheald mentioned this in T194401: Investigate storing commons data in BlazeGraph.Jul 14 2018, 10:55 PM

Currently most queries on Commons use SQL, for example through https://quarry.wmflabs.org. I assume this ticket is for SPARQL queries. A very useful feature would be a service that combines the two: for example, finding all the files that use some template but do not have some property set. We could use that to populate new properties based on the current template structure.

Jdforrester-WMF renamed this task from [Story] Query service for structured data on commons to [Story] Provide a SPARQL query service for structured data on Commons.Jan 12 2019, 8:04 PM

Addshore subscribed.Jan 17 2019, 3:40 PM

Tpt subscribed.Jan 28 2019, 2:40 PM

MB-one subscribed.Apr 24 2019, 10:31 AM

Salgo60 subscribed.Apr 24 2019, 10:17 PM

Jheald added a subtask: T194401: Investigate storing commons data in BlazeGraph.Apr 25 2019, 9:15 AM

Lucas_Werkmeister_WMDE subscribed.Apr 25 2019, 3:23 PM

Smalyshev edited projects, added Wikidata-Query-Service, Epic; removed Story.Apr 25 2019, 6:24 PM

Jheald subscribed.May 1 2019, 11:20 AM

Smalyshev moved this task from Incoming to SDAW on the Wikidata-Query-Service board.May 2 2019, 9:33 PM

ChristianFerrer subscribed.May 3 2019, 6:36 PM

Abbe98 subscribed.May 14 2019, 11:36 AM

JeanFred subscribed.Aug 1 2019, 7:57 PM

VIGNERON subscribed.Aug 2 2019, 6:03 PM

Ainali subscribed.Aug 4 2019, 9:27 PM

Silverfish subscribed.Aug 10 2019, 8:16 PM

Smalyshev mentioned this in T230247: Increase VCPU quota for wikidata-query project.Aug 12 2019, 7:43 AM

jleedev subscribed.Aug 14 2019, 3:00 AM

Husky subscribed.Aug 16 2019, 11:16 PM

Smalyshev closed subtask T221916: Create RDF export for structured data stored for files as Resolved.Aug 19 2019, 7:01 PM

Nemo_bis subscribed.Aug 19 2019, 8:57 PM

Smalyshev closed subtask T229608: Support SDC URIs in WDQS URI schemes as Resolved.Aug 23 2019, 12:07 AM

Multichill subscribed.Aug 26 2019, 7:45 PM

Mmarx subscribed.Aug 27 2019, 11:03 AM

Nicolas_Raoul awarded a token.Aug 28 2019, 8:01 AM

Here is my use case, just to illustrate the usefulness of having a SPARQL endpoint.

I want to find all Wikidata items that:

Are used as a "depicts" value by at least one Wikimedia Commons file.
Do NOT have a P18 (image) property yet.

The query would list each Wikidata item, and coalesce the URLs of each file.

That would allow me to manually add relevant P18 values to each of these items, the end goal being to make this depiction selection screen more user-friendly.
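
A rough sketch of how this use case might be expressed, written as a small Python helper that only builds the SPARQL text and does not send it anywhere. It assumes that the Commons endpoint exposes statements through the Wikidata `wdt:` direct-property prefix, and that it can federate to the Wikidata endpoint at https://query.wikidata.org/sparql. The helper names are hypothetical. The query collects the entity URIs of the files; turning those into file URLs would need an extra triple pattern that is not shown here.

```python
import re

# Assumption: the Commons endpoint may federate to this Wikidata endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

PREFIXES = "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"


def _check_property(prop_id: str) -> str:
    """Accept only IDs shaped like P123, so nothing else reaches the query."""
    if not re.fullmatch(r"P[1-9][0-9]*", prop_id):
        raise ValueError(f"not a property ID: {prop_id!r}")
    return prop_id


def build_missing_image_query(depicts: str = "P180",
                              image: str = "P18",
                              limit: int = 100) -> str:
    """Items used as a depicts value on Commons that lack an image on Wikidata."""
    depicts = _check_property(depicts)
    image = _check_property(image)
    if limit < 1:
        raise ValueError("limit must be positive")
    return (
        f"{PREFIXES}"
        'SELECT ?item (GROUP_CONCAT(STR(?file); separator=" ") AS ?files) WHERE {\n'
        f"  ?file wdt:{depicts} ?item .\n"
        "  FILTER NOT EXISTS {\n"
        f"    SERVICE <{WDQS_ENDPOINT}> {{ ?item wdt:{image} ?image }}\n"
        "  }\n"
        "}\n"
        "GROUP BY ?item\n"
        f"LIMIT {limit}\n"
    )


if __name__ == "__main__":
    print(build_missing_image_query(limit=10))
```

A federated check per item is likely to be slow on a large result set, so the `limit` parameter is there to keep experiments small.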

Gehel added a project: Discovery-Wikidata-Query-Service-Sprint.Oct 16 2019, 1:21 PM

Gehel merged a task: T235643: [Epic] Provide a Proof of Concept SPARQL endpoint in support of SDoC project.

Gehel added a subtask: T232297: Refactor Puppet WDQS module to make it usable for wdqs and cqs.

Gehel subscribed.

Gehel renamed this task from [Story] Provide a SPARQL query service for structured data on Commons to [Objective Q2] Provide a Proof of Concept SPARQL endpoint in support of SDoC project (stretch).Oct 18 2019, 12:46 PM

Gehel added a project: Discovery-Search (Current work).

Gehel moved this task from Incoming to Epics on the Discovery-Search (Current work) board.

Gehel renamed this task from [Objective Q2] Provide a Proof of Concept SPARQL endpoint in support of SDoC project (stretch) to [Objective Q2] (9) Provide a Proof of Concept SPARQL endpoint in support of SDoC project (stretch).Nov 8 2019, 3:57 PM

Gehel renamed this task from [Objective Q2] (9) Provide a Proof of Concept SPARQL endpoint in support of SDoC project (stretch) to [Objective Fiscal 19-20/Q2] (9) Provide a Proof of Concept SPARQL endpoint in support of SDoC project (stretch).Nov 8 2019, 3:59 PM

Hey, as a proof of concept, I've adapted my VizQuery tool to use the beta endpoint.

You can find the experimental VizQuery version over here. As an example, here is a query that finds all images that have a depicts (P180) -> cat (Q146) statement.
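
For readers without access to the linked example, a query of this kind could look roughly like the sketch below. It is not the query VizQuery generates. It assumes the endpoint accepts the standard Wikidata `wd:` and `wdt:` prefixes for statements on files, and the helper name is hypothetical. The code only builds the SPARQL text.

```python
import re

# Standard Wikidata prefixes; assumed to apply to statements on Commons files.
PREFIXES = (
    "PREFIX wd: <http://www.wikidata.org/entity/>\n"
    "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
)


def build_depicts_query(item_id: str = "Q146", limit: int = 10) -> str:
    """Build a query for files with a depicts (P180) statement for item_id."""
    if not re.fullmatch(r"Q[1-9][0-9]*", item_id):
        raise ValueError(f"not an item ID: {item_id!r}")
    if limit < 1:
        raise ValueError("limit must be positive")
    return (
        f"{PREFIXES}"
        "SELECT ?file WHERE {\n"
        f"  ?file wdt:P180 wd:{item_id} .\n"
        "}\n"
        f"LIMIT {limit}\n"
    )


if __name__ == "__main__":
    # Q146 is the item for "cat" mentioned above.
    print(build_depicts_query("Q146"))
```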

Gehel removed a project: Discovery-Wikidata-Query-Service-Sprint.Nov 12 2019, 2:07 PM

TJones closed subtask T232297: Refactor Puppet WDQS module to make it usable for wdqs and cqs as Resolved.Nov 20 2019, 4:56 PM

• PDrouin-WMF subscribed.Nov 25 2019, 8:16 PM

In T141602#5652512, @Husky wrote:

Hey, as a proof of concept, I've adapted my VizQuery tool to use the beta endpoint.

Husky, I have never heard of the beta endpoint. Is there some documentation for it? All the examples and help pages are for Wikidata.

In T141602#5652512, @Husky wrote:

Hey, as a proof of concept, I've adapted my VizQuery tool to use the beta endpoint.

Husky, I have never heard of the beta endpoint. Is there some documentation for it, and who is running it?

@Jarekt afaik this is only a proof of concept. However, the data does seem to have been updated (previously it only had the data from April or so). I don't know who is currently maintaining the beta, but maybe @Lea_Lacroix_WMDE or @Lydia_Pintscher knows.

In T141602#5743030, @Jarekt wrote:

Husky, I have never heard of the beta endpoint. Is there some documentation for it, and who is running it?

This endpoint is a complete work in progress, experimental and quite broken at the moment. And that's why it isn't documented yet. The goal is to expose a SPARQL endpoint similar to WDQS, but with the structured data from Commons.

Feel free to play around with it, but expect it to crash randomly, to not be updated and to have various issues. At this point, the last data import was a few months ago. We will provide more regular updates to the data when we have dumps available. Ideally at some point we will have streaming updates and hopefully we will deploy a real production service in the future.

The Search Platform team is in charge of this part of the project.

@Gehel, thanks for your response!

• Ramsey-WMF closed subtask T230862: Create a way to filter only WB-related changes from Commons recentchanges as Resolved.Dec 17 2019, 5:29 PM

Yes @Gehel, thanks for the reply. One use case I can think of now would be finding Wikidata item IDs that are used in depicts and other statements but point to redirects. On Wikidata there are some bots that replace redirected IDs with the new IDs, but on SDC there is currently no way to run a query to find such IDs.

Gamaliel awarded a token.Mar 3 2020, 5:56 PM

Gamaliel subscribed.

• SandraF_WMF mentioned this in T237989: Create a dataset of past GLAM-Wiki collaborations.Apr 1 2020, 12:12 PM

danshick-wmde subscribed.Apr 27 2020, 2:39 PM

Gehel renamed this task from [Objective Fiscal 19-20/Q2] (9) Provide a Proof of Concept SPARQL endpoint in support of SDoC project (stretch) to [Objective Fiscal 19-20/Q4] (9) Provide a Proof of Concept SPARQL endpoint in support of SDoC project.Apr 30 2020, 7:58 AM

Gehel reopened subtask T221921: Provision search endpoint for SDC. Requirements from Product Team. as Open.Jul 1 2020, 12:47 PM

Gehel closed subtask T221917: Create RDF dump of structured data on Commons as Resolved.Jul 9 2020, 1:41 PM

CBogen closed subtask T221921: Provision search endpoint for SDC. Requirements from Product Team. as Resolved.Aug 10 2020, 2:01 PM

CBogen closed this task as a duplicate of T251488: [epic] Create minimal SPARQL Endpoint for Commons on WMCS.Aug 10 2020, 2:07 PM

Gehel closed subtask T251488: [epic] Create minimal SPARQL Endpoint for Commons on WMCS as Resolved.Oct 6 2020, 2:25 PM
