Maniphest T194401

Investigate storing commons data in BlazeGraph
Closed, Duplicate (Public)

Description

If we need to use the WDQS (or something analogous to it) for some of our searches (see T194185, T194245, T194255), then we need to store Commons data in BlazeGraph.

Wikidata already does this, so perhaps we could copy their approach?

Other considerations:

Do we need another instance of BlazeGraph for Commons? A Commons Data Query Service? Or would it be appropriate to use the WDQS?
We need to decide how to store data about files (as opposed to Wikidata items).
We'd probably need a mechanism to back-populate BlazeGraph with existing Commons data.
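
One way to picture the storage and back-population questions above: each existing statement on a file could be written out as an RDF triple and bulk-loaded. This is a minimal sketch only. The Commons entity URI prefix, the `M`-prefixed file IDs, and the function names are assumptions, not something this task specifies, and reusing Wikidata's `prop/direct` predicates for file statements is just one possible modelling choice.

```python
from typing import Iterable, Iterator, Tuple

# Assumed prefix for file (MediaInfo) entities; hypothetical, for illustration.
COMMONS_ENTITY = "https://commons.wikimedia.org/entity/"
# Wikidata's entity and direct-property namespaces.
WIKIDATA_ENTITY = "http://www.wikidata.org/entity/"
WIKIDATA_DIRECT_PROP = "http://www.wikidata.org/prop/direct/"


def statement_to_ntriple(media_id: str, prop_id: str, item_id: str) -> str:
    """Serialize one file statement as a single N-Triples line."""
    return (
        f"<{COMMONS_ENTITY}{media_id}> "
        f"<{WIKIDATA_DIRECT_PROP}{prop_id}> "
        f"<{WIKIDATA_ENTITY}{item_id}> ."
    )


def backfill(statements: Iterable[Tuple[str, str, str]]) -> Iterator[str]:
    """Lazily turn existing (file, property, item) statements into triples.

    A real back-population job would read these from a dump and batch the
    output for loading; here the input is just an in-memory iterable.
    """
    for media_id, prop_id, item_id in statements:
        yield statement_to_ntriple(media_id, prop_id, item_id)


if __name__ == "__main__":
    # P180 is the "depicts" property; the file and item IDs are made up.
    for line in backfill([("M1", "P180", "Q146")]):
        print(line)
```

Generating triples lazily keeps memory use flat regardless of how many files are processed, which matters if the back-population has to cover all existing Commons data.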

Related Objects

Status	Assigned	Task
		· · ·
Resolved	Cparle	T191633 Implement searching of 'depicts' on commons
Duplicate	None	T141602 [Objective Fiscal 19-20/Q4] (9) Provide a Proof of Concept SPARQL endpoint in support of SDoC project
Duplicate	Cparle	T194401 Investigate storing commons data in BlazeGraph
		· · ·

Event Timeline

Cparle triaged this task as Medium priority. May 10 2018, 3:03 PM

Cparle created this task.

Cparle updated the task description. (Show Details) May 10 2018, 3:08 PM

Cparle updated the task description. (Show Details)

Cparle mentioned this in T194185: Implement searching of 'depicts' on commons with the 'inscription' qualifier.

Cparle mentioned this in T194245: Implement searching of 'depicts' on commons with the 'quantity' qualifier. May 10 2018, 3:12 PM

Cparle mentioned this in T194255: Implement searching of 'depicts' on commons with the 'relative position within image' qualifier.

Cparle updated the task description. (Show Details) May 10 2018, 3:17 PM

• EBjune added a subscriber: Smalyshev. May 10 2018, 4:58 PM

• EBjune moved this task from needs triage to watching / waiting on the Discovery-Search board. May 10 2018, 5:08 PM

• Ramsey-WMF moved this task from Untriaged to Next up on the Multimedia board. May 21 2018, 3:11 PM

• Vvjjkkii renamed this task from Investigate storing commons data in BlazeGraph to a7caaaaaaa. Jul 1 2018, 1:10 AM

• Vvjjkkii removed Cparle as the assignee of this task.

• Vvjjkkii raised the priority of this task from Medium to High.

• Vvjjkkii added projects: CheckUser, Connected-Open-Heritage-Batch-uploads (RAÄ-KMB_1_2017-02), Tamil-Sites, Gamepress, Hashtags, Jade, KartoEditor, Language-2018-Apr-June, New-Editor-Experiences, Mail, TCB-Team (now WMDE-TechWish).

• Vvjjkkii updated the task description. (Show Details)

• Vvjjkkii removed a subscriber: Aklapper.

CommunityTechBot renamed this task from a7caaaaaaa to Investigate storing commons data in BlazeGraph. Jul 2 2018, 5:56 AM

CommunityTechBot assigned this task to Cparle.

CommunityTechBot lowered the priority of this task from High to Medium.

CommunityTechBot updated the task description. (Show Details)

CommunityTechBot removed projects: TCB-Team (now WMDE-TechWish), Mail, New-Editor-Experiences, Language-2018-Apr-June, KartoEditor, Jade, Hashtags, Gamepress, Tamil-Sites, Connected-Open-Heritage-Batch-uploads (RAÄ-KMB_1_2017-02), CheckUser.

CommunityTechBot added a subscriber: Aklapper.

Well of course you're going to copy all the CommonsData statements to Blazegraph.

The community will have a fit if you don't make provision for us to run our own SPARQL queries against the data. See T141602.

But see also some of the issues noted on T191633#4425281 and the following comment.

Hehe, very well then @Jheald, I'll make sure it happens.

Addshore moved this task from incoming to monitoring on the Wikidata board. Sep 19 2018, 7:08 AM

Cparle added a project: Structured Data Engineering. Oct 9 2018, 10:37 AM

I am not sure whether the existing service can handle double the load on the existing infrastructure. We'd need to evaluate this. Setting up a separate service is certainly an option, but there might be issues with cross-querying: there is federation, of course, but performance would be nowhere near that of a single common database. So we need to figure out how we would be querying it and how much data we plan to handle, and evaluate from there, maybe by running some tests.
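
To make the federation option concrete, here is a rough sketch of what a cross-service query might look like if file statements lived in a separate Commons endpoint that calls out to WDQS through a SPARQL 1.1 `SERVICE` clause. The function name and the use of `wdt:P180` directly on file entities are assumptions for illustration. Only the WDQS endpoint URL and the Wikidata prefixes are existing values.

```python
import re

# Public WDQS SPARQL endpoint, used here as the remote (federated) service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_federated_query(item_id: str, remote: str = WDQS_ENDPOINT,
                          limit: int = 10) -> str:
    """Build a query for files depicting `item_id`, with the item's
    English label fetched from the remote service via SERVICE.

    Hypothetical data model: files carry wdt:P180 (depicts) triples locally.
    """
    # Reject anything that is not a plain item ID before interpolating it.
    if not re.fullmatch(r"Q[1-9]\d*", item_id):
        raise ValueError(f"not a valid item ID: {item_id!r}")
    lines = [
        "PREFIX wd: <http://www.wikidata.org/entity/>",
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>",
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
        "SELECT ?file ?label WHERE {",
        f"  ?file wdt:P180 wd:{item_id} .",   # evaluated locally
        f"  SERVICE <{remote}> {{",           # evaluated on the remote service
        f"    wd:{item_id} rdfs:label ?label .",
        '    FILTER(LANG(?label) = "en")',
        "  }",
        "}",
        f"LIMIT {int(limit)}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_federated_query("Q146"))
```

Every pattern inside the `SERVICE` block means a round trip to the other service, which is where the performance gap compared with a single shared database comes from.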

Note also that for searches there is the Elasticsearch index, and if we are talking about a search scenario, we may want to consider it first, since all searches are done this way now. So we can index "depicts" statements and such in the Elasticsearch index and work from there. Again, it all depends on which queries we want to support.
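
As a rough illustration of the indexing idea, statements could be flattened into `property=value` keyword tokens on each file's search document and matched with an exact-term query. This is a sketch under assumptions: the field name `statement_keywords`, the document shape, and the helper names are hypothetical and do not describe the actual search mapping.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed name of the keyword field that holds flattened statements.
STATEMENT_FIELD = "statement_keywords"


def statements_to_keywords(statements: Iterable[Tuple[str, str]]) -> List[str]:
    """Flatten (property, value) pairs into sorted, de-duplicated tokens."""
    return sorted({f"{prop_id}={value_id}" for prop_id, value_id in statements})


def build_index_doc(title: str, statements: Iterable[Tuple[str, str]]) -> Dict:
    """Build the (hypothetical) search document for one file."""
    return {"title": title, STATEMENT_FIELD: statements_to_keywords(statements)}


def build_term_query(prop_id: str, value_id: str) -> Dict:
    """Build an Elasticsearch term query matching one exact statement token."""
    return {"query": {"term": {STATEMENT_FIELD: f"{prop_id}={value_id}"}}}


if __name__ == "__main__":
    # P180 is "depicts"; the file title and item IDs are made up.
    print(build_index_doc("File:Example.jpg", [("P180", "Q146")]))
    print(build_term_query("P180", "Q146"))
```

Exact-token matching suits "files that depict X" lookups, but it cannot express joins or graph traversal, which is the gap a SPARQL endpoint would fill.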

Hey Stas

This is an old ticket, and yes, we are now using Elasticsearch for searching everything.

I still think we'll need to store structured data about MediaInfo items in BlazeGraph, however, so power users can run SPARQL queries.

Presumably (as the original ticket suggests), one option would be to set up a Commons Data Query Service (CDQS) as a distinct endpoint on a new server that had both Wikidata and CommonsData installed in a Blazegraph instance locally. This would allow the existing WDQS service to continue without alteration, while allowing the CDQS service set-up to be optimised, as we learn more about the sort of queries that people want to run.

One further desideratum I would like to suggest is that (IMO) it would be very useful if the CDQS service also had access, ideally as triples, to a representation of the category structure on Commons. I can see a lot of potential interest in querying being done to compare information on how a set of files are categorised (perhaps all the files in a category or group of categories) with CommonsData for that set of files, and what those categories could be understood to represent as described by the statements on sitelinked Wikidata items (where available). Both CommonsData and the Commons categories would benefit from the possibility of easy creation of queries with a view to synchronising information between the two. This would be hugely facilitated if Commons categories could be maintained as RDF entities in the CDQS system, with information on file membership of them; their own category membership; and their sitelinks to Wikidata items.

> if Commons categories could be maintained as RDF entities in the CDQS system

Categories are already in WDQS, see: https://www.mediawiki.org/wiki/Wikidata_Query_Service/Categories

> with information on file membership of them;

This however is not (mostly due to the volume involved and the effort required to keep it updated).

> their sitelinks to Wikidata items.

This is of course part of the data on the relevant items in WDQS.

That's nice. I didn't know that service included Commons.

But to be useful for people trying to sync CommonsData statements on files with their category information (in both directions), the category membership per file would be needed; as would near-live updates; and inclusion in the production CDQS GUI.

Information about categories on wikis other than Commons is probably not as necessary for this purpose; but access to live Commons category information would be exceedingly valuable (IMO).

Tpt subscribed. Jan 10 2019, 9:02 PM

Smalyshev edited projects, added Wikidata-Query-Service; removed Discovery-Search. Jan 29 2019, 7:23 PM

Smalyshev moved this task from Incoming to Watching / Waiting on the Wikidata-Query-Service board. Jan 30 2019, 10:25 AM

Smalyshev moved this task from Watching / Waiting to Need investigation on the Wikidata-Query-Service board.

MarkTraceur edited projects, added Structured Data Engineering (Depicts and other statements on a bicycle); removed Structured Data Engineering. Mar 1 2019, 4:13 PM

• Ramsey-WMF moved this task from Next up to Desired epics on the Multimedia board. Mar 8 2019, 5:16 PM

Cparle mentioned this in T191633: Implement searching of 'depicts' on commons. Mar 27 2019, 10:56 AM

Jheald added a parent task: T141602: [Objective Fiscal 19-20/Q4] (9) Provide a Proof of Concept SPARQL endpoint in support of SDoC project. Apr 25 2019, 9:15 AM

Lucas_Werkmeister_WMDE subscribed. Apr 25 2019, 3:24 PM

Smalyshev closed this task as a duplicate of T221921: Provision search endpoint for SDC. Requirements from Product Team. May 2 2019, 9:40 PM