
Investigate storing commons data in BlazeGraph
Closed, Duplicate (Public)

Description

If we need to use the WDQS (or something analogous to it) for some of our searches (see T194185, T194245, T194255), then we need to store Commons data in BlazeGraph.

Wikidata already does this, so perhaps we could copy their approach?

Other considerations:

  • Do we need another instance of BlazeGraph for Commons, i.e. a Commons Data Query Service? Or would it be appropriate to use the WDQS?
  • We need to decide how to store data about files (as opposed to Wikidata items)
  • We'd probably need a mechanism to back-populate BlazeGraph with existing Commons data

Event Timeline

Cparle triaged this task as Medium priority. May 10 2018, 3:03 PM
Cparle created this task.
Vvjjkkii renamed this task from Investigate storing commons data in BlazeGraph to a7caaaaaaa. Jul 1 2018, 1:10 AM
Vvjjkkii removed Cparle as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from a7caaaaaaa to Investigate storing commons data in BlazeGraph. Jul 2 2018, 5:56 AM
CommunityTechBot assigned this task to Cparle.
CommunityTechBot lowered the priority of this task from High to Medium.
CommunityTechBot updated the task description.
CommunityTechBot added a subscriber: Aklapper.

Well of course you're going to copy all the CommonsData statements to Blazegraph.

The community will have a fit if you don't make provision for us to run our own SPARQL queries against the data. See T141602.

But see also some of the issues noted on T191633#4425281 and the following comment.

Hehe, very well then @Jheald, I'll make sure it happens.

I am not sure the existing service can handle double the load on the existing infrastructure; we'd need to evaluate this. Setting up a separate service is certainly an option, but there might be issues with cross-querying: federation is available, of course, but its performance would be nowhere near that of querying a single common database. So we need to figure out how we would query the data and how much data we plan to handle, then evaluate from there, maybe by running some tests.
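To make the federation option concrete, here is a minimal sketch of what a cross-service query might look like if Commons data lived in its own Blazegraph instance. It only builds the query text; nothing is sent. The `SERVICE` clause is standard SPARQL 1.1 federation and the WDQS endpoint URL is the public one, but the assumption that Commons files would carry "depicts" (P180) statements under the `wdt:` prefix is mine and is not settled anywhere in this task.

```python
from string import Template

# Public WDQS endpoint. The query itself is assumed to run on a separate
# Commons service and to reach WDQS through SPARQL 1.1 federation.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

FEDERATED_QUERY = Template("""\
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?file ?item WHERE {
  # Local part (assumed modelling): "depicts" statements on Commons files.
  ?file wdt:P180 ?item .
  # Remote part: keep only items that are instances of the given class.
  SERVICE <$endpoint> {
    ?item wdt:P31 wd:$cls .
  }
}
LIMIT $limit
""")


def build_federated_query(cls: str, limit: int = 10) -> str:
    """Return the query text for files depicting instances of `cls`."""
    return FEDERATED_QUERY.substitute(endpoint=WDQS_ENDPOINT, cls=cls, limit=limit)
```

Every `?item` binding has to be matched against the remote service, which is the performance concern raised above; with both datasets in one instance the `SERVICE` block would become an ordinary local triple pattern.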

Note also that for searches there is the Elasticsearch index, and if we are talking about a search scenario, we may want to consider it first, since all searches are done this way now. We could index "depicts" statements and the like in the Elasticsearch index and work from there. Again, it all depends on which queries we want to support.
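As a rough illustration of the Elasticsearch route, the sketch below builds a request body that filters on an indexed "depicts" statement. The field name `statement_keywords` and the `P180=Q146` value format are assumptions made for the example, not something this task specifies; the `bool`/`filter`/`term` structure is ordinary Elasticsearch query DSL.

```python
def depicts_query(item_id: str, size: int = 20) -> dict:
    """Build a search body matching files that depict `item_id`."""
    return {
        "size": size,
        "query": {
            "bool": {
                "filter": [
                    # Assumed keyword field holding "property=value" strings.
                    {"term": {"statement_keywords": f"P180={item_id}"}}
                ]
            }
        },
    }
```

A filter like this answers "which files depict X" cheaply, but it cannot express joins across items, which is where a SPARQL service would still be needed.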

Hey Stas

This is an old ticket, and yes, we are now using Elasticsearch for searching everything.

I still think we'll need to store structured data about MediaInfo items in BlazeGraph, however, so that power users can run SPARQL queries.

Presumably (as the original ticket suggests), one option would be to set up a Commons Data Query Service (CDQS) as a distinct endpoint on a new server that had both Wikidata and CommonsData installed in a Blazegraph instance locally. This would allow the existing WDQS service to continue without alteration, while allowing the CDQS service set-up to be optimised, as we learn more about the sort of queries that people want to run.
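From a power user's side, a distinct CDQS endpoint would be queried like any other SPARQL service. The sketch below prepares, but does not send, such a request. The endpoint URL is a placeholder I made up; passing the query in a `query` parameter and asking for `application/sparql-results+json` follows the standard SPARQL protocol.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder URL: no CDQS endpoint is defined in this task.
CDQS_ENDPOINT = "https://cdqs.example.org/sparql"


def build_sparql_request(query: str, endpoint: str = CDQS_ENDPOINT) -> Request:
    """Prepare a SPARQL-protocol GET request without sending it."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    return Request(url, headers={"Accept": "application/sparql-results+json"})
```

Passing the returned object to `urllib.request.urlopen` would execute the query against whatever endpoint is eventually set up.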

One further desideratum I would like to suggest is that (IMO) it would be very useful if the CDQS service also had access, ideally as triples, to a representation of the category structure on Commons. I can see a lot of potential interest in queries that compare how a set of files is categorised (perhaps all the files in a category or group of categories) with the CommonsData for that set of files, and with what those categories could be understood to represent, as described by the statements on sitelinked Wikidata items (where available). Both CommonsData and the Commons categories would benefit from the possibility of easy creation of queries with a view to synchronising information between the two. This would be hugely facilitated if Commons categories could be maintained as RDF entities in the CDQS system, with information on file membership of them; their own category membership; and their sitelinks to Wikidata items.
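The kind of synchronisation check described above can be sketched with a toy in-memory triple set. The predicate names (`inCategory`, `depicts`, `sitelinkedItem`) and the file names are invented for illustration and do not correspond to any agreed RDF mapping.

```python
# Tiny in-memory triple set; every identifier here is illustrative only.
TRIPLES = {
    ("File:Cat_on_sofa.jpg", "inCategory", "Category:Cats"),
    ("File:Cat_on_sofa.jpg", "depicts", "Q146"),
    ("File:Kitten.jpg", "inCategory", "Category:Cats"),
    ("Category:Cats", "sitelinkedItem", "Q146"),
}


def objects(subject: str, predicate: str) -> set:
    """All objects o such that (subject, predicate, o) is in the store."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def files_missing_depicts(category: str) -> list:
    """Files in `category` with no depicts statement for the category's item."""
    items = objects(category, "sitelinkedItem")
    members = {s for s, p, o in TRIPLES if p == "inCategory" and o == category}
    return sorted(f for f in members if not items <= objects(f, "depicts"))
```

With the sample data, `files_missing_depicts("Category:Cats")` returns `["File:Kitten.jpg"]`: the file is in the category but has no matching statement. A real service would express the same join in SPARQL, which requires per-file category membership to be available as triples.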

> if Commons categories could be maintained as RDF entities in the CDQS system

Categories are already in WDQS, see: https://www.mediawiki.org/wiki/Wikidata_Query_Service/Categories

> with information on file membership of them;

This, however, is not included (mostly due to the volume involved and the effort required to keep it updated).

> their sitelinks to Wikidata items.

This is of course part of the data on the relevant items in WDQS.

That's nice. I didn't know that service included Commons.

But to be useful for people trying to sync CommonsData statements on files with their category information (in both directions), the category membership per file would be needed, as would near-live updates and inclusion in the production CDQS GUI.

Information about categories on wikis other than Commons is probably not as necessary for this purpose, but access to live Commons category information would be exceedingly valuable (IMO).