
Investigate & design public API, possibly using MQL
Closed, Resolved, Public

Description

We need to investigate & design a public query API. Goals for this API are:

  • Easy to build queries with a variety of clients (probably JSON) without dealing with low-level issues like escaping.
  • Only expose queries with reasonable resource consumption (CPU and IO, < 1 minute run time?). If this cannot be statically guaranteed, have a way to reliably kill long-running queries, ideally before they consume a lot of resources.
  • Minimal per-request overheads in the API layer (low single-digit ms)

The Freebase API with MQL is already in use and seems to be useful to existing consumers. Some links:

With the announced integration of Freebase data into Wikidata, maintaining the same API sounds attractive, and could potentially let us reuse existing tools like the query editor.

Another option could be to use an existing higher-level query language like SPARQL, which also has a JSON serialization.

Event Timeline

GWicke raised the priority of this task from to Needs Triage.
GWicke updated the task description.
GWicke added a project: Services.
GWicke set Security to None.
GWicke removed a subscriber: Aklapper.
GWicke subscribed.

Are we considering supporting the WDQ API mini-language as an option for queries, or is it not a viable option?

Would exposing Gremlin in some form publicly be an option?

@JanZerebecki Gremlin is basically shell access, since it can run arbitrary Java code. So we can have it for internal purposes, but we need a frontend API, since we probably won't be comfortable giving everybody shell access, and sanitizing Gremlin would probably be more complex than handling any simpler language.

Are we considering supporting the WDQ API mini-language as an option for queries, or is it not a viable option?

The problem I see with the WDQ language is the need to perform error-prone custom quoting and serialization in clients. For numeric identifiers this might not be so bad, but strings, dates, etc. will need escaping. JSON basically avoids these issues by letting the JSON serializer / parser deal with them.

We could still compile a more human-friendly DSL like the WDQ language to JSON, but forcing every client to deal with custom escaping and serialization issues does not sound like a great idea for a public API.
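To illustrate the escaping point, here is a minimal Python sketch. The query shape and the property names ("type", "name", "date_of_birth>") are made up for illustration and are not taken from any actual MQL or WDQ schema. The only point is that the client never hand-quotes a value; the JSON serializer does it.

```python
import json


def build_query(name, born_after):
    """Serialize a hypothetical JSON query; property names are illustrative only."""
    query = [{
        "type": "person",                # assumed type name
        "name": name,                    # arbitrary user-supplied string
        "date_of_birth>": born_after,    # assumed range-style constraint key
    }]
    # json.dumps escapes quotes, backslashes and control characters for us.
    return json.dumps(query)


# A value that would need careful manual quoting in a custom text syntax.
tricky_name = 'Dwayne "The Rock" Johnson \\ [AND] \n'
payload = build_query(tricky_name, "1970-01-01")

# Round-tripping through the parser returns exactly the original string.
decoded = json.loads(payload)
```

A human-friendly DSL could still be compiled to this JSON form, but only the compiler would need to know the DSL's quoting rules.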

Agreed, a format like JSON would be much better since everybody knows how to handle it.

How about using the Sirent JSON format? It aims at giving a complete description of a resource, complete with URIs. Granted, it's more verbose than other solutions, but because of its completeness it could also be used to create different pre- and post-parsers for other formats (MQL, WDQ, etc.).

Has any thought been put into how to combine this with the plans the Wikidata team has for queries?

@JeroenDeDauw: I think it's possible to implement a simple query interface (such as the current wikibase API, as far as I understand it) on top of a more powerful one such as the one we are discussing here. The reverse is not necessarily true.

What is your view on this?

Manybubbles moved this task from Incoming to In Dev/Progress on the Wikidata-Query-Service board.
Manybubbles subscribed.

I think it's probably worth implementing a copy of Wikidata Query Service's API as well.

@GWicke: Indeed. I would however not extend this to "every solution that is a good fit for the WMF use case will also be a good fit for those that are on the Wikidata team's roadmap". Keep in mind that the Wikidata team also wants to have complex queries. As far as I can tell, decisions are currently made based on the use case the WMF has, rather than also taking the Wikidata plans into account. I'm getting this impression because I'm not seeing many questions going from your team's side to the Wikidata one. Perhaps I'm missing something?

My experience with MQL has been really lovely, not just for serialization and the symmetry between query and result, but also for creating and manipulating queries on the fly. It's amazing being able to do:

my_query['newfilter']= {blah:"blah"}

I'm sure any solution will be great, but being able to do that is sweet.
thanks

As far as I can tell, decisions are currently made based on the use case the WMF has, rather than also taking the Wikidata plans into account. I'm getting this impression because I'm not seeing many questions going from your team's side to the Wikidata one. Perhaps I'm missing something?

This ticket is meant to facilitate a discussion, so I'm hoping for broad participation of all interested parties, especially the Wikidata team.

It would be very helpful if you could point out specific use cases that we should consider & aren't covered yet.

Some other examples of query languages that were mentioned: Ask (from SMW), Facebook (Graph API or Query Language), Neo4j (Cypher; TinkerPop 3 has similar functionality).

@JanZerebecki: Are there any in that list that you would prefer over MQL? If so, for which reasons?

No, just a list of things that were brought up, when people talked about querying Wikidata. I don't like Ask. Only ever used Ask and Cypher. I think the way Cypher goes about the problem is neat to work with (query by example for graphs). I find MQL a bit unintuitive.

Is MQL a superset of wqd.wmflabs.org ?
How would a recursive query (WITH RECURSIVE in SQL) / a query including transitive properties look in MQL? Example: everything that is directly or indirectly/recursively a subclass of organization. This means that if there is a subclass-of chain like charitable org. -> nonprofit org. -> organization, and charitable org. is not directly a subclass of organization, I would still get charitable org. as a result for this query.
What would this look like for "instance of" and "subclass of"?

No, just a list of things that were brought up, when people talked about querying Wikidata. I don't like Ask. Only ever used Ask and Cypher. I think the way Cypher goes about the problem is neat to work with (query by example for graphs). I find MQL a bit unintuitive.

At first sight, Cypher looks like it is relatively closely tied to a particular way of modeling things in a graph (as properties vs. edges), which might not be ideal if we'd like to play with indexes, data models and query optimization. It would also require custom parsing / escaping / client libraries.

Is MQL a superset of wqd.wmflabs.org ?

Functionality-wise I believe that it covers pretty much everything WDQ does, except for the AROUND predicate. In MQL as exposed by Freebase, the best you can currently do is a bounding box with range queries on lat/lon. Adding an AROUND-like operator should not be too hard though. The support around "optional": "forbidden" vs. "prop!=" might be a bit more powerful in MQL.
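As a rough sketch of the bounding-box workaround, the Python below builds range constraints in the MQL style of appending a comparison operator to the property key. The type and property names ("/location/location", "geolocation", "latitude", "longitude") and the helper name are assumptions for illustration; the helper also ignores wrap-around at the antimeridian and clamping at the poles.

```python
import json


def bounding_box_filter(lat, lon, half_side_deg):
    """Return range constraints for a square box centred on (lat, lon).

    Hypothetical helper: keys follow the "property + operator" convention,
    but the property names themselves are assumed, not verified.
    """
    return {
        "latitude>=": lat - half_side_deg,
        "latitude<=": lat + half_side_deg,
        "longitude>=": lon - half_side_deg,
        "longitude<=": lon + half_side_deg,
    }


box = bounding_box_filter(52.5, 13.5, 0.5)

# None asks for the value to be filled in, mirroring query/result symmetry.
query = [{
    "type": "/location/location",   # assumed type id
    "name": None,
    "geolocation": box,             # assumed property name
}]

payload = json.dumps(query)
```

A true AROUND operator would filter by distance from the centre point, so a box like this returns extra matches in the corners that the client would have to discard.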

How would a recursive query (WITH RECURSIVE in SQL) / a query including transitive properties look in MQL? Example: everything that is directly or indirectly/recursively a subclass of organization.

I'm pretty sure that we'll want to flatten the common use cases for recursive queries into indexes:

  • instance of / type hierarchy ('type' in freebase)
  • geolocation containment
  • genealogical information
  • taxonomic information

With this done, you can directly ask for the parent class & automatically match all items that are directly in a sub-class. Similarly, for geolocation you can transitively match anything that's within some region, even if it is defined to be within a sub-region.
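One way to picture this flattening is a precomputed ancestor index, sketched below in Python. This is only an illustration of the idea under simple assumptions (in-memory dictionaries, one class per item, hypothetical function names); it says nothing about how the actual service would store or update such an index.

```python
def build_ancestor_index(subclass_of):
    """Map each class to the set of all its direct and indirect superclasses.

    subclass_of: dict of class -> list of direct superclasses.
    Uses an explicit stack and a visited set, so cycles in the data terminate.
    """
    index = {}
    for cls in subclass_of:
        seen = set()
        stack = list(subclass_of[cls])
        while stack:
            parent = stack.pop()
            if parent in seen:
                continue
            seen.add(parent)
            stack.extend(subclass_of.get(parent, ()))
        index[cls] = seen
    return index


def items_in_class(instance_of, ancestor_index, target):
    """Items whose class is `target` or any direct/indirect subclass of it."""
    return sorted(
        item for item, cls in instance_of.items()
        if cls == target or target in ancestor_index.get(cls, set())
    )


# Example data following the chain discussed in this thread; item names are made up.
subclass_of = {
    "charitable organization": ["nonprofit organization"],
    "nonprofit organization": ["organization"],
}
instance_of = {
    "Example Charity": "charitable organization",
    "Example Corp": "organization",
    "Example City": "city",
}

index = build_ancestor_index(subclass_of)
matches = items_in_class(instance_of, index, "organization")
```

With the index built ahead of time, the query itself is a single lookup per item rather than a traversal of unbounded depth, which fits the goal of predictable resource consumption.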

MQL does support following edges, but encourages a limited depth of this traversal by requiring the query to spell out the exact structure. It does not allow queries with unspecified traversal depth.
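To make the fixed-depth point concrete, the Python sketch below builds a nested query in which every hop is spelled out. The property name "subclass_of" and the helper are hypothetical and not part of the Freebase schema; the sketch only shows that the traversal depth is fixed by the shape of the query.

```python
def nested_query(prop, depth, leaf):
    """Build a query that follows `prop` exactly `depth` times, ending at `leaf`.

    Hypothetical helper: each level of nesting corresponds to one edge hop.
    """
    if depth < 0:
        raise ValueError("depth must be non-negative")
    query = dict(leaf)
    for _ in range(depth):
        # Wrap the current query one level deeper; None asks for the name.
        query = {"name": None, prop: query}
    return query


two_hops = nested_query("subclass_of", 2, {"name": "organization"})

# Covering "up to N hops" means issuing one query per depth.
up_to_two = [nested_query("subclass_of", d, {"name": "organization"}) for d in range(3)]
```

Because a client has to enumerate depths explicitly, the cost of a query is bounded by its size, which is what discourages unbounded traversals.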

I would really appreciate it if there could be something like MQL for accessing Wikidata.
Since Google will get rid of Freebase, and Wikidata is the official successor for the data, many people would appreciate having the same, or at least a similar, query language.

Also: the best part about MQL was the online query editor. This made it really easy to use it.

https://www.freebase.com/query

daniel claimed this task.

We have a SPARQL endpoint now.

If there is demand for an alternative query interface, let's discuss that in a new ticket.