
Lack of freeform external access to MediaWiki data is a limitation
Closed, DeclinedPublic

Description

Problem
MediaWiki's Action API does not allow complex queries beyond the parameters that are provided. For instance, you cannot query by other fields, change the result order, or perform aggregation queries. This limitation creates a huge number of issues, as clients must request data in inefficient ways or implement work-arounds.

Examples:
T180153: API:Usercontribs should allow clients to order by timestamp
T180080: MediaWiki API does not support parallel pagination
T180156: Interaction Timeline should make simpler requests to determine if pages have been edited
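As a hedged sketch of the kind of client-side work-around these tasks describe: because API:Usercontribs cannot order results by arbitrary fields, a client has to page through everything before it can sort. The record shapes and function names below are illustrative stand-ins, not the real API response format.

```python
# Illustrative only: simulates paging through usercontribs-style batches
# and sorting client-side, because the API cannot do it server-side.

def fetch_batches():
    """Stand-in for repeated API:Usercontribs calls (continuation paging)."""
    yield [{"title": "Foo", "timestamp": "2017-11-01T12:00:00Z"},
           {"title": "Bar", "timestamp": "2017-11-03T09:30:00Z"}]
    yield [{"title": "Baz", "timestamp": "2017-10-21T18:45:00Z"}]

def contributions_ordered_by_timestamp():
    # The client must materialise *all* batches before it can order them --
    # exactly the inefficiency the tasks above complain about.
    contribs = [c for batch in fetch_batches() for c in batch]
    return sorted(contribs, key=lambda c: c["timestamp"])

ordered = contributions_ordered_by_timestamp()
```

A server-side `ORDER BY` would return the first page immediately; the client-side version cannot emit anything until the final batch arrives.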

Example Queries

  • A Paginated response of all revisions (as well as deleted revisions that the requesting user has access to) that are from pages that User:Apples and User:Bananas have ever edited.
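Expressed over an in-memory stand-in for the revision data (field names here are invented for illustration), the example query above might look like:

```python
# Toy revision records standing in for the revision/archive tables.
revisions = [
    {"page": "Foo", "user": "Apples", "rev_id": 1},
    {"page": "Foo", "user": "Carrots", "rev_id": 2},
    {"page": "Bar", "user": "Bananas", "rev_id": 3},
    {"page": "Baz", "user": "Carrots", "rev_id": 4},
]

def revisions_on_pages_edited_by(users, data, offset=0, limit=50):
    """All revisions on any page that one of `users` has ever edited,
    paginated. (Whether "Apples and Bananas" means the union or the
    intersection of their edited pages is part of what the task leaves
    open; this sketch uses the union.)"""
    pages = {r["page"] for r in data if r["user"] in users}
    matching = [r for r in data if r["page"] in pages]
    return matching[offset:offset + limit]

result = revisions_on_pages_edited_by({"Apples", "Bananas"}, revisions)
```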

Solution
Implement an API specification like SPARQL (spec) or GraphQL (spec, implementations) that allows complex data queries. This would also resolve T180096.

Work Around
The work-around is to Create Your Own API™ on Toolforge that queries the Replicas. This is an anti-pattern because it forces tools to be created simply because the API is not flexible enough for the use case. It also duplicates work, as multiple tools may create similar APIs for similar requests.

If you need data that is not available in the Replicas (i.e. data that requires authentication) then there is no work-around to this issue.

Examples:
T184146: Sub-epic ⚡️ : Create an API service for InteractionTimeline

Event Timeline

The typical solution when some business logic or data is found to be missing from the Action API is to implement a new API module that exposes the needed data and enforces the required access controls.

Exposing the entire data model via some external interface would really just be the equivalent of adding an SQL API endpoint. Securing a general-purpose data query language is a very difficult problem. If the data that you are currently thinking of is not available on the Wiki Replica servers, it has some security concern (unless it is raw revision data, in which case it's just too big to replicate).

Can you refine your problem description a bit to help us understand what sort of free form query interface you are imagining? Starting with some concrete examples may help the rest of us follow your train of logic that leads to SPARQL/GraphQL being the best solution.

bd808 renamed this task from MediaWiki Action API does not allow complex queries to Lack of freeform external access to MediaWiki data is a limitation.Nov 23 2017, 6:34 PM

Can you refine your problem description a bit to help us understand what sort of free form query interface you are imagining? Starting with some concrete examples may help the rest of us follow your train of logic that leads to SPARQL/GraphQL being the best solution.

Done. I'm not sure if SPARQL/GraphQL is the best solution; I'm totally open to other ideas. I just think that, in most cases, Create Your Own API™ is an anti-pattern that we should try to avoid (if possible). Other organizations and companies surely have the same problem(s) we have... how have they solved them?

The biggest restriction that the MediaWiki API faces is that it is limited by whatever the underlying MySQL (or sqlite/postgres/etc.) database can serve in a fast and indexed way. Are you suggesting that we write a layer that turns sparql/graphql queries into something to run against the MediaWiki database? Or would it be more like WDQS where the data is copied into something that is designed for sparql querying (blazegraph)?

The biggest restriction that the MediaWiki API faces is that it is limited by whatever the underlying MySQL (or sqlite/postgres/etc.) database can serve in a fast and indexed way. Are you suggesting that we write a layer that turns sparql/graphql queries into something to run against the MediaWiki database? Or would it be more like WDQS where the data is copied into something that is designed for sparql querying (blazegraph)?

Uhh... I would prefer the former. As an example, there is a Drupal plugin that does exactly that (it relies on youshido/graphql). They also have to deal with access restrictions (on a per-object and per-field level).

I think it's completely fine if we limit the filters, groups, etc. to only the fields that are indexed; if this becomes a problem for some reason we can revisit it. I also think it's fine if we do not deal with mutations (i.e. this would be read-only for now).
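A minimal sketch of that "only indexed fields" restriction, assuming a translator layer between a GraphQL-ish argument set and parameterised SQL (the table and column names are illustrative, not a real schema whitelist):

```python
# Hypothetical translator: accepts filter/order parameters only when they
# appear on a whitelist of indexed columns (read-only, no mutations).
INDEXED_FIELDS = {"rev_timestamp", "rev_actor", "rev_page"}

def build_query(filters, order_by=None):
    candidates = list(filters) + ([order_by] if order_by else [])
    for field in candidates:
        if field not in INDEXED_FIELDS:
            raise ValueError(f"cannot filter/order on unindexed field: {field}")
    where = " AND ".join(f"{f} = %s" for f in sorted(filters))
    sql = f"SELECT * FROM revision WHERE {where}"
    if order_by:
        sql += f" ORDER BY {order_by}"
    return sql, [filters[f] for f in sorted(filters)]

sql, params = build_query({"rev_actor": 42}, order_by="rev_timestamp")
```

Rejecting unindexed fields up front keeps every generated query on a path the database can serve quickly, at the cost of re-introducing some of the rigidity the task complains about.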

However, if it's better to copy the data to some other store, then that's also fine, as the clients won't be aware of that implementation detail anyway.

As a task for MediaWiki-Action-API, I'm going to decline this.

As Lego already pointed out, the action API is heavily constrained by what's efficiently possible in MariaDB with the existing database schema and what changes to that schema are possible and reasonable. It's impossible to make efficient queries on arbitrary fields or to sort results by arbitrary fields, or to do arbitrary aggregations. Thus, there's no way that a request to allow such things can possibly be fulfilled within the existing constraints.

That's not to say that the action API as it currently exists offers access to all the data that it is possible to efficiently access in all the ways that it's possible to efficiently access it. I'm happy to help add functionality to the action API (in core or via extensions) when efficient SQL queries exist and a sane API can be provided for it, but where that's not obvious the onus is on you to provide those efficient SQL queries and to suggest a sane API if you want your task resolved. A general task like this is not going to be productive along that line, instead file tasks for specific needed functionality.

A "filter" that could limit arbitrary queries does not seem very feasible. Maybe it would be possible to make a restricted schema for GraphQL that only allows you to express queries that are well indexed, but if so you'd probably find that it would have to have many of the same restrictions you're complaining about here. And GraphQL couldn't replace the entirety of the existing API, only the query module, since it's a query language rather than a language for expressing write operations. In a more general query language, or with a less restricted GraphQL schema, such a filter seems like it would be a major AI research project before it would get to something suitable for the MediaWiki-Action-API tag, and I don't see a point in keeping this task open pending that research.

If you want to propose copying all of the data into some other storage engine, or to propose replacing MariaDB with some other storage engine, you're free to do so. But it's not going to go anywhere unless some team is prepared to devote the resources necessary to planning, development, hardware resourcing, and deployment of such an epic undertaking. It seems you're now using T180096 for that proposal, so using this task for it as well would be a duplicate.

dbarratt edited projects, added TechCom-RFC; removed MediaWiki-Action-API.

And GraphQL couldn't replace the entirety of the existing API, only the query module, since it's a query language rather than a language for expressing write operations.

Just to clarify, the GraphQL specification does specify how to express write operations:
http://graphql.org/learn/queries/#mutations

I will move this task to TechCom-RFC, thank you for your detailed response.

There are three big problems with allowing freeform queries:

  1. sandboxing to prevent (intentional or unintentional) denial of service through resource hogging
  2. schema mapping for stability and protection of private data
  3. no cacheability

A free query interface is really not an API, it needs quite different engineering. An API is deliberately narrow for ease of use, stability, and to allow optimization. So a free form query interface should not be seen as a replacement or alternative to the API, it's really complementary.
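For the sandboxing concern (1), one common mitigation is a static depth/cost check on the parsed query before it ever reaches the database. A toy sketch, with an entirely invented cost model:

```python
# Toy static analysis: reject a nested query whose estimated cost is too
# high *before* execution (addresses the DoS concern, point 1 above).
MAX_DEPTH = 3
MAX_COST = 100

def analyse(node, depth=1):
    """`node` is a nested dict standing in for a parsed query AST.
    Returns the estimated cost; raises if depth or cost limits are hit."""
    if depth > MAX_DEPTH:
        raise ValueError("query too deeply nested")
    cost = node.get("cost", 1)
    for child in node.get("children", []):
        cost += analyse(child, depth + 1)
    if cost > MAX_COST:
        raise ValueError("estimated query cost too high")
    return cost

ok = analyse({"cost": 2, "children": [{"cost": 5}]})
```

Static limits like these are exactly the "deliberately narrow" engineering an API does for free, which is why a free-form interface needs them bolted on.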

We already have two systems that allow free form queries against some of our data:

  1. the database replicas on Toolforge. We have filtering there for privacy, and it is isolated from the production system to prevent DoS. But users of Toolforge still compete for time and space there, and performance is an issue. The major problem, though, is schema stability: we are exposing our DB schema directly, making it a public interface instead of an internal implementation detail.
  2. Wikidata Query Service. Here we use a complex mapping from our internal model to an RDF model to provide stability, in addition to filtering out any private information and such. I think this is the better approach conceptually, and it also makes it easy to support things like GraphQL on top of SPARQL via TinkerPop. But it requires the design, implementation and maintenance of a complex mapping.

@Smalyshev is currently working on exposing the category structure via BlazeGraph. This seems conceptually similar to what you want. I suggest re-framing this proposal to ask for the metadata on pages, revisions, and perhaps recentchanges to be made available in BlazeGraph.
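The mapping approach in (2) can be sketched as a small transform from internal rows to RDF-style triples, with the privacy/stability filter sitting between the internal schema and the export. The predicate URIs and field names below are made up for illustration, not the real WDQS vocabulary:

```python
# Illustrative export step: internal page rows -> N-Triples-style lines,
# filtering private fields before anything leaves the internal schema.
PUBLIC_FIELDS = {"title", "category"}  # everything else stays internal

def to_triples(rows):
    lines = []
    for row in rows:
        subj = f"<https://example.org/page/{row['title']}>"
        for field, value in row.items():
            # The subject already encodes the title; export other
            # whitelisted fields as predicate/object pairs.
            if field in PUBLIC_FIELDS and field != "title":
                lines.append(
                    f'{subj} <https://example.org/prop/{field}> "{value}" .')
    return lines

triples = to_triples([{"title": "Foo", "category": "Physics",
                       "user_ip": "10.0.0.1"}])
```

Because the mapping is an explicit allow-list, a schema change or a private column (like the `user_ip` stand-in here) never leaks into the public model.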

IMO this is way too vague to be useful for an RfC discussion. Things that I would like to see in such a proposal:

  • Use cases. It seems like any positive outcome of this RfC would result in a major engineering undertaking; that's not going to happen just because the resulting API "feels nicer". What Wikimedia projects would benefit from the change, and how much?
  • A sketch of what the new interface would look like. For example, are we talking about handling the current API response as a graph and using GraphQL on that (i.e. replacing the various *prop parameters), or about querying the wiki DB as a whole?
  • Feasibility. Handwaving towards a bunch of acronyms is not a good substitute. SPARQL is a query language for semantic triples. What would process those queries, and how? How would GraphQL be translated to efficient SQL queries? (Is there any largish website using GraphQL with an SQL backend?)

Allowing arbitrary complex queries does not seem appropriate as a mediawiki core interface. Most people won't want their site potentially DOSed.

There are people who may want these sorts of query features, where the users are trusted. For those users it may make sense to make a MediaWiki extension that provides that (if someone wants to write such a thing), but it shouldn't be in core imo.

I'm not sure what "freeform external access" even means. If I understood correctly, the proposer would like an API module with a "Google-like" search, where you input an arbitrary set of keywords in a black box and you get some unpredictable output which you're hopefully going to like or do something with.

If you want SPARQL access to MediaWiki data, you can use this project: https://www.mediawiki.org/wiki/MW2SPARQL
though whether or not it's better depends on the task.

It seems to me the best way for this RFC to progress would be to focus on a solution that is already partially available: querying of metadata via SPARQL, perhaps using the model proposed in https://www.mediawiki.org/wiki/MW2SPARQL, but via an export to BlazeGraph, similar to what we do for Wikidata. Allowing free-form queries against the production databases themselves does not seem practical.

Allowing arbitrary complex queries does not seem appropriate as a mediawiki core interface. Most people won't want their site potentially DOSed.

Once upon a time this was a core MediaWiki feature ;-)

I am concerned by this task mentioning GraphQL and SPARQL as tools aiming to do the same thing.

  • SPARQL is great at doing complex selections (the WHERE clause of the query) but is bad for data retrieval. It can return a table or an RDF graph; for graphs, specifying them is complex and verbose, and client-side processing is not easy.
  • GraphQL is good at data retrieval (the query specifies the data tree that should be returned, e.g. [1]) but is limited on the selection side. The selection features work with a set of defined parameters, just like the MW Action API does (doc: [2]). So it is easy to limit the complexity of request processing.

[1] https://tools.wmflabs.org/tptools/wdql.html?query=%7B%0A%20%20champollion%3A%20item(id%3A%20%22Q260%22)%20%7B%0A%20%20%20%20id%0A%20%20%20%20en_label%3A%20label(language%3A%20%22en%22)%20%7B%0A%20%20%20%20%20%20value%0A%20%20%20%20%20%20language%0A%20%20%20%20%7D%0A%20%20%20%20fr_desc%3A%20description(language%3A%20%22fr%22)%20%7B%0A%20%20%20%20%20%20value%0A%20%20%20%20%7D%0A%20%20%20%20fr_aliases%3A%20aliases(languages%3A%20%5B%22fr%22%5D)%20%7B%0A%20%20%20%20%20%20value%0A%20%20%20%20%7D%0A%20%20%20%20ru_article%3A%20sitelink(site%3A%20%22ruwiki%22)%20%7B%0A%20%20%20%20%20%20site%0A%20%20%20%20%20%20title%0A%20%20%20%20%20%20badges%20%7B%0A%20%20%20%20%20%20%20%20id%0A%20%20%20%20%20%20%20%20label(language%3A%20%22en%22)%20%7B%0A%20%20%20%20%20%20%20%20%20%20value%0A%20%20%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20%7D%0A%20%20%20%20%7D%0A%20%20%7D%0A%7D%0A
[2] http://graphql.org/learn/queries/#arguments

You're right. I did some more reading on GraphQL and SPARQL and I understand the problems.

While I think it would be a good idea to wrap the Action API in GraphQL (to reduce the payload size and the number of requests), GraphQL is still based on calling "actions" with named parameters. This is not unlike the Action API itself, so wrapping it in GraphQL seems like it would be somewhat "easy".
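That "wrap the Action API" idea can be sketched as a resolver that forwards named parameters to an action and then prunes the response to the fields the query selected, which is where the payload-size saving comes from. The stubbed response and field names below are illustrative, not the real usercontribs format:

```python
# Hypothetical GraphQL-over-Action-API wrapper: the same named parameters
# underneath, but the client only receives the fields it selected.
def call_action_api(action, **params):
    """Stub for an Action API request; a real wrapper would do HTTP here."""
    return [{"title": "Foo", "timestamp": "2017-11-01T12:00:00Z",
             "comment": "fix typo", "size": 1234}]

def resolve(action, params, selected_fields):
    rows = call_action_api(action, **params)
    # Prune the response tree to the selection set, GraphQL-style.
    return [{k: v for k, v in row.items() if k in selected_fields}
            for row in rows]

result = resolve("query", {"list": "usercontribs"}, {"title", "timestamp"})
```

Note this only trims what the underlying action already returns; it does not add any new selection power, which is exactly why it doesn't solve the original problem.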

However, that does not solve the problem here. To resolve that, a new parameter would have to be added to the action (or a new action created). I created T186556: Add a parameter Usercontribs API to limit results to contributions to pages that a list of users have edited to discuss adding that parameter.