
Support POST for SPARQL query endpoint
Closed, ResolvedPublic3 Estimated Story Points

Description

While the SPARQL endpoint for Blazegraph and WDQS supports only GET requests, many SPARQL query tools (e.g. http://openuplabs.tso.co.uk/demos/sparqleditor) insist on sending only POST requests. We may want to consider converting POST to GET in nginx or Varnish if possible, in order to serve such tools, or use a redirect.

See also: https://en.wikipedia.org/wiki/Post/Redirect/Get
https://en.wikipedia.org/wiki/HTTP_303

Event Timeline

Smalyshev claimed this task.
Smalyshev raised the priority of this task from to Medium.
Smalyshev updated the task description.
Smalyshev subscribed.
Restricted Application added a subscriber: Aklapper.

Another argument for enabling POST requests is that GET requests / URLs are limited in length. The exact limit varies (see http://stackoverflow.com/questions/417142/what-is-the-maximum-length-of-a-url-in-different-browsers), but often seems to be about 2,000 characters. I have hit this in several real queries in the past.

Change 243883 had a related patch set uploaded (by Smalyshev):
Allow SPARQL endpoint to be queried via POST

https://gerrit.wikimedia.org/r/243883

You can test your endpoint with this tool (View -> Show endpoint bar, then enter the URL of your endpoint).

http://openuplabs.tso.co.uk/demos/sparqleditor

Try to respect the standard (there is again a problem with content negotiation):
http://www.w3.org/TR/sparql11-protocol/#query-operation
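For reference, a minimal sketch of the standard's "query via GET" operation with content negotiation through the Accept header, using Python's requests library; the public WDQS endpoint URL is used purely as an example:

```
import requests

# Query via GET per the SPARQL 1.1 Protocol: the query text goes in the
# "query" parameter, and the result format is negotiated via the Accept
# header. The endpoint URL here is only an example.
r = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 3"},
    headers={"Accept": "application/sparql-results+json"},
)
print(r.status_code, r.headers.get("Content-Type"))
print(r.json()["results"]["bindings"])
```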

Puppet patch added above, which needs review from puppet experts. @Joe, can you help us with this? Note that this is not urgent and can wait until the offsite has ended. :-)

I've tested it in labs and it seems OK; I would like somebody from ops to review whether it's OK for production.

WDQS is an internal tool, so if its API doesn't conform to clients' expectations, then this is a bug in WDQS, yes? Would it not be better to fix that rather than adding mysterious middleware obfuscation?

I'm not a fan of this on a few levels:

  1. As Andrew said above, why not support this directly in WDQS if you have to support it at all? As in, let the POSTs come through the rest of the stack unmolested, and deal with it inside WDQS, instead of a dubious POST-to-GET conversion in the midst of our inbound traffic stack.
  2. We don't want more app-specific hacks in nginx|varnish|apache config than necessary, and this seems unnecessary.
  3. Regardless of which layer handles the issue: it breaks the semantics of GET as read-only and POST as a potential write operation. POSTs as a rule are not cacheable responses, even though most (all?) SPARQL traffic is readonly and probably can be cached with at least some minimal TTL. This also conflicts with our overall strategy for multi-datacenter work, which involves sticky-routing via cookies to the primary DC at the app layer for POST requests, but allowing normal balancing of GET/read requests.

As Andrew said above, why not support this directly in WDQS if you have to support it at all?

Because in Blazegraph, allowing POST means allowing write requests. This is not good for security.

let the POSTs come through the rest of the stack unmolested, and deal with it inside WDQS

This would mean building some middleware layer to filter read requests from write requests, which would create both a lot of unnecessary work and a security issue if the filtering is not perfect. Not allowing POST on the Blazegraph side automatically cuts off all modification requests, since the GET processing engine in Blazegraph only supports read queries. POST processing supports both, so we'd have to work much harder to achieve the same level of security.
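To illustrate why such filtering is fragile, here is a deliberately naive sketch of a keyword-based classifier (purely illustrative, not something proposed for production) and one way it misfires:

```
import re

# Naive attempt to separate SPARQL Update requests from read-only queries by
# keyword matching. Illustration of the problem described above only.
UPDATE_KEYWORDS = re.compile(
    r"\b(INSERT|DELETE|LOAD|CLEAR|DROP|CREATE|MOVE|COPY|ADD)\b", re.IGNORECASE
)

def looks_read_only(query: str) -> bool:
    return UPDATE_KEYWORDS.search(query) is None

# False positive: the keyword appears inside a string literal.
print(looks_read_only('SELECT ?s WHERE { ?s rdfs:label "DELETE me"@en }'))  # False

# And any gap in the pattern (comments, unusual whitespace, keywords added in
# later SPARQL versions) becomes a potential write path - exactly the security
# risk of doing this in a middleware layer.
```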

even though most (all?) SPARQL traffic is readonly and probably can be cached

I don't think it is a good idea to cache SPARQL queries. They are big, they are rarely repeated as-is, and if they are repeated, it is usually because the client expects a new result. At least until we have some setup that repeatedly requests the same data over SPARQL, I don't see much point in caching SPARQL responses.

This also conflicts with our overall strategy for multi-datacenter work,

Given that there is no multi-datacenter setup for WDQS, I'm not sure how this is relevant. Most SPARQL clients for which it would be relevant probably wouldn't have cookie storage mechanisms anyway.

As Andrew said above, why not support this directly in WDQS if you have to support it at all?

Because in Blazegraph, allowing POST means allowing write requests. This is not good for security.

So even in Blazegraph, GET and POST have standard semantic meanings... why is it that clients don't honor this?

even though most (all?) SPARQL traffic is readonly and probably can be cached

I don't think it is a good idea to cache SPARQL queries. They are big, they are rarely repeated as-is, and if they are repeated, it is usually because the client expects a new result. At least until we have some setup that repeatedly requests the same data over SPARQL, I don't see much point in caching SPARQL responses.

It would be wiser to give them some (perhaps minimal) cacheability via Cache-Control, if nothing else as a buffer against simplistic DoS attacks that spam the same query at high rates...

This also conflicts with our overall strategy for multi-datacenter work,

Given that there is no multi-datacenter setup for WDQS, I'm not sure how this is relevant. Most SPARQL clients for which it would be relevant probably wouldn't have cookie storage mechanisms anyway.

There is eventually going to be multi-datacenter for everything in the long term, so yes, that includes WDQS and it is very relevant. Every service is going to have to account for it in the long run in how state is managed, and one of our mechanisms for balancing traffic and avoiding state replication lag is the idea that while a normal (no special cookie) GET request is balanced between the DCs based on geography and load as usual, POST requests are always directed to the primary DC only. Additionally, once a POST request is seen, it sets a short-duration session cookie that maps all of that client's GET requests to the primary DC only as well (so that they don't suffer lag effects in reading the results of their own modifications). If a read-(only|mostly) service uses POST for all of its traffic due to client deficiency, all of that traffic will be stuck on the primary DC only and not benefit from read load balancing.
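In rough pseudocode, the routing rule described above looks like the following; this is only a sketch, the cookie name and datacenter labels are invented, and the real logic lives in the traffic layer rather than in application code:

```
# Rough illustration of the described multi-DC routing rule.
def pick_datacenter(method: str, cookies: dict) -> str:
    if method == "POST":
        return "primary"        # writes always go to the primary DC
    if cookies.get("sticky-dc") == "primary":
        return "primary"        # recent POSTer: read-your-own-writes routing
    return "geo-balanced"       # plain GETs get normal geography/load balancing
```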

Short: Because IE is broken.
Long: The problem here is that while https://tools.ietf.org/html/rfc2616#section-3.2.1 says "The HTTP protocol does not place any a priori limit on the length of a URI.", reality doesn't conform to that. On the server side, our caching layer is AFAIK even in violation of "SHOULD be able to handle URIs of unbounded length if they provide GET-based forms that could generate such URIs" - it probably caps at around 16k. In reality we can probably live with that as the limit on queries for the public query.wikidata.org. But while we might manage to fix all the SPARQL tools, we can't fix all the browsers. Which leaves us with: "Servers ought to be cautious about depending on URI lengths above 255 bytes".

To best serve the listed requirements we probably need generic handling for this in the layer that differentiates between POST and GET in the described multi-data-center way. It would, e.g., know from a certain query argument in the Request-URI that this POST request has GET semantics, with the actual resource specified by the Request-URI together with the request body. Should we then transform the request into a GET request, or handle this special case all the way down the stack? Any better ideas?
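For illustration, such a request might look like the sketch below; the read-only=1 marker is an invented parameter, not something WDQS or the SPARQL protocol defines:

```
import requests

# Hypothetical: a POST that signals via an invented "read-only=1" marker in
# the Request-URI that it should be treated with GET semantics by the traffic
# layer. The endpoint URL is just an example.
query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"
r = requests.post(
    "https://query.wikidata.org/sparql?read-only=1",
    data={"query": query},
    headers={"Accept": "application/sparql-results+json"},
)
```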

Yesterday in IRC it seemed like there was consensus that we don't have any actual (quantifiable) concerns regarding query length at the moment.

SMalyshev: chasemp: I'm not worried too much about query length

The 2008 guidelines seem explicit that GET is the mechanism of choice up to the point where the length of a GET request isn't viable. The 2013 guidelines still indicate the use of POST for queries, but also outline that it isn't the only operation for which POST is necessary:

This protocol is a companion to the use of both SPARQL Update and SPARQL Query over the SPARQL protocol via HTTP POST. Both protocols specify different operations performed via the HTTP POST method.

At first glance, it seems like this means mangling all POST to GET would result in an incompatibility with the SPARQL protocol.

If we are saying we don't want to leave clients (which follow a somewhat odd spec to my mind) in the dust, I don't think we are accomplishing that, and as Brandon outlined, POST in the Wikimedia multi-DC world has specific consequences.

Besides knowing that at some point length is a factor, do we have a demonstrable need for this functionality (which will be very nuanced in implementation)?

At first glance, it seems like this means mangling all POST to GET would result in an incompatibility with the SPARQL protocol.

I didn't suggest that. I suggested applying GET semantics to POST requests that indicate this via a marker in the query string of their Request-URI.

If we are saying we don't want to leave clients (which follow a somewhat odd spec to my mind) in the dust, I don't think we are accomplishing that, and as Brandon outlined, POST in the Wikimedia multi-DC world has specific consequences.

Would you describe how my suggestion doesn't accomplish that?

At first glance, it seems like this means mangling all POST to GET would result in an incompatibility with the SPARQL protocol.

I didn't suggest that. I suggested applying GET semantics to POST requests that indicate this via a marker in the query string of their Request-URI.

Most of the discussion in this ticket surrounds https://gerrit.wikimedia.org/r/#/c/243883/

If we are saying we don't want to leave clients (which follow a somewhat odd spec to my mind) in the dust, I don't think we are accomplishing that, and as Brandon outlined, POST in the Wikimedia multi-DC world has specific consequences.

Would you describe how my suggestion doesn't accomplish that?

I wasn't meaning to respond to the outline you provided necessarily, but is anyone being limited by this now? Are users submitting queries from IE at the moment that are limited in this way?

I am looking for a use case we are trying to solve. The sort of 'we are trying to do X now and can't because of this' that we can revolve the discussion around.

The first sentence of the task is the use case I looked at while coming up with my suggestion. To explain why these tools would use POST, I came up with the Request-URI length. But another workable explanation is that the tool just doesn't bother to differentiate between a write (update) and a read-only query, and thus, to support both, it always uses POST, which is a superset of what works with GET.

I am looking for a use case we are trying to solve.

The use case is users using a SPARQL query tool that only does POST. Unfortunately, such tools exist, and in non-negligible numbers.

At first glance, it seems like this means mangling all POST to GET would result in an incompatibility with the SPARQL protocol.

@chasemp I think you're missing an important point here - the lack of SPARQL Update support on the query endpoint is intentional. That's the whole point of the exercise - how we allow sending POST without allowing SPARQL Update. We do not want to be compatible with SPARQL Update on the query endpoint - in fact, we want to explicitly forbid it, since we don't want the whole internet to mess with our database (there's the wikidata.org site for that ;) That's why we do not want to allow anybody from outside to POST to Blazegraph. However, we do want to allow tools that use POST for SPARQL Query to do so. Unfortunately, distinguishing SPARQL Query from SPARQL Update and other update requests may be non-trivial to do from something like nginx, so it is much safer and easier to just never send POST to Blazegraph, thus ensuring we never produce an update.

We could, of course, take the stance that since REST dictates retrieval queries should go through GET, we only support GET. However, this stance sounds impractical and not user-friendly to me, as most users do not care about the purity of our REST track record (in fact, most of them have only the vaguest idea of REST and its requirements about POST/GET) and just want their SPARQL tool (which they probably use with a dozen other SPARQL endpoints by now) to work with the WDQS endpoint.

Answering @BBlack's question: we do not know why these tools only use POST, but most importantly, it's completely beside the point why they do it - whatever the reasons are, they do it, and we're not going to change that. So we can either support them or refuse to support them. I do not see any practical reason to refuse support given that we can do it.

As for cookies and the multi-DC setup, many non-browser clients would just ignore cookies completely anyway. For browser-based clients, I don't think we'd get traffic numbers in the foreseeable future that would make this a problem. The traffic numbers now are low, and if they rise, most of the traffic will be bots, tools and widgets feeding data from SPARQL, not browser traffic. If it ever does become a problem, it should be trivial to make an exception for requests coming to query.wikidata.org - they are easily identifiable. Also, these tools will never produce all or even a significant part of the overall traffic - they are user-friendly frontends, but the bulk of the work will be done by automated tools, just as a typical SQL database serves most traffic via programmatic connections, not via a web interface.

Smalyshev changed the task status from Open to Stalled. Nov 2 2015, 8:22 PM

Due to negative reviews, moving to Stalled.

I am looking for a use case we are trying to solve.

The use case is users using a SPARQL query tool that only does POST. Unfortunately, such tools exist, and in non-negligible numbers.

At first glance, it seems like this means mangling all POST to GET would result in an incompatibility with the SPARQL protocol.

@chasemp I think you're missing an important point here - the lack of SPARQL Update support on the query endpoint is intentional. That's the whole point of the exercise - how we allow sending POST without allowing SPARQL Update. We do not want to be compatible with SPARQL Update on the query endpoint - in fact, we want to explicitly forbid it, since we don't want the whole internet to mess with our database (there's the wikidata.org site for that ;) That's why we do not want to allow anybody from outside to POST to Blazegraph. However, we do want to allow tools that use POST for SPARQL Query to do so. Unfortunately, distinguishing SPARQL Query from SPARQL Update and other update requests may be non-trivial to do from something like nginx, so it is much safer and easier to just never send POST to Blazegraph, thus ensuring we never produce an update.

We could, of course, take the stance that since REST dictates retrieval queries should go through GET, we only support GET. However, this stance sounds impractical and not user-friendly to me, as most users do not care about the purity of our REST track record (in fact, most of them have only the vaguest idea of REST and its requirements about POST/GET) and just want their SPARQL tool (which they probably use with a dozen other SPARQL endpoints by now) to work with the WDQS endpoint.

Hey @Smalyshev, sorry you were ready to roll on this and it got locked up in debate. I appreciate what you are saying here, I really do. In large part it comes down to the idea of what is practical, as you said, and we have opposite ideas in this case. From my POV, here we are trying to hide the complexity of poorly written clients on our end, and we have few resources to do such things over the long term. We do make many decisions about how far down this gradient to travel in the pursuit of sane user expectations; SNI is a good example, where we are using it even though we know a not insignificant number of users are stuck in XP land. I'm not saying this is a 1:1 example.

I do think we are missing each other on the 'use case' bit here. I believe I understand this:

The use case is users using a SPARQL query tool that only does POST. Unfortunately, such tools exist, and in non-negligible numbers.

From this angle, is every poorly written tool a use case? We don't seem to have a backlog of users who are held up trying to use POST to make queries, and without specific examples to push this narrative along, I am viewing this as premature complexity. We know poorly written tools exist for everything. We generally don't try to head them off at the pass.

I did look through a bunch of SPARQL tools to get an idea, and there was at least one default POSTer that I found, but it can be changed. The majority of tools I see do the sane thing:

https://github.com/BorderCloud/SPARQL/blob/master/Curl.php#L244
https://github.com/RDFLib/sparqlwrapper/blob/master/SPARQLWrapper/Wrapper.py#L503

Not that this is a slam-dunk argument either way. Maybe we revisit in the near future, but at the moment I can't see my way to this making sense to support. If you feel strongly about this, bring it up in Scrum of Scrums to get more visibility?

It is not a question of resources. No resources are needed for it; it's already done, and the only thing missing is permission from Ops to actually run it. It costs us literally nothing but the willingness to do it.

Now we have a choice: we can serve our users, who asked us for this, however theoretically imperfect using POST for queries may be, or we can reject them, telling them their tools do not conform to our high theoretical standards, so they cannot get access to our data and there is nothing they can do about it (telling them to fix those tools is akin to telling people to fix Windows XP - it's just not going to happen; they are users, not the creators of those tools). I do not see anybody who benefits from such a rejection, but since it's not my decision, I don't see what else I can do here. The patch is there; if there is ever a decision to use it, it'll still be there. Until then, access to our data for those users will not be possible.

This is not only a question of the tools in use; it also imposes length restrictions on queries, which are transmitted in the GET URL.

To cite the Stackoverflow entry mentioned above: "Extremely long URLs are usually a mistake. URLs over 2,000 characters will not work in the most popular web browsers. Don't use them if you intend your site to work for the majority of Internet users."

Complex queries hit this limit quite easily (for examples from another application domain, see https://github.com/jneubert/skos-history/tree/master/sparql/stw). And we shouldn't enforce bad practices like stripping comments and indentation, one-character variable names and the like.

Another use case is queries with long VALUES lists as input (e.g. of names or codes from another application). Such clauses can also easily exceed any reasonable GET URL length.
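A rough illustration of how quickly a VALUES list outgrows a GET URL; the identifiers are made up and the property is only an example:

```
# Made-up identifiers; the point is only the resulting query length.
ids = [f"id{n:06d}" for n in range(400)]
query = (
    "SELECT ?item WHERE { VALUES ?id { "
    + " ".join(f'"{i}"' for i in ids)
    + " } ?item wdt:P227 ?id }"
)
print(len(query))  # already well past 2,000 characters, before URL-encoding
```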

POST isn't just theoretically imperfect. Proliferation of POST for what should be cacheable, read-only, idempotent queries is a serious long-term problem for us. Primarily it's that we can't cache them in Varnish like we should be able to (Varnish absorbs over 90% of all public traffic before things get down to the application layers), but as mentioned before it's going to affect how multi-DC request routing works as well, the current design of which is built on the idea that HTTP methods are used correctly (that we can sticky-route users to the primary writeable DC on POST for updates, and otherwise distribute their load fairly when they're doing only GETs and haven't done a POST in a while). Of course we already have cases like this, even in the MediaWiki API. But still, trying to stick to standards here isn't just a matter of theory trumping user requests. It's a matter of saving ourselves future pain when we debug performance and availability issues across our entire infrastructure as a whole, including this one of many application services.

AFAICS there is no general solution to this problem. You could call it a deficiency in the HTTP protocol itself. On the one hand, we have strong reasons to prefer that methods are used correctly (i.e. GET used for read-only idempotent operations, POST used for non-idempotent operations that modify server-side data). On the other hand, we face a serious restriction in the amount of data that can be passed in a GET query, which can only realistically be worked around with long POST body data.

There's no right answer, but I think answers that make use of GET are the better answers here (because they preserve the meaning of the methods). Most of the possible workarounds with GET basically boil down to:

  1. Find a way to keep the query strings under 2K (which should be reasonable for a whole lot of data models!). For this you can encode queries better (e.g. use short key names instead of long textual ones, etc.), you can compress the query strings prior to encoding, and so on (see the sketch after this list). Often it's simply a matter of a bad data model, or a very inefficient encoding of said data model into the URI. For cases where the basic entropy of the possible query strings is always going to exceed 2K even after efficient coding and compression (which is probably always going to be the case for an open-ended generic query language like SPARQL?), look at the other two options:
  2. Make it a two-phase operation: a POST of a large query to "save" the query itself for that user/session under some label/index, and then one or more GET operations after that which make use of the saved query and actually execute it for results. Persisting these (as opposed to just nailing a POST in front of every GET) would be ideal, as is letting users share/reuse common queries where it makes sense, etc.
  3. Use headers. Header limits are generally higher than URI limits, and usually the limit is tunable server-side, so you can for instance put the bulk of the data in a request header like X-SPARQL: .... (and then still do things from (1) to keep the size in check).
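A minimal sketch of the compression idea from option 1; the gzq parameter is invented, and a real deployment would need a matching decompression step on the server side:

```
import base64
import gzip
import urllib.parse

# A long, repetitive query body (stand-in for a real complex query).
query = (
    "SELECT ?item WHERE { VALUES ?id { "
    + " ".join(f'"id{n:06d}"' for n in range(400))
    + " } ?item ?p ?id }"
)

# Compress the query text and pack it into a (hypothetical) "gzq" parameter.
packed = base64.urlsafe_b64encode(gzip.compress(query.encode("utf-8"))).decode("ascii")
url = "https://query.example.org/sparql?gzq=" + urllib.parse.quote(packed)
print(len(query), "->", len(url))  # repetitive query text shrinks considerably
```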

The idea of compressing away comments and indents isn't "bad practice" - it would in fact be very good practice in this case. HTTP URIs are not text editors or source-code storage; they're essentially just the wire protocol for transmission of a compressible idea at this layer. If there's a problem with client tooling, we can submit patches to those client projects and/or nudge them in the direction of these arguments.

Yes. But: 1) This ticket is only about compatibility with existing clients we didn't create and don't control. Those clients are conforming implementations of SPARQL 1.1. 2) Options 2 and 3 in T112151#1818392 are not defined in the SPARQL 1.1 protocol. See http://www.w3.org/TR/2013/REC-sparql11-protocol-20130321/#query-operation . 3) That standard explicitly allows POST for read queries. (I'll spare you a long rant about W3C standards.)

Primarily it's that we can't cache them in Varnish like we should be able to

Caching SPARQL results via Varnish is not a good idea in most cases, for two reasons:

  1. Usually the same query is not repeated very often (unless somebody has a buggy script, etc.). Each client runs its own query, and once the result is returned, it goes away and another client runs another query.
  2. When a query is repeated, it is usually because the client expects something to have changed - like getting an update from Wikidata. Since we do not have any real way to invalidate Varnish when the underlying data changes, caching here would only hinder the clients.

Also, these results can be pretty big - so we'd be storing tons of information which is not useful. Of course, there are exceptions and some scenarios would still benefit from caching, but most would not.

but as mentioned before it's going to affect how multi-DC request routing works as well,

There are no plans for a multi-DC setup for WDQS, as far as I know.

AFAICS there is no general solution to this problem.

We do not need a general solution for the whole HTTP world. We need a particular solution for a specific practical issue, and we have everything we need to actually solve it - except the willingness to solve it.

But still, trying to stick to standards here isn't just a matter of theory trumping user requests.

I think in this particular case it is exactly that, as all objections so far, when applied to what happens with WDQS, appear to be only theoretical.

All proposed solutions require creating a whole new middleware layer, and I don't think anybody is ready to allocate resources to develop and maintain this layer.

If there's a problem with client tooling, we can submit patches to those client projects

Not all of those are open-source projects, and they would not spend resources on rewriting their tools just because one particular endpoint (ours) does not support POST. Also, who exactly would spend the time to learn these tools and develop and submit those patches?

Smalyshev changed the task status from Stalled to Open. Jan 13 2017, 12:48 AM

It looks like we have one more important case for POST support. When a federated query is used, the query size can get pretty big (since it contains a lot of values), and many tools switch to POST in this case. This includes Blazegraph itself, but I imagine other tools act the same way. The maximum URL size our server will accept is around 7K, which is way lower than what some federated queries would need.

Note that Blazegraph is happy to accept longer GET URLs; nginx would accept up to 8K by default, but this can be extended with large_client_header_buffers as far as I can see. It looks like the limitation I'm seeing now is on the Varnish/frontend side.
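For reference, the nginx side of that tuning would be a one-line change along these lines (the 16k value is only an example; the directive's default is 4 buffers of 8k):

```
# Sketch only: raise nginx's buffers for long request lines / headers so that
# longer GET query strings are accepted. Example value, not a recommendation.
large_client_header_buffers 4 16k;
```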

Examples of the federated query use case can be found at https://github.com/zbw/sparql-queries/tree/master/wikidata#power-queries (queries with econ_pers or ebds in the name). The intermediate set fed into Wikidata was about 450,000 GND IDs. Currently, that requires setting up a custom Wikidata endpoint.

Java/Blazegraph seems to have a URL length limit of around 8K, so we may have to support real POST on the Blazegraph side anyway.

Change 243883 merged by Gehel:
Allow SPARQL endpoint to be queried via POST

https://gerrit.wikimedia.org/r/243883

Please make a statement to clarify whether POST can be used in place of GET without downsides, or whether it should be limited to a workaround for long queries.

POST can be used with any queries, same as GET, though I'm not sure how it would be cached, so for now I'd recommend using POST only for big queries, unless you don't have a choice (e.g. the tool only supports POST). I'll post an update about caching once I know more.
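A minimal client-side sketch of that advice, using the SPARQLWrapper library linked earlier in this task; the 2,000-character threshold and the endpoint URL are examples, not WDQS limits:

```
from SPARQLWrapper import SPARQLWrapper, JSON, GET, POST

# Default to GET, fall back to POST for long queries. The threshold and the
# endpoint URL are example values only.
endpoint = SPARQLWrapper("https://query.wikidata.org/sparql")
endpoint.setReturnFormat(JSON)

def run(query: str):
    endpoint.setQuery(query)
    endpoint.setMethod(POST if len(query) > 2000 else GET)
    return endpoint.query().convert()

results = run("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 3")
print(len(results["results"]["bindings"]))
```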