Page MenuHomePhabricator

Investigate query parameter normalization for MW/services
Open, NormalPublic

Description

In parsing a few short log samples of all received URLs, I've noticed that we definitely do have inconsistencies in query parameter sorting. This in turn leads to cache duplication, which wastes space, wastes cache hit opportunities, and makes any related purging harder (in the case that they're even purgeable or purged).

Most of the examples I notice have to do with the relative ordering of the parameters action, ctype, feed, feed, format, printable. Some are mobileview queries, too. We could endeavor to "fix" this on the client side (in client app code, and in the parameterized URLs we emit in our own content output). We could also normalize it on reception at the cache layer by sorting all parameters, e.g. with https://github.com/vimeo/libvmod-boltsort . The biggest question-mark is whether we think all of our code is insensitive to query param order or not. It probably should be, but that's not necessarily true.

Relatedly, we could also normalize and/or fixup (again, in our output or clients) the unnecessary use of defaulted parameters. The most-obvious example is debug=false for load.php, which is commonly present and also probably the default, but redundant with the same query that might come in without the debug flag.

Event Timeline

BBlack created this task.Jun 17 2016, 4:44 PM
Restricted Application added a project: Operations. · View Herald TranscriptJun 17 2016, 4:44 PM
Restricted Application added subscribers: Zppix, Aklapper. · View Herald Transcript

For reference, here's the raw data at the end of some munging of a 5-minute sample of received URLs in a single cache_text node:

action=history&feed=atom&title=ONETITLE
action=history&feed=rss&title=ONETITLE
action=mobileview&format=json&noheadings=true&onlyrequestedsections=1&page=ONEPAGE&prop=text%7Csections&sectionprop=toclevel%7Cline%7Canchor&sections=1-
action=parse&format=json&page=ONEPAGE
action=parse&format=json&page=ONEPAGE&prop=text
action=query&format=json&meta=siteinfo&siprop=statistics
action=query&format=php&meta=siteinfo&siprop=namespaces
action=raw&ctype=text/css&title=ONETITLE
action=raw&ctype=text/javascript&title=ONETITLE
action=raw&title=ONETITLE
action=render&title=ONETITLE
printable=yes&title=ONETITLE
proto=https&type=1x1&wikiid=ONEWIKI

These are after stripping away everything but the query-string, replacing some obvious things like title/wikinames to find more common bits, and then de-duplicating, sorting the params, then looking for further duplicates. I don't have any idea how common they are, but those are the patterns that turned up quickly for "these params are at least sometimes sent in multiple orderings".

The default-parameter problem probably deserves it's own separate ticket and solution. Except for egregious legacy cases, we're probably not going to strip/add params to fix the defaulting issue at the VCL layer. Depending on your point of view on the problem, you can fix that by making clients more-verbose or simply not having defaults (if a code parameter becomes a query parameter at all, it has no default value and the parameter is required).

It would be interesting to know if query parameters end up being serialized in alphabetical order, which would make sorting them quite useless if so. From what I've investigated at least in mw desktop / web the order of the querystring seems to be given by how the parameters were laid out in the code.

http://jsbin.com/xojehakelo/edit?js,console

console.log($.param({a: 1, b: 2}))
// "a=1&b=2"
console.log($.param({b: 1, a: 2}))
// "b=1&a=2"

Meaning there should be some value in sorting the querystring params in the caching layer.

I'm not sure how apps end up serializing parameters, going to ping @Mholloway and @Fjalapeno for example to see if they can shine some light on how the query strings get constructed when querying the apis (alphabetical order or ordered as seen in the source?)


As of the rest of queries in the description, I believe they all do different things and may change the response a little bit sadly.

The only one may be ctype which doesn't change the response content but the content-type header on the response. If this is very widely used it may be an optimization interesting to apply.

My largest concern is still whether we can guarantee our application layer is insensitive to query parameter order, at least for MW and RB.

AFAIK I don't think that'd be a problem. Maybe @dr0ptp4kt has more insight.

Mholloway added a comment.EditedAug 25 2016, 5:22 PM

I'm not sure how apps end up serializing parameters, going to ping @Mholloway and @Fjalapeno for example to see if they can shine some light on how the query strings get constructed when querying the apis (alphabetical order or ordered as seen in the source?)

On Android it's as seen in the source; there's no reordering going on. The ordering is basically arbitrary but consistent. On the other hand, of course things occasionally change, and we can assume people will be running older version(s) with different params/ordering essentially forever. (For example, around a year ago we updated our mobileview API calls for page content to use a modern Android networking library, which changed the parameter order slightly; I'm sure there's a non-negligible number of clients out there from before the switch still hitting the API with the old param order.)

In short, I agree that sorting parameters in the caching layer seems most promising.

My largest concern is still whether we can guarantee our application layer is insensitive to query parameter order, at least for MW and RB.

My experience has been that MediaWiki is indifferent regarding parameter order but I'd also want to get a second opinion from @dr0ptp4kt on that to be sure.

As for RESTBase/MCS, parameters are in a defined structure in the request URL path[1] (and, as expected, I don't see any RESTBase or MCS queries in the sample), so I think we're safe there.

[1] See, e.g., the current endpoints on enwiki: https://en.wikipedia.org/api/rest_v1/

On iOS parameters are constructed using a dictionary which is unordered. (There is no concept of an ordered dictionary on iOS)

I agree, sorting in the caching layer seems promising

I haven't come across a case where the name-value pair ordering in the URL or form data is material. @Anomie @Tgr do you know of any?

I wonder if this would be a simple RFC.

In case I missed it in the context of the task, he actual | pipe concatenated values are probably sortable most of the time, too. I'm unsure how those may or may not be used in building hash structures at the origin or elsewhere.

Tgr added a comment.Aug 25 2016, 10:04 PM

The action API is not cached in Varnish unless the client sets the smaxage query parameter (in which case they probably know what they are doing and do apply some kind of parameter sorting). That's because there is no way to invalidate the results; once XKey is installed on the text Varnishes we should probably revisit that (that's T122867: Evaluate the feasibility of cache invalidation for the action API).

I don't know of any situation where parameter reordering would cause problems (short of Don't Do That Then edge cases where the same parameter is used multiple times). Pipe-separated strings cannot be reordered in Varnish because you don't know whether they are actually pipe-separated or a single parameter that happens to contain pipe characters. Also reordering them might change the output (sometimes there is a pipe-separated list of IDs and the objects are returned by the API in the same order). Making the common API client libraries (mw.Api, pywikibot etc) alphabetically sort parameters unless the caller opts out would be a breaking change but might be doable if it's worthwhile.

If we have anything that does multiple correlated foo[]-style arrays, that would be ordering dependent. For example, ?foo[]=100&bar[]=Z&foo[]=200&bar[]=Y would break if turned into ?bar[]=Y&bar[]=Z&foo[]=100&foo[]=200. I don't know if we have anything that actually does this rather than ?foo[0]=100&bar[0]=Z&foo[1]=200&bar[1]=Y though. I do know that Special:SecurePoll/create does the latter, safe version.

The same goes for a single such array if the array itself is order-sensitive. Again, I don't know of anything that actually does this, although the InputBox extension's 'preloadparams' field looks suspicious.

Lack of defined ordering in various languages' dictionaries doesn't necessarily protect you here since the value passed in for such a parameter is likely to be an array rather than a dictionary, e.g. { "foo": [ 1, 2, 3 ] }foo[]=1&foo[]=2&foo[]=3.

If anything iterates over $_GET or $_POST (or the array from WebRequest::getValues() or the like) for some reason, that might also be ordering-dependent. It also seems liable to be fragile, so perhaps we can ignore this one.

We sometimes put the CSRF token last in the form so if the form is truncated somehow the CSRF token will be missing and the submission will be rejected. We hopefully don't have to worry about undetected truncation between our own frontends and backends, though.

If we have anything that does multiple correlated foo[]-style arrays, that would be ordering dependent. For example, ?foo[]=100&bar[]=Z&foo[]=200&bar[]=Y would break if turned into ?bar[]=Y&bar[]=Z&foo[]=100&foo[]=200. I don't know if we have anything that actually does this rather than ?foo[0]=100&bar[0]=Z&foo[1]=200&bar[1]=Y though. I do know that Special:SecurePoll/create does the latter, safe version.

So long as our sorter preserves the relative order of duplicated parameter names (where name means everything left of the equal sign), we should be able to avoid all such problems, right?

So long as our sorter preserves the relative order of duplicated parameter names (where name means everything left of the equal sign), we should be able to avoid all such problems, right?

Yes, that would avoid both potential problems when using foo[]-style arrays.

Apparently v3's libvmod-boltsort and v4's std.querysort() both fail to do so. Nothing about the handling of duplicate param names (or for that matter the order-sensitivity of params in general) is standardized at the HTTP level, so this is all grey-area on standards. But it seems multiple server-side languages support some variant of this, including some variants that build an array foo by simply doing ?foo=1&foo=2&foo=3. It might make sense for us to put in a pull request to varnish4 to at least optionally support doing a stable sort that considers identical parameter names to be matches for sorting purposes. It's probably not worth messing with this in v3 for its remaining few months here. Worst case we could simply skip sort-normalization in v3 for any query-string that contains brackets.

Worst case we could simply skip sort-normalization in v3 for any query-string that contains brackets.

That sounds very acceptable.

Thanks everybody for chiming in.

BBlack moved this task from Triage to Caching on the Traffic board.Oct 3 2016, 3:08 PM
Gilles added a subscriber: Gilles.Nov 9 2016, 2:45 PM
GWicke moved this task from Backlog to watching on the Services board.Jul 12 2017, 5:19 PM
GWicke edited projects, added Services (watching); removed Services.