
Consider using URL compression with pre-defined dictionary to shorten URLs; possibly use this instead of RESTBase POST storage
Closed, Declined (Public)

Description

Most use cases for the generic POST storage developed in T105975 are essentially about shortening URLs so that they fit comfortably within the ~2083 bytes supported by IE. Many of those use cases would exceed that limit only rarely.

While POST storage works, it has some significant disadvantages:

  • Each re-render needs to POST the same data to RESTBase.
  • POST data can't be freed, unless we'd embark on a very elaborate dependency tracking scheme.
  • There is some significant complexity in end points using it.

It would thus be interesting to avoid using it for applications where we can bring down the size of URLs through other means.

One such technique is to use compression. Naive compression on short strings does not achieve significant savings, as the amount of repetition is relatively small, and the compressor dictionary is fully embedded in the output. However, many URLs contain very predictable data, which can be leveraged to improve compression by pre-initializing the compressor dictionary with a string containing the most common request data tokens for this end point.

Simple example, using Node.js:

var zlib = require('zlib');

var options = {
  level: 9,
  // In real applications, this would be a bigger sample of typical request data for a specific end point.
  dictionary: new Buffer('https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts|pageimages|revisions&formatversion=2&redirects=true&exintro=true&exsentences=5&explaintext=true&piprop=thumbnail&pithumbsize=600&rvprop=timestamp&titles=')
};

// Note: URL differs from dictionary string (size and title)
var urlToCompress = 'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts|pageimages|revisions&formatversion=2&redirects=true&exintro=true&exsentences=5&explaintext=true&piprop=thumbnail&pithumbsize=640&rvprop=timestamp&titles=Master_System';

zlib.deflateSync(urlToCompress, options).toString('base64')
'ePlS/1mMyxgmvjTB7UvfxOKS1KL44EoglQsA6g9dvA=='

// Same, but using modified base64 to avoid percent encoding: https://en.wikipedia.org/wiki/Base64#URL_applications
zlib.deflateSync(urlToCompress, options).toString('base64').replace(/[/=+]/g, function(x) { return {'/': '_','+':'-','=':'.'}[x]; })
'ePlS_1mMyxgmvjTB7UvfxOKS1KL44EoglQsA6g9dvA..'

In this (admittedly somewhat optimistic) example, we achieved a compression from 240 bytes to 44 bytes, or a ratio of 18.3%. While other use cases won't match substrings as long as in this example, many will still see significant savings simply by avoiding percent-encoding overhead.

Zlib is widely available on many platforms. The main consideration is that producers and consumers will need to use the same dictionary string to compress URLs for a given entry point. This introduces a tight coupling of configurations, but arguably this coupling isn't any tighter than that induced by POST storage.
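
For completeness, the consumer side has to inflate with the exact same dictionary (the FDICT bit in the zlib header refers to it by checksum). A minimal sketch continuing the example above:

// Reverse the modified base64, then inflate with the same dictionary.
var compressed = 'ePlS_1mMyxgmvjTB7UvfxOKS1KL44EoglQsA6g9dvA..';
var buf = new Buffer(compressed.replace(/[_\-.]/g, function (x) {
  return { '_': '/', '-': '+', '.': '=' }[x];
}), 'base64');
zlib.inflateSync(buf, { dictionary: options.dictionary }).toString();
// => the original urlToCompress string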

Figuring out whether this works for a given entry point

To evaluate whether the compression ratios achievable are good enough for a given entry point, we'll need to construct a dictionary string for that entry point. Given some representative example data, tools like dictator can be used to compute a near-optimal dictionary string (Example usage: go run dictator.go 32768 /path/to/testdata-dir /tmp/test.dict). Alternatively, a manual approximation will work, too. Then, we'll need to test compression ratios for a representative sample of requests, focusing especially on the largest expected requests.
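
A rough sketch of such a measurement in Node.js (the dictionary file and sample directory paths below are just the ones from the dictator example; adjust as needed):

var fs = require('fs');
var path = require('path');
var zlib = require('zlib');

// Dictionary produced by dictator, or a hand-built approximation.
var dict = fs.readFileSync('/tmp/test.dict');
var dir = '/path/to/testdata-dir';

var totals = { raw: 0, nodict: 0, dict: 0, maxNodict: 0, maxDict: 0 };

fs.readdirSync(dir).forEach(function (name) {
  var data = fs.readFileSync(path.join(dir, name));
  var plain = zlib.deflateSync(data, { level: 9 }).length;
  var withDict = zlib.deflateSync(data, { level: 9, dictionary: dict }).length;
  totals.raw += data.length;
  totals.nodict += plain;
  totals.dict += withDict;
  totals.maxNodict = Math.max(totals.maxNodict, plain);
  totals.maxDict = Math.max(totals.maxDict, withDict);
});

console.log('nodict / raw:', (totals.nodict / totals.raw).toFixed(3));
console.log('dict / raw:  ', (totals.dict / totals.raw).toFixed(3));
console.log('max sizes (nodict vs dict):', totals.maxNodict, totals.maxDict);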

@Yurik, @Physikerwelt: Does this sound like an interesting experiment to you?

Event Timeline

GWicke raised the priority of this task from to Needs Triage.
GWicke updated the task description.
GWicke subscribed.
GWicke added a project: RESTBase.
GWicke edited subscribers, added: Eevans; removed: Aklapper.
GWicke renamed this task from Investigate feasibility of using URL compression with pre-defined dictionary instead of post storage to Consider using URL compression with pre-defined dictionary instead of post storage. Nov 6 2015, 8:21 PM
GWicke renamed this task from Consider using URL compression with pre-defined dictionary instead of post storage to Consider using URL compression with pre-defined dictionary to shorten URLs; possibly use this instead of RESTBase POST storage. Nov 6 2015, 9:17 PM

@GWicke: Let me evaluate whether compressing math input TeX would be an option...

The longest formula I could find on Wikipedia is 2117 bytes of uncompressed user input text. If I use the Windows 8 zip program, the result is 570 bytes.
That said, it's an option for the formulae currently used on the English Wikipedia.
See some example formulae here
https://gist.github.com/physikerwelt/37d86b8440c09e033fe0
In general this would simplify things significantly.

@Physikerwelt: Thanks for looking into this! Your results are very encouraging.

The JSON file you uploaded isn't completely valid. I can try to massage it into submission, but if you have a way to upload a valid JSON file, then that would be great. I can then do some tests with a pre-defined dictionary for math.

Edit: Managed to massage it, so don't worry.

@GWicke: Just FYI, I created this file using the export-to-JSON function of the PhpStorm database plugin... unfortunately there was no option to add the missing string delimiters.

@Physikerwelt: Here are the results from a pass over the 500 largest formulas:

{
  "urlenc": {
    "avg": 1759.358,
    "sum": 879679,
    "max": 7261
  },
  "raw_tex": {
    "avg": 949.206,
    "sum": 474603,
    "max": 4979,
    "of_urlenc": 0.5395183925045386,
    "of_tex": 1
  },
  "nodict": {
    "avg": 241.156,
    "sum": 120578,
    "max": 840,
    "of_urlenc": 0.13707045410882834,
    "of_tex": 0.2540607623634912
  },
  "nodict_b64": {
    "avg": 322.88,
    "sum": 161440,
    "max": 1120,
    "of_urlenc": 0.18352148908863347,
    "of_tex": 0.34015798467350605
  },
  "dict": {
    "avg": 172.136,
    "sum": 86068,
    "max": 721,
    "of_urlenc": 0.09784023490386834,
    "of_tex": 0.18134735768631888
  },
  "dict_b64": {
    "avg": 230.832,
    "sum": 115416,
    "max": 964,
    "of_urlenc": 0.13120240451346457,
    "of_tex": 0.24318430351262002
  }
}

@GWicke: Thanks for testing. Now we have good data estimates to discuss.

So we have a 15-20% size advantage, if we use the dict. But the price for that is that every client would need to know the dictionary, right?

Given that 1120 is not too close to 2083, I think it would be easiest to use generic (no-dict) compression. Especially for maintenance and debugging, the dictionary is an additional burden.

@Physikerwelt: I think it's worth looking at a larger sample size before making the decision. You are right that the effect for the largest math inputs isn't that huge, but the savings will be greater with smaller formulas, which are very common. Upload bandwidth is normally scarce, so keeping an eye on average size is important, too.

So we have a 15-20% size advantage, if we use the dict. But the price for that is that every client would need to know the dictionary, right?

Not necessarily, as we can still support POST and possibly basic compression as well. Example:

  • GET /media/math/{format}/z/<deflate compressed data>
  • GET /media/math/{format}/z1/<deflate + dict compressed data>, using dictionary 'z1'
  • POST /media/math/{format}

Just POST and optimized compression probably cover most use cases.
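
For illustration, client code for picking among these variants could look roughly like this (the dictionary content and the 2000-byte cutoff are placeholder assumptions, not an agreed-upon API):

var zlib = require('zlib');

// Hypothetical shared dictionary 'z1' for the math end point.
var mathDict = new Buffer('\\frac{}{} \\sqrt{} \\left( \\right) \\begin{align} \\end{align}');

function toUrlSafeBase64(buf) {
  return buf.toString('base64').replace(/[/+=]/g, function (c) {
    return { '/': '_', '+': '-', '=': '.' }[c];
  });
}

// Pick the request variant for a formula; the 2000-byte cutoff is an
// illustrative stand-in for the ~2083-byte IE limit.
function requestFor(format, tex) {
  var z1 = toUrlSafeBase64(zlib.deflateSync(tex, { level: 9, dictionary: mathDict }));
  var url = '/media/math/' + format + '/z1/' + z1;
  if (url.length <= 2000) {
    return { method: 'GET', uri: url };
  }
  // Rare oversized inputs fall back to POST storage.
  return { method: 'POST', uri: '/media/math/' + format, body: tex };
}

// e.g. requestFor('svg', '\\sqrt{x^2 + y^2}')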

I'm not too happy about a dictionary because it means the service can never change its syntax. For example, if graph v2 gets a new syntax, we won't be able to simply recalculate the dictionary, as we are forever stuck supporting older URLs that rely on it, and thus forever stuck with less than optimal compression.
Storing data in the URL means that we have to round-trip the data to the client, which goes against our push for efficiency, especially on mobile and slow connections. 2 kB of uncompressible data per link seems excessive compared with the ~10? bytes of a short compressed URL with an ID.
From my understanding, the goal of POST storage is to have a distributed store to easily share data between service(s) and MediaWiki. The current setup enforces a MediaWiki-centric data model, with the database accessible only via PHP code. This forces each service to also implement MW API access to get at the data when it needs it.
POST storage doesn't have to be 100% fault tolerant, as it can be refreshed whenever an article is rendered. Thus it can also expire items by setting a TTL > 1 month.

@GWicke: All right. Here is another sample: https://gist.github.com/physikerwelt/2ae25766e991e0d7a1b2 (I called the file jso to indicate that it's not JSON yet).
I would be surprised if we would see a larger effect there. But it would be a good surprise.

Results for the 500 random math tag sample:

{
  "urlenc": {
    "avg": 88.356,
    "sum": 44178,
    "max": 785
  },
  "raw_tex": {
    "avg": 50.236,
    "sum": 25118,
    "max": 487,
    "of_urlenc": 0.5685635384127846,
    "of_tex": 1
  },
  "nodict": {
    "avg": 44.214,
    "sum": 22107,
    "max": 210,
    "of_urlenc": 0.5004074426184979,
    "of_tex": 0.8801258061947608
  },
  "nodict_b64": {
    "avg": 60.232,
    "sum": 30116,
    "max": 280,
    "of_urlenc": 0.6816967721490335,
    "of_tex": 1.1989808105740902
  },
  "dict": {
    "avg": 36.718,
    "sum": 18359,
    "max": 172,
    "of_urlenc": 0.41556883516682513,
    "of_tex": 0.7309101043076678
  },
  "dict_b64": {
    "avg": 50.272,
    "sum": 25136,
    "max": 232,
    "of_urlenc": 0.5689709810312825,
    "of_tex": 1.0007166175650928
  },
  "dict_b64_min": {
    "avg": 41.286,
    "sum": 20643,
    "max": 222,
    "of_urlenc": 0.46726877631400243,
    "of_tex": 0.8218409109005494
  }
}

The new dict_b64_min category is base64 of dict-compressed binary, with

  • the first 8 base64 characters (the 6-byte zlib header: magic bytes plus the dictionary checksum) omitted, and
  • optional trailing '=' from base64 removed.

The average length using this scheme is 41.2 bytes, only 1.2 bytes longer than a hex-encoded sha1, and 13.2 bytes longer than a base64-encoded sha1. This scheme is also about 32% smaller than basic deflate + base64's average of 60.2 bytes.
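
For illustration, a dict_b64_min round trip could be implemented roughly as below. This is an assumption-laden sketch, not the script used for the numbers above: mathDict stands in for the real math dictionary, and the stripped 8 base64 characters correspond to the 6-byte zlib header (two magic bytes plus the 4-byte dictionary Adler-32), which is the same for every message compressed with the same options and can therefore be recomputed by the consumer.

var zlib = require('zlib');

// Placeholder for the shared math dictionary.
var mathDict = new Buffer('\\frac{}{} \\sqrt{} \\left( \\right) \\mathbb{} \\begin{align} \\end{align}');
var opts = { level: 9, dictionary: mathDict };

// The first 6 bytes (zlib magic + dictionary Adler-32) depend only on the
// options, so their 8-character base64 form can be computed once up front.
var headerB64 = zlib.deflateSync(new Buffer(''), opts).toString('base64').slice(0, 8);

function toUrlSafe(b64) {
  return b64.replace(/=+$/, '').replace(/[/+]/g, function (c) {
    return c === '/' ? '_' : '-';
  });
}

function fromUrlSafe(s) {
  return s.replace(/[_-]/g, function (c) { return c === '_' ? '/' : '+'; });
}

// Encode: deflate with the dictionary, drop the 8-char header prefix,
// drop the base64 padding, and make the rest URL-safe.
function encodeMin(tex) {
  return toUrlSafe(zlib.deflateSync(tex, opts).toString('base64').slice(8));
}

// Decode: re-add the known header and inflate with the same dictionary.
function decodeMin(param) {
  var buf = new Buffer(headerB64 + fromUrlSafe(param), 'base64');
  return zlib.inflateSync(buf, { dictionary: mathDict }).toString();
}

// decodeMin(encodeMin('\\sqrt{x^2 + y^2}')) === '\\sqrt{x^2 + y^2}'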

I think it's clear that deflate or deflate + dict are good options for math. Especially with the optimized deflate scheme, average sizes are comparable to hashes. Max sizes are comfortably within the limits using either deflate version. It also seems that the limits are a bit more generous than originally assumed: This MSDN article mentions that links and likely src attributes up to 5kb work in IE >= 8.

I'm not too happy about a dictionary because this means the service can never change its syntax.

Services can still change their syntax at any time. At worst, the compression ratio using a specific dictionary would lose a part of its edge when the format changes. The ratio should never be worse than using no dictionary at all.

We can combine basic deflate without specific dictionaries with the use of dictionaries for consumers interested in the substantial gains. Thanks to the deflate spec, we can detect which version is used based on the first two bytes of the data: base64-encoded deflate without dict is guaranteed to start with 'eN', while dictionary-compressed deflate starts with 'eP'. The default scheme follows this with a 32-bit adler checksum of the dictionary for a total header size of 6 bytes (8 bytes in base64). We could however save seven of those bytes by indicating the use of specific dictionaries in the first byte, using any character but 'e'.
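
As a sketch of that detection logic (assuming the modified base64 from the description and a single shared dictionary; the 'eN'/'eP' prefixes hold for level-9 deflate as used in the examples above):

var zlib = require('zlib');

function inflateUrlParam(param, dict) {
  // Undo the URL-safe base64 mapping.
  var b64 = param.replace(/[_\-.]/g, function (c) {
    return { '_': '/', '-': '+', '.': '=' }[c];
  });
  var buf = new Buffer(b64, 'base64');
  if (b64.slice(0, 2) === 'eN') {
    // Plain deflate, no preset dictionary.
    return zlib.inflateSync(buf).toString();
  }
  if (b64.slice(0, 2) === 'eP') {
    // Deflate with a preset dictionary; zlib checks the embedded Adler-32
    // against the supplied dictionary.
    return zlib.inflateSync(buf, { dictionary: dict }).toString();
  }
  // A future scheme that spends only one byte on the dictionary marker
  // (any first character but 'e') would be dispatched here.
  throw new Error('Unknown compression scheme');
}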

Whether the added complexity for dictionaries is worth it for a specific end point depends on the number of consumers, as well as the actual gains seen for the specific use case. This trade-off will likely differ between applications / API end points. In any case, we don't need to make this decision up-front. We can start with basic gzip, and add dictionary optimizations incrementally.

@GWicke All in all this looks very promising.
Are there open questions or technical challenges that need to be resolved?
Otherwise, I would prefer this method over the x-resource-location hash (https://gerrit.wikimedia.org/r/#/c/245478/28/MathMathML.php) for the math extension, for the following reasons:

  1. The link to the PNG or SVG image would still contain the information about what is supposed to be displayed in the image. In the past we had bug reports that only included the link to the image, so it was impossible to get the input without accessing the storage.
  2. This will make the math renderer stateless, which simplifies debugging, testing and code maintenance.
  3. About 42 bytes per formula seems to be the correct answer for questions that might occur in the future;-)

In conclusion, I'd suggest implementing your proposal for mathoid on top of https://gerrit.wikimedia.org/r/#/c/230080/
@mobrovac: What do you think?

I think this approach significantly complicates the SLA between RESTBase and its clients, since clients would need to:

  • implement special data-massaging functionality to issue a request
  • agree with RESTBase and/or the corresponding back-end service on the dictionary to use and (more importantly) keep it in sync

Additionally, the majority of POST storage requests fall in the < 2kB category (as also outlined in the task's description) and all of them are internal requests (since POST storage is limited to internal IPs only).

Also, there are genuine cases where the POST storage is used as - storage. One such case is the Graph extension which wants to store graph definitions in RESTBase.

I wouldn't go as far as to say that the current POST storage model has significant disadvantages:

  • Each re-render needs to POST the same data to RESTBase.

Re-renders can be triggered with a GET request using the hash as well. If the client obtained the hash, it is implied RESTBase has got the original POST data in storage, so it is able to retrieve it and use it to send a request to the appropriate back-end service.

  • POST data can't be freed, unless we'd embark on a very elaborate dependency tracking scheme.

Given that the size of the vast majority of these requests is less than a couple of kBs, the impact on storage is minimal.

  • There is some significant complexity in end points using it.

I'm inclined to say that using both client- and server-side data massaging and keeping the way of doing it in sync is more complicated than what we have now: a back-end entity provides the data, RESTBase hashes it and stores it, and from that point onwards all of the actors involved can deal with the content's hash exclusively.

I'm not at all convinced that this SLA complication is needed.

I think we should have a discussion about some shared storage service, ideally without an extra inter-process call. Basically, we need a way for both MediaWiki and node services to store and retrieve the same information. This way the Graph MW extension can store data and graphoid can later get it without going via the public API. BTW, this could be direct access to the MySQL cluster, as long as we have a standard shared lib that does it.
P.S. Wow @mobrovac, we agreed on something!? Are you sure?))

implement special data-massaging functionality to issue a request

Actually, calling zlib.deflate is simpler and faster than performing a POST request & extracting the hash from the response. It is a single line of code in most environments, and there is no need to wait for the network. This matters especially in environments with little support for parallelism like MediaWiki.
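
To make the comparison concrete (the POST flow is paraphrased, not the exact RESTBase API):

var zlib = require('zlib');
var requestData = '{"some": "graph definition"}'; // placeholder payload

// Local compression: synchronous, one line, no network round trip.
var key = zlib.deflateSync(requestData, { level: 9 }).toString('base64');

// POST storage (pseudo-code): an extra network request whose response has
// to be awaited before the final URL can be built.
//   POST the payload to the storage end point
//   read the returned hash (e.g. an x-resource-location header)
//   build the URL from that hash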

agree with RESTBase and/or the corresponding back-end service on the dictionary to use and (more importantly) keep it in sync

We don't need to use a dictionary. If we choose to use one, then it is no harder than adding a string constant to the code. If we can remember where to POST things, then we can probably also remember a string constant.

Also, there are genuine cases where the POST storage is used as - storage.

There are good use cases for storage, but there are also others that don't fit so well. One-size-fits-all isn't necessarily the best solution where the average compressed data size is around that of a hash.

Given that the size of the vast majority of these requests is less than a couple of kBs, the impact on storage is minimal.

The graph definitions are indeed fairly small, but they are large enough to warrant spending time on special preview stashing functionality. This does have a complexity cost in the hundreds of lines of code, which I think we shouldn't pay where this complexity isn't absolutely needed.

Additionally, the majority of POST storage requests fall in the < 2kB category

We have established that this is true for math (with compression), but is this also true for graphoid? What is the mean / maximum size of a graphoid graph definition, and how quickly will those graph definitions change?

It might well be that storing those graphs forever is no problem, but I think it would be good to have at least a rough bound on the amount of data we are talking about per year.

From IRC: [12:38] <yurik> gwicke, out of 23.5 thousand graphs we have at the moment, average size is 1077, and the average compressed (each with gzip) is 467. The largest are 26506 (3253 compressed)

Using base64 on the gzipped data gives an average of 634 bytes.

Results for the full graphoid graph dataset:

{
  "urlenc": {
    "avg": 2085.4786786276095,
    "sum": 49052544,
    "max": 52647
  },
  "raw_json": {
    "avg": 1077.60103737086,
    "sum": 25346254,
    "max": 26030,
    "of_urlenc": 0.5167164010902269,
    "of_json": 1
  },
  "nodict": {
    "avg": 392.2211640661537,
    "sum": 9225434,
    "max": 3040,
    "of_urlenc": 0.18807248814658828,
    "of_json": 0.36397623096493864
  },
  "nodict_b64": {
    "avg": 524.2957357255219,
    "sum": 12331960,
    "max": 4056,
    "of_urlenc": 0.25140306688272884,
    "of_json": 0.48653974666236677
  },
  "dict": {
    "avg": 259.8412907614472,
    "sum": 6111727,
    "max": 2892,
    "of_urlenc": 0.12459551537225062,
    "of_json": 0.24112939923982454
  },
  "dict_b64": {
    "avg": 347.7969474086986,
    "sum": 8180532,
    "max": 3856,
    "of_urlenc": 0.16677079989979723,
    "of_json": 0.32275112527476446
  },
  "dict_b64_min": {
    "avg": 338.79052761362186,
    "sum": 7968692,
    "max": 3848,
    "of_urlenc": 0.1624521655798321,
    "of_json": 0.314393282731247
  }
}

Just for fun, here are the results when using the math dict (full of latex, no JSON) to compress graphoid graphs:

 "graphs_using_nodict": {
   "avg": 392.2211640661537,
   "sum": 9225434,
   "max": 3040,
   "of_urlenc": 0.18807248814658828,
   "of_json": 0.36397623096493864
 },
"graphs_using_math_dict": {
   "avg": 395.66374728965604,
   "sum": 9306407,
   "max": 3038,
   "of_urlenc": 0.18972322821829588,
   "of_json": 0.36717090422908255
 }
 "graphs_using_graph_dict": {
   "avg": 259.8412907614472,
   "sum": 6111727,
   "max": 2892,
   "of_urlenc": 0.12459551537225062,
   "of_json": 0.24112939923982454
 }

So, using a completely bogus dict doesn't help, but also doesn't hurt.

So yes, a wrong dictionary does add to the average size, but not very much - about 0.9%. Regardless, I feel we are going down the wrong route - we are increasing the page size, whereas we should be decreasing it. These extra kilobytes add up very quickly, especially on mobile, and especially when they are uncompressible.

My understanding of the goals

  • A user should be able to view an old page revision with all additional resources (e.g. graph) - as of the moment when the newer page revision was added
  • For the "HEAD" revision of the page, and ONLY for it, the page should be in a flux state - changes to any templates, or external resources such as images, should be reflected in the HEAD revision automatically.
  • We want to optimize both storage and cache, so that if two revisions contain a reference to an identical resource, that resource is only stored once, and has the same URL to improve browser and server-side caching.

So it seems we need two mechanisms - one to keep the HEAD revision up to date (null edits), and another for archiving - to preserve the current state of HEAD, plus all referenced resources/images/graphs/etc., before adding a new revision. It is OK if there was a rendering error in an old revision - because that is an accurate representation of the past. TBD: a mechanism to manually delete old images for legal reasons.

I don't think this is relevant anymore.