Consider opening up POST /media/math/{format} to external users
Closed, ResolvedPublic

Description

Current Status

As of T102030: Document and hook up public mathoid end point in RB, RESTBase handles, proxies and stores POST requests to Mathoid. The logic is as follows:

  1. Clients send a POST request to RESTBase at https://{domain}/api/rest_v1/media/math/{format}, with the same request body they would send to Mathoid directly.
  2. RESTBase checks whether the request body has been encountered before (i.e. is already stored). If so, the stored result is returned to the client directly.
  3. Otherwise, RESTBase sends a POST request to Mathoid, which renders the given formula.
  4. RESTBase stores the complete result (containing all available formats).
  5. The desired render is returned to the client, together with the X-Resource-Location header, which contains the hash referencing the request.
  6. On subsequent requests, the client uses GET /media/math/{format}/{hash} to obtain the same formula rendered in a different format. Such requests are served exclusively by RESTBase.
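
The flow above can be sketched as a small in-memory model. This is only an illustration, not RESTBase's implementation: the class name, the SHA-256-over-canonical-JSON hashing scheme, the `type`/`q` body fields and the `mml`/`svg` format names are all assumptions made for the example, and `fake_render` is a placeholder for Mathoid.

```python
import hashlib
import json


class MathStore:
    """Toy stand-in for RESTBase's storage in front of a slow renderer."""

    def __init__(self, render):
        self._render = render    # hypothetical callable: body -> {format: output}
        self._results = {}       # request hash -> all rendered formats
        self.render_calls = 0    # how many requests actually reached the renderer

    @staticmethod
    def request_hash(body):
        # Assumed scheme: SHA-256 over canonical JSON, so key order is irrelevant.
        canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def post(self, fmt, body):
        """Steps 1-5: render on first sight, serve from storage afterwards."""
        key = self.request_hash(body)
        if key not in self._results:
            self.render_calls += 1
            self._results[key] = self._render(body)
        headers = {"X-Resource-Location": key}
        return self._results[key][fmt], headers

    def get(self, fmt, key):
        """Step 6: served from storage only; never touches the renderer."""
        return self._results[key][fmt]


def fake_render(body):
    # Placeholder for Mathoid: returns every available format at once.
    q = body["q"]
    return {"mml": f"<math>{q}</math>", "svg": f"<svg>{q}</svg>"}
```

A client would call `store.post("mml", body)` once, read the hash from the `X-Resource-Location` header, and then use `store.get("svg", hash)` for any other format; repeated POSTs with the same body leave `render_calls` unchanged.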

The rationale here is that the Math extension would use the POST /media/math/{format} endpoint to obtain an initial MathML render of a formula on page save, and embed GET /media/math/{format}/{hash} calls in the page as fall-back requests for other formats.

Problem

The current set-up assumes that both RESTBase and Mathoid are installed on the same network as the Math extension, so access to the POST endpoint is restricted to internal IPs only by RESTBase. However, third-party users usually lack these services, which renders the usage of the extension rather pointless. Opening up access would allow:

  • third-party users to use the Math extension and simply point it to use WMF's production RESTBase; and
  • WMF to host a (possibly) comprehensive catalogue of mathematical formulae.

The worrying aspect of this move is security and stability. RESTBase can sustain a much higher request rate than Mathoid, which needs a couple of seconds to serve one request and is hosted on only two machines. Mathoid can therefore be saturated easily, especially when swamped with invalid or erroneous requests.

Solution

In order to keep things up and running, RESTBase would need to limit the requests actually reaching Mathoid. In normal operation, the number of requests naturally decreases over time as RESTBase stores more and more renders. However, we need a way to protect Mathoid against attacks. RESTBase could:

  • Check all of the requests and ensure they conform to the endpoint's specification. If the request doesn't adhere to it, reject it automatically.
  • Limit the size of the request body to 16 kB. No formula to be rendered on-wiki should be longer than that, and we can probably set an even lower limit. If a request is larger than the limit, we assume it is erroneous.
  • Rate-limit the endpoint, possibly as part of T107934: Reliable and scaleable rate limiting mechanism for RESTBase API entry points. The exact number is yet to be determined, but the reasoning is as follows: there are 2*32 = 64 Mathoid processes accepting requests. If each request takes a couple of seconds, the sustainable throughput is roughly 32 requests per second, and anything above that starts to build a backlog. A backlog of 3 to 4 requests per process should be fine (both in terms of memory and processing power), so the rate could be set to 100 req/s.
  • Create a new endpoint in Mathoid which would only check the correctness of a request and return the appropriate status code (2xx, 4xx). The logic would then be slightly more complicated in that when a request comes to RESTBase and its body is not known to it, Mathoid would be called to inspect the request. If it's a legitimate one, the body data would be stored and only then would Mathoid render the formula contained in it.
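
As a rough sketch of how the first three safeguards could fit together in front of Mathoid, consider the following. It is an illustration under stated assumptions rather than a proposal for the actual RESTBase code: the token-bucket algorithm, the `type`/`q` body fields, the set of accepted input types and the chosen status codes are all assumptions. In practice the rate limit would only need to apply to requests that miss RESTBase's storage.

```python
import time

MAX_BODY_BYTES = 16 * 1024              # 16 kB cap proposed above
ALLOWED_TYPES = {"tex", "inline-tex"}   # assumed set of accepted input types

# Back-of-the-envelope capacity from the description.
PROCESSES = 2 * 32                      # Mathoid processes accepting requests
SECONDS_PER_RENDER = 2.0                # "a couple of seconds" per request
STEADY_RATE = PROCESSES / SECONDS_PER_RENDER   # 32 req/s without any backlog


class TokenBucket:
    """Simple token-bucket limiter (an assumed algorithm, not RESTBase's)."""

    def __init__(self, rate, capacity, now=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self._now = now
        self._tokens = capacity
        self._last = now()

    def allow(self):
        current = self._now()
        elapsed = current - self._last
        self._last = current
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


def check_request(raw_body, parsed, bucket):
    """Return an HTTP-like status code for an incoming POST."""
    if len(raw_body) > MAX_BODY_BYTES:
        return 413                  # body too large: assume erroneous
    if not isinstance(parsed, dict):
        return 400
    if parsed.get("type") not in ALLOWED_TYPES:
        return 400                  # does not conform to the (assumed) spec
    q = parsed.get("q")
    if not isinstance(q, str) or not q.strip():
        return 400
    if not bucket.allow():
        return 429                  # over the rate limit
    return 200
```

With `TokenBucket(rate=100, capacity=100)` this corresponds to the 100 req/s figure suggested above; cheap checks run first so that malformed requests never consume rate-limit tokens.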

Discussion

Thoughts and comments are welcome on the following questions:

  1. Should we open up the POST route?
  2. If so, how can we ensure its stability?

Event Timeline

mobrovac raised the priority of this task to Medium.
mobrovac updated the task description.

To make sure I understand this, this proposed change would disable restbase caching, and pass the request straight through to Mathoid?

A global rate limit would be OK, although something more like poolcounter / pinglimiter, rate-limiting per resource and per IP, would be better. RESTBase should probably implement that infrastructure at some point (or we should have a service for it available to the cluster).

Currently the Math extension works as follows:
First, it calls the $renderer->checkTex() function, which checks whether the input is valid.
At the moment, there is only one implementation of MathInputCheck (i.e., MathInputCheckTexvc), which shells out to texvccheck.
I volunteer to contribute to a second implementation of MathInputCheck that uses Restbase. This would currently be limited to the tex and inline-tex input formats, but I think this is no problem for the time being.
The newly created, globally open restbase texinfo POST endpoint would then output the hash that is required to GET the MathML or SVG output.
texvcjs (the service that answers the request in the background) is two orders of magnitude faster than mathjax-node (the service that does the SVG rendering).
Thus, this would alleviate the performance concerns and enhance security and robustness.

To make sure I understand this, this proposed change would disable restbase caching, and pass the request straight through to Mathoid?

Everything would work just as it does now. The change here would be to expose the POST endpoint to external IPs as well; currently only internal IPs are allowed to use the endpoint. Try https://en.wikipedia.org/api/rest_v1/?doc#!/Math/post_media_math_format and you'll get a 403. We'd like to actually perform that request when external clients (i.e. those outside the WMF infra) make it.

Currently the Math extension works as follows:
First, it calls the $renderer->checkTex() function, which checks whether the input is valid.
At the moment, there is only one implementation of MathInputCheck (i.e., MathInputCheckTexvc), which shells out to texvccheck.
I volunteer to contribute to a second implementation of MathInputCheck that uses Restbase. This would currently be limited to the tex and inline-tex input formats, but I think this is no problem for the time being.

That could be a good way forward.

The newly created, globally open restbase texinfo POST endpoint would then output the hash that is required to GET the MathML or SVG output.

I don't understand this. The hash needs to be computed by RESTBase.

I understand that the hash needs to be computed by RESTBase. There would be only one world-writable RESTBase endpoint, for example /media/math/texinfo, which would just return the result of examining the tex input string, together with the hash calculated by RESTBase in a header.
This hash can thereafter be used to get the MathML and SVG payload without having to issue POST requests again.
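
A minimal sketch of that check-first idea, assuming the render is produced lazily on the first GET (the thread does not say when rendering happens). The class and method names, the SHA-256 hash, and `toy_check` (a brace-balance test standing in for texvcjs, which does far more) are all hypothetical.

```python
import hashlib


def toy_check(tex):
    """Placeholder validator: non-empty input with balanced braces only."""
    depth = 0
    for ch in tex:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0 and bool(tex.strip())


class CheckFirstService:
    """Toy model: the public POST only validates; rendering happens on GET."""

    def __init__(self, check, render):
        self._check = check      # fast validator (texvcjs in the discussion)
        self._render = render    # slow renderer returning {format: output}
        self._validated = {}     # hash -> validated tex input
        self._rendered = {}      # hash -> all rendered formats

    def post_check(self, tex):
        if not self._check(tex):
            return 400, {}       # invalid input never reaches the renderer
        key = hashlib.sha256(tex.encode("utf-8")).hexdigest()
        self._validated[key] = tex
        return 200, {"X-Resource-Location": key}

    def get(self, fmt, key):
        if key not in self._validated:
            return 404, None     # unknown hash: nothing was validated for it
        if key not in self._rendered:
            self._rendered[key] = self._render(self._validated[key])
        return 200, self._rendered[key][fmt]
```

The point of the design is that the only publicly writable route runs the cheap check, so the expensive renderer only ever sees input that has already passed validation.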

Ah, I see where I misunderstood your initial post. Makes sense.

So yes, as Physikerwelt noted, the tex needs to be validated beyond just length and endpoint specification. Otherwise someone can give us tex that executes arbitrary shell commands.

If /media/math/texinfo should be accessible from the outside, we

  1. have to enable it in production https://gerrit.wikimedia.org/r/#/c/252429/
  2. expose texvcinfo to restbase

These have been addressed, so I created PR #490, which opens up the POST /media/math/check endpoint to the public. It is safe to do so because it only allows clients to check the input TeX formula; it performs no rendering.

mobrovac claimed this task.

Deployed, so closing this one.