
Batch Parsoid's API requests
Closed, Resolved · Public · 0 Estimated Story Points

Description

We should amortize per-request overheads by batching API requests.

AFAIK there is no generic batching support in the API currently, and adding it would probably be a bit too time-consuming for now. Instead we can add a unique string as a separator, for example something like

<nowiki>d41d8cd98f00b204e9800998ecf8427e</nowiki>

This would work for action=expandtemplates and action=parse, which are the main workarounds we currently use. In the longer term we should switch to explicit methods that don't involve parsing wikitext, which would probably also be a good moment to add real batching support.
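As a rough illustration of the separator idea (a minimal sketch; `callExpandTemplates` is a hypothetical stand-in for the client's HTTP helper, not an existing Parsoid or MediaWiki function):

```typescript
// Minimal sketch (not actual Parsoid code): batch several wikitext snippets
// into one action=expandtemplates call by joining them with a unique
// separator, then split the combined result back apart.
const SEPARATOR = 'd41d8cd98f00b204e9800998ecf8427e';

declare function callExpandTemplates(wikitext: string): Promise<string>;

async function expandBatch(snippets: string[]): Promise<string[]> {
  const combined = snippets.join('\n' + SEPARATOR + '\n');
  const expanded = await callExpandTemplates(combined);
  // Assumes the separator string passes through template expansion unchanged.
  return expanded.split(SEPARATOR).map(part => part.trim());
}
```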

Decisions about batch sizes could initially be based on wikitext source size (on the assumption that templates with a bazillion parameters also take longer to expand). A fixed number of templates per batch would be another simple alternative. Really fancy batching could use stats of previous processing times (returned by the PHP preprocessor per transclusion and stored in the HTML).
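A source-size heuristic could be as simple as the following sketch (the byte and item limits are made-up tuning knobs, not values from this task):

```typescript
// Sketch of a simple flush heuristic based on accumulated wikitext size.
interface PendingItem { wikitext: string; }

const MAX_BATCH_BYTES = 64 * 1024;  // assumed tuning knob
const MAX_BATCH_ITEMS = 50;         // assumed tuning knob

function shouldDispatch(batch: PendingItem[]): boolean {
  const bytes = batch.reduce((sum, item) => sum + item.wikitext.length, 0);
  return bytes >= MAX_BATCH_BYTES || batch.length >= MAX_BATCH_ITEMS;
}
```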

Ideally we would also enforce separation between batched requests to avoid non-deterministic behavior for stateful extensions.


Version: unspecified
Severity: enhancement

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 1:37 AM
bzimport set Reference to bz43888.

action=query was built specifically to support batching, but other actions might choose not to implement it. Could you give a very specific example of:

  • What the final API request should look like?
  • What you expect in return?

For us, the main use cases are template expansion using the PHP preprocessor and the expansion of extension tags. Currently, both of these are only available with a detour through the PHP parser (action=expandtemplates and action=parse respectively), so we'd like to add direct API end points to perform these actions.

A general batching mechanism would probably accept a posted JSON array, with each member being an object containing the API arguments for one action (action, etc.). The return value would be a JSON array with results corresponding to the request array. Instead of relying on positions in an array, the client could also supply UIDs for requests, which would then be mirrored in the result. This could be achieved with a reqid member in each request object.
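In TypeScript terms, the request and response shapes described here might look roughly like this (only action and reqid come from the comment; everything else is illustrative):

```typescript
// Illustrative shapes for the hypothetical general batching mechanism.
interface BatchRequestItem {
  reqid: string;                    // client-chosen UID, mirrored in the result
  action: string;                   // e.g. 'expandtemplates' or 'parse'
  params: Record<string, string>;   // remaining API arguments for that action
}

interface BatchResponseItem {
  reqid: string;                    // mirrors the request's reqid
  result: unknown;                  // the action's usual result object
}

// Example POST body:
const exampleBatch: BatchRequestItem[] = [
  { reqid: 'r1', action: 'expandtemplates', params: { text: '{{Foo|bar}}' } },
  { reqid: 'r2', action: 'parse', params: { text: '<ref>baz</ref>' } },
];
```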

The problem with general batching is that state in MediaWiki is not very well defined. Several extensions, for example, keep internal state between calls to tag hooks, and resetting that state between individual requests in a batch would probably produce an overhead similar to the regular per-request setup (30+ms).

In Parsoid, we can work around this issue for our specific use cases by maintaining a whitelist of stateless extensions which can be mixed in a batch without producing non-deterministic output. Non-whitelisted extensions would need to be expanded in a single action=parse request containing all extension tags for that particular extension in page order.
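As a rough sketch of that grouping rule (the whitelist entries and all names here are illustrative, not Parsoid's actual configuration):

```typescript
// Rough sketch of the grouping rule described above.
interface ExtTag { extension: string; wikitext: string; pageOrder: number; }

const STATELESS_EXTENSIONS = new Set(['extA', 'extB']); // hypothetical whitelist

function groupForExpansion(tags: ExtTag[]) {
  const mixable: ExtTag[] = [];                      // safe to mix in one batch
  const perExtension = new Map<string, ExtTag[]>();  // one action=parse each
  for (const tag of tags) {
    if (STATELESS_EXTENSIONS.has(tag.extension)) {
      mixable.push(tag);
    } else {
      const group = perExtension.get(tag.extension) ?? [];
      group.push(tag);
      perExtension.set(tag.extension, group);
    }
  }
  // Non-whitelisted extensions keep page order within their group.
  for (const group of perExtension.values()) {
    group.sort((a, b) => a.pageOrder - b.pageOrder);
  }
  return { mixable, perExtension };
}
```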

Adding custom batching support just for template expansion and extension tag calls would really be a workaround, and would add complexity to both clients and the server. If the performance of API requests could be improved in general, we could and should probably avoid that complex workaround in favor of the general solution.

In the longer term, our HTML storage and incremental re-parsing / expansion plans should reduce the number of template expansions and extension calls to a fraction of the level we'll produce initially. The round-trip testing we are currently performing is already close to the edit rate of the English Wikipedia and did not create any issues on the API cluster. A Parsoid installation tracking all edits on all wikis would probably create 2-3 times the number of API requests, which should still be doable for the current API cluster.

Overall it seems to be a good idea to hold off on a custom batching solution until it is clear that it is really needed and general API performance improvements are not possible.

The current performance numbers, combined with several optimizations aimed at avoiding API requests, have lessened the need for API request batching on our end.

Speeding up individual API requests is a general architectural goal, and should be addressed in that context. Closing this bug as WONTFIX on our end.

tstarling subscribed.

Request batching appears to be simple to implement (~100 lines according to @GWicke) and would cause a substantial decrease in API cluster CPU load. If it were switched on only for requests initiated by the job queue, there would be no effect on user latency.

> The problem with general batching is that state in MediaWiki is not very well defined. Several extensions, for example, keep internal state between calls to tag hooks, and resetting that state between individual requests in a batch would probably produce an overhead similar to the regular per-request setup (30+ms).

No, parser state in MediaWiki is very well-defined: it is the data which is cleared by Parser::clearState(). Extensions which keep internal state hook this function in order to clear their state. MediaWiki is reasonably well-optimised for clearing the parser state, since message parsing does so, often tens of times per request. The overhead is orders of magnitude smaller than the request setup time.

> The current performance numbers, combined with several optimizations aimed at avoiding API requests, have lessened the need for API request batching on our end.

Parsoid is still by far the largest API consumer by CPU time.

> Speeding up individual API requests is a general architectural goal, and should be addressed in that context. Closing this bug as WONTFIX on our end.

I am familiar with MW setup time optimisation, and compared to that large, ongoing effort, Parsoid request batching looks like low-hanging fruit.

The two main implementation options are a batcher internal to Parsoid, and a shared node.js batcher service. The internal batcher would operate roughly as follows (a rough sketch follows the list):

  • Instead of having the token handlers directly dispatch HTTP requests, they would send their requests to a batcher service, which would be local to the current parse operation. The batcher API would be slightly different to the HTTP API, since it needs to be able to easily aggregate titles, and the batcher will take over responsibility for HTTP protocol details such as timeouts.
  • If the batch is full, the batched request can be dispatched on the next attempted queue insertion.
  • If Parsoid returns to the event loop with requests queued in the batch but not sent, the request could permanently hang. So in relevant code paths, prior to returning to the event loop, Parsoid needs to notify the batcher, giving it a last chance to check whether any batch requests are queued so that it can dispatch them.
  • Additionally, a timeout would be prudent, allowing Parsoid to recover from accidental return to the event loop without notification.
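A minimal sketch of such an internal batcher (every name is hypothetical; a real implementation would hook into Parsoid's token handlers and HTTP layer):

```typescript
type ApiRequest = { action: string; params: Record<string, string> };
type ApiResponse = unknown;

// Stand-in for whatever issues the actual batched HTTP request.
declare function sendBatchToApi(batch: ApiRequest[]): Promise<ApiResponse[]>;

class RequestBatcher {
  private queue: { req: ApiRequest; resolve: (r: ApiResponse) => void }[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(private maxBatchSize = 50, private maxWaitMs = 1000) {}

  // Token handlers call this instead of dispatching HTTP requests directly.
  request(req: ApiRequest): Promise<ApiResponse> {
    return new Promise(resolve => {
      this.queue.push({ req, resolve });
      if (this.queue.length >= this.maxBatchSize) {
        void this.dispatch();          // batch full: send on this insertion
      } else if (!this.timer) {
        // Safety-net timeout in case the idle notification never arrives.
        this.timer = setTimeout(() => void this.dispatch(), this.maxWaitMs);
      }
    });
  }

  // Called by Parsoid just before returning to the event loop ("idle"
  // notification), so queued-but-unsent requests cannot hang forever.
  notifyIdle(): void {
    if (this.queue.length > 0) void this.dispatch();
  }

  private async dispatch(): Promise<void> {
    if (this.timer) { clearTimeout(this.timer); this.timer = null; }
    const pending = this.queue.splice(0);
    if (pending.length === 0) return;
    const results = await sendBatchToApi(pending.map(p => p.req));
    pending.forEach((p, i) => p.resolve(results[i]));
  }
}
```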

Gabriel suggested an external batcher which, if I understand his proposal correctly, would operate as follows (see the sketch after this list):

  • One cluster-wide batch queue per (wiki, action) tuple.
  • Requests to the proxy would be held in the batch queue until either it fills, or some specified timeout (e.g. 10ms) is reached. Then the batch query would be sent to the MW backend.
  • While a request is waiting in the batch queue, and while it is waiting for the backend response, the client connections, one per item, would be held open. Responses would be sent to these clients after the batch query returns from the backend.
  • This architecture avoids the need for significant changes inside Parsoid.
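A comparably minimal sketch of the proxy-side variant, keyed by (wiki, action) with a fill-or-timeout dispatch (limits and names are illustrative):

```typescript
type Held = { params: Record<string, string>; respond: (result: unknown) => void };

// Stand-in for the batched query to the MW backend.
declare function queryBackend(
  wiki: string, action: string, items: Record<string, string>[]
): Promise<unknown[]>;

const queues = new Map<string, Held[]>();   // key: `${wiki}:${action}`
const MAX_BATCH = 50;                       // assumed batch size limit
const FLUSH_TIMEOUT_MS = 10;                // the 10ms example from above

function enqueue(wiki: string, action: string, item: Held): void {
  const key = `${wiki}:${action}`;
  const queue = queues.get(key) ?? [];
  queue.push(item);
  queues.set(key, queue);
  if (queue.length >= MAX_BATCH) {
    void flush(wiki, action);
  } else if (queue.length === 1) {
    // First item in an empty queue starts the flush timer.
    setTimeout(() => void flush(wiki, action), FLUSH_TIMEOUT_MS);
  }
}

async function flush(wiki: string, action: string): Promise<void> {
  const key = `${wiki}:${action}`;
  const held = queues.get(key) ?? [];
  if (held.length === 0) return;
  queues.set(key, []);
  const results = await queryBackend(wiki, action, held.map(h => h.params));
  // Each held client connection receives its own item of the batch result.
  held.forEach((h, i) => h.respond(results[i]));
}
```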

The main reason I am leaning towards the internal batcher is because some of the queues will have very little traffic on them, e.g. queues for small wikis. This makes for some uncomfortable tradeoffs when you try to set the dispatch timeout. Consider that parse operations are observed to take ~10s and issue ~100 API requests -- this implies a mean interval between requests on the order of 100ms, and a timeout of at least that much if there is to be any throughput benefit from batching at all. That is quite a long time to wait. From a throughput perspective, the optimal dispatch timeout may be more than 1s.

Explicit notification of the idle event avoids severe tradeoffs between batch size and latency. It can be implemented in either the internal or external design. But the need for idle event notification would reduce the attractiveness of the external batcher -- it would make the service stateful and its advantages in simplicity on the Parsoid side would be reduced. There is a risk that idle events could be lost anyway, and a timeout incurred, due to exhaustion of the client connection limit (globalAgent.maxSockets).

Also, note that Parsoid currently performs API requests with keep-alive disabled. With default socket options, this implies a limit of 470 API requests per second per client server before local port exhaustion occurs, as in T75949. Batching internally, without the involvement of TCP, avoids this limit.
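(For reference, the 470/s figure presumably follows from the default Linux ephemeral port range of 32768-61000, about 28,232 ports, divided by the 60-second TIME_WAIT period: 28232 / 60 ≈ 470 new outgoing connections per second to a given destination.)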

ssastry set Security to None.
ssastry moved this task from Needs Triage to Performance on the Parsoid board.

> The main reason I am leaning towards the internal batcher is because some of the queues will have very little traffic on them, e.g. queues for small wikis. This makes for some uncomfortable tradeoffs when you try to set the dispatch timeout. Consider that parse operations are observed to take ~10s and issue ~100 API requests -- this implies a mean interval between requests on the order of 100ms, and a timeout of at least that much if there is to be any throughput benefit from batching at all. That is quite a long time to wait. From a throughput perspective, the optimal dispatch timeout may be more than 1s.

The amount of batching we can achieve on the long tail is bounded by the maximum latency penalty we are willing to introduce. A global batcher can combine requests from different API clients (not just parsoid), and should thus result in lower average latencies by filling up batches quickly (or reaching a given batching ratio with a lower maximum latency penalty).

Also keep in mind that with selective re-expansion of templates those requests will become less common, which will make the single-client batch even less effective, as there are simply fewer requests to combine within the time window.

> Also, note that Parsoid currently performs API requests with keep-alive disabled. With default socket options, this implies a limit of 470 API requests per second per client server before local port exhaustion occurs, as in T75949. Batching internally, without the involvement of TCP, avoids this limit.

The number of connections has not been an issue in practice & would not change. On the proxy we could also enable SPDY support, which reduces the number of connections to one per client & removes some per-request overhead. That said, I think the Connection: close bit can be removed. It was a work-around from an early node version IIRC, and is no longer needed.

Jdforrester-WMF renamed this task from Batch API requests to Batch Parsoid's API requests.Feb 10 2015, 11:25 PM

It is a good thing to reduce the load on the API cluster, but I don't think this task should be a VE Q3 blocker.

Right now,
(a) Parsoid tracks all edits -- so there is going to be no change in Parsoid parsing load (and hence API load) even if VE is enabled everywhere (assuming edit rate stays the same).
(b) The current load on the API cluster is not so problematic that it forces us to work on this for the VE deployment.

So, I suggest that this task be removed from the VisualEditor Q3 Blockers workboard, since there are other higher-priority issues to fix in Parsoid land.

We should also keep in mind the effect of batching on overall system reliability. In T97204: RFC: Request timeouts and retries I advocate for tight timeouts in backend services, combined with no client retries when encountering a server timeout. In general, batching makes it harder to establish tight timeouts on end points. We should thus limit it to end points where each batch action has a very predictable processing time, and the overall batch will always complete within a reasonable timeout (< 60s).

I've written a client-side "batcher" class hierarchy, and a MediaWiki extension which does server-side batches of mixed parse and preprocess operations. The impact on both client and server CPU time is substantial, as expected from the discussion on T88915. For [[List of 20th-century classical composers]] with a batch size of 50, node CPU time went from 54s to 29s, and HHVM CPU time went from 216s to 70s.

The impact on latency can't easily be measured in my current test setup.

Unfortunately, it was necessary to treat each item in the batch as a distinct parse operation, with clearState(), since Parsoid wishes to know which categories are generated by which templates. This causes a number of useful caches to be discarded at the start of each batch item. This may be the reason why the HHVM CPU time is still 70s, even though running the text through expandtemplates as a whole takes only 10s. Increasing the batch size from 50 to 500 only reduced the HHVM CPU time to 65s.

If clearState() was not done, it would not be easy to determine which template caused the addition of a given category. Consider the case of a template which is invoked with no arguments, which causes a tracking category to be added. If this template is invoked more than once in a parse operation, the expansion cache will prevent ParserOutput::addCategory() from being called on the second and subsequent invocations. This implies that the expansion cache would need to be disabled or rewritten to support this Parsoid use case.

Please comment on the idea of doing batch parsing in an extension instead of in a core API module. I'd like to know if it sounds like a good idea before I create the Gerrit repository.

The reasons for doing this were:

  • Simpler design, due to the smaller number of clients/stakeholders. Code size is only 135 physical lines, with no internal API calls, compared to 800 for action=parse and 218 for action=expandtemplates.
  • Allows the action=parse and action=expandtemplates modules to be easily combined. No need to break my head trying to think about generic Lua frontend batch processing APIs which allow arbitrary actions to be combined.
  • An API for parsoid use only can easily be policed for misuse or DoS, e.g. with IP-based access control. No need for restrictive server-side batch size limits which may interfere with optimal performance.
  • Expedient output format, backwards-compatible with action=parse and action=expandtemplates, which allows the client to share code with the fallback case.
  • Forking the API provides an easy path towards more aggressive integration with the MW parser, for example output annotation hooks.

The proposed extension name is ParsoidBatchAPI.

No response, so I committed the MW extension to https://gerrit.wikimedia.org/r/#/c/226670/

Still ironing out a couple of issues on the client side. A changeset will appear soon.

@tstarling +1 from me. Your rationale covers all the things I would have brought up (and more). Sorry for being slow, @ssastry is vacationing and I'm not sure if @cscott's been following this thread.

Sounds good to me, since it looks like it keeps implementation complexity down by allowing the code to be customized to Parsoid's needs. But once this is all ironed out and live, we should update the docs for 3rd-party Parsoid installs about this requirement (unless batching can be made optional on the Parsoid end).

Change 227208 had a related patch set uploaded (by Tim Starling):
Batch MW parser and imageinfo API requests

https://gerrit.wikimedia.org/r/227208

Change 227208 merged by jenkins-bot:
Batch MW parser and imageinfo API requests

https://gerrit.wikimedia.org/r/227208

Batching was fully enabled on Sept 29th, around 16:14 UTC. 12-hour median latencies as seen by RESTBase over the last few days (smoothed version of this graph):

P99:

[latency graph attachment]

Mean:

[latency graph attachment]

Smoothed parse request rate:

[request rate graph attachment]

You need to account for the spikiness in the load too. See http://grafana.wikimedia.org/#/dashboard/db/parsoid-timing-wt2html (zoom out to a 7-day window and look at how the request rate was much closer to 200 reqs/sec over the last few days, compared to < 150 reqs/sec earlier in the week, before batching was enabled). I am fairly confident that parsing latencies haven't been affected adversely, based on monitoring over the last couple of days. But I think we have to wait and watch this over the next couple of weeks before we have a truer picture.

On the other hand, if you look at the Ganglia graphs for the API cluster, you do see a reduction in network traffic in/out of it (despite the higher parse load on the Parsoid end), plus a downward trend in the load there.

@ssastry, good point re load. Added the parse request rate for completeness. Also agreed re needing more data.

Arlolra assigned this task to tstarling.
Arlolra removed a project: Patch-For-Review.