Incrementally parse and convert API response as data comes in
Open, Medium, Public

Description

XMLHttpRequest2 lets you read partial response data in progress event handlers. Libraries like Oboe.js implement incremental JSON decoding to make it possible to process the data from an AJAX request as it comes in. We should investigate the possibility of building the linear DOM incrementally as data comes in, rather than blocking on request completion.
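
A minimal sketch of the underlying mechanism, reading partial data in a progress handler (the API URL and the processChunk consumer are illustrative, not existing VE code):

    var xhr = new XMLHttpRequest();
    var seen = 0;
    xhr.open( 'GET', '/api.php?action=visualeditor&format=json', true );
    xhr.onprogress = function () {
        // responseText grows as data arrives; hand off only the new part
        var chunk = xhr.responseText.slice( seen );
        seen = xhr.responseText.length;
        processChunk( chunk ); // hypothetical incremental consumer
    };
    xhr.onload = function () {
        processChunk( xhr.responseText.slice( seen ) );
    };
    xhr.send();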

Event Timeline

ori raised the priority of this task to Needs Triage.
ori updated the task description.
ori added a project: VisualEditor-Performance.
ori added subscribers: ori, Catrope.
Restricted Application added a subscriber: Aklapper.

Oboe.js doesn't appear to support incremental decoding within a single string value in the JSON it's decoding. That would be required, because the bulk of the JSON response is a single string containing the Parsoid HTML. Additionally, we would have to be able to incrementally parse that HTML string into an HTML DOM, which also sounds tricky (and like something we couldn't easily use the browser's own HTML parser for).
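
For illustration, this is roughly how Oboe.js would be consumed here (the visualeditor.content path is an assumption about the response shape), and it demonstrates the limitation: the callback only fires once the whole string value has been decoded.

    oboe( '/api.php?action=visualeditor&format=json' )
        .node( 'visualeditor.content', function ( html ) {
            // Fires only after the entire HTML string has arrived,
            // so nothing is gained for one big string value
            renderContent( html ); // hypothetical consumer
        } );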

If you have an HTML DOM that is being built incrementally, then incrementally building the linear model (linmod) from it is not that hard in theory, since it's all a single pass in document order. The exception is the new data-mw.id thing for references, which is a forward reference (but that could probably be solved, especially with my backburnered DocumentSet work).
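
A hedged sketch of that single pass (this is only the shape of the idea, not VE's actual linear model format):

    // Walk the DOM in document order, emitting open/close markers and text.
    function toLinearData( node, out ) {
        out = out || [];
        if ( node.nodeType === Node.TEXT_NODE ) {
            out.push.apply( out, node.data.split( '' ) );
        } else if ( node.nodeType === Node.ELEMENT_NODE ) {
            out.push( { type: node.nodeName.toLowerCase() } );
            for ( var i = 0; i < node.childNodes.length; i++ ) {
                toLinearData( node.childNodes[ i ], out );
            }
            out.push( { type: '/' + node.nodeName.toLowerCase() } );
        }
        return out;
    }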

Alternatively, if we decided to go this route, we could have ApiVisualEditor.php transform the HTML string into a DOM-like JSON structure (using an XML parser) and stream that through Oboe.js. I wonder if that would be worth it, since Clarinet (which Oboe.js is based on) claims to be ~10x slower than JSON.parse(). But I suppose that for small pages we might not notice, and for large pages the idle time we spend waiting on the network might be significant.
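
For example, such a structure could encode each element as a nested object, so that Oboe.js can emit subtrees as soon as they complete (the domjson parameter and the JSON shape are hypothetical):

    // Hypothetical shape: { tag: 'p', attrs: { ... }, children: [ ... ] }
    oboe( '/api.php?action=visualeditor&format=json&domjson=1' )
        .node( '!.children.*', function ( domNode ) {
            // Fires once per completed top-level child of the body
            appendToModel( domNode ); // hypothetical consumer
        } );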

Incrementally parsing HTML contents from a string inside JSON seems impractical indeed. I think we should at least fetch the HTML from its own URL if we're going this route. (Or build our own model that e.g. starts with a bit of JSON, followed by a line break and the HTML.)
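
A sketch of that framing, assuming the JSON metadata fits on the first line; in a streaming setting the head would be parsed as soon as the newline arrives and everything after it fed to the HTML parser incrementally:

    // Split at the first newline: JSON metadata head, raw HTML tail.
    function splitResponse( text ) {
        var nl = text.indexOf( '\n' );
        return {
            meta: JSON.parse( text.slice( 0, nl ) ),
            html: text.slice( nl + 1 )
        };
    }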

We could use SAX with an HTML parser instead of JSON (if that exists).

There do exist popular SAX-style XML parsers. We may be able to use one of those, since the Parsoid HTML is XML-compatible, right?

I boldly asked in #whatwg about exposing the streaming parser browsers already have for the main document request. While this seems unlikely to happen (and it is, for other reasons), it shouldn't be that complicated, given that the parsing logic is specified in great detail, including the gradual building of the DOM.

Anyhow, a few people there also recommended SAX. One package in particular came up: https://github.com/inikulin/parse5
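
For reference, parse5's SAX-style parser is a Node stream, so HTML events fire as data is piped in (the exact API depends on the parse5 version; parsoidHtmlUrl is illustrative):

    var http = require( 'http' );
    var parse5 = require( 'parse5' );
    var parser = new parse5.SAXParser();
    parser.on( 'startTag', function ( name, attrs ) {
        // open an element in the incrementally built model
    } );
    parser.on( 'text', function ( text ) {
        // append a text node
    } );
    parser.on( 'endTag', function ( name ) {
        // close the current element
    } );
    http.get( parsoidHtmlUrl, function ( res ) {
        res.pipe( parser );
    } );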

> Incrementally parsing HTML contents from a string inside JSON seems impractical indeed. I think we should at least fetch the HTML from its own URL if we're going this route.

Ori and I decided we'd do exactly that yesterday. I haven't filed a task for that yet, but I will.

> We could use SAX with an HTML parser instead of JSON (if that exists).
>
> There do exist popular SAX-style XML parsers. We may be able to use one of those, since the Parsoid HTML is XML-compatible, right?

Yes, Parsoid HTML is XML-compatible, and we already use an XML parser to work around normalization bugs in IE's HTML parser (but only in IE). Even there, we have to serialize back to HTML and re-parse with an HTML parser: while the string we receive is valid as both XML and HTML, the two formats have different parsing rules, the resulting DOMs behave differently, and we need an HTML DOM in the end. However, you wouldn't need the fancy HTML5 parsing algorithm: you could potentially use SAX but emit HTML nodes instead of XML nodes, without worrying that the resulting parser isn't technically HTML-compliant because it can't parse non-XML-compatible tag soup.
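
A hedged sketch of that approach, using the sax-js parser in strict mode as a stand-in for any SAX-style XML parser, but creating HTML DOM nodes from its events:

    // Feed chunks into the parser as they arrive; elements are created
    // with document.createElement, so the result is a real HTML DOM.
    var parser = sax.parser( true );
    var current = document.createDocumentFragment();
    parser.onopentag = function ( tag ) {
        var el = document.createElement( tag.name );
        for ( var attr in tag.attributes ) {
            el.setAttribute( attr, tag.attributes[ attr ] );
        }
        current.appendChild( el );
        current = el;
    };
    parser.ontext = function ( text ) {
        current.appendChild( document.createTextNode( text ) );
    };
    parser.onclosetag = function () {
        current = current.parentNode;
    };
    // Call parser.write( chunk ) for each chunk, then parser.close().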

There are two reasons, though, why I'm not convinced we want to do this just yet. One is that, currently, the request time is dominated by TTFB (time to first byte), and the actual data transfer is fast. We would have to see how this changes when we request the HTML separately and directly from RESTBase. If transmission is the dominant factor and TTFB is very low, a streaming parser could make sense, but if TTFB dominates, there isn't much of a point. The other is that not using the browser's native HTML parser makes me uneasy.
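
If we want to measure which factor dominates, the Resource Timing API gives both numbers (restbaseHtmlUrl is illustrative):

    // TTFB vs. transfer time for a resource fetched by the page.
    var entry = performance.getEntriesByName( restbaseHtmlUrl )[ 0 ];
    var ttfb = entry.responseStart - entry.requestStart;
    var transfer = entry.responseEnd - entry.responseStart;
    console.log( 'TTFB:', ttfb, 'ms; transfer:', transfer, 'ms' );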

> I boldly asked in #whatwg about exposing the streaming parser browsers already have for the main document request. While this seems unlikely to happen (and it is, for other reasons), it shouldn't be that complicated, given that the parsing logic is specified in great detail, including the gradual building of the DOM.

That would be awesome, and I think it's pretty clear that it would be the best outcome for everyone (as in, the web) in the long run.

Jdforrester-WMF set Security to None.