
Evaluate and recommend strategies for ensuring Parsoid HTML payload doesn't degrade performance in low-resource contexts.
Open, Medium, Public

Description

For Parsoid read views, we need to evaluate performance on a few different axes. One of them is HTML size. The Parsing, CPT (now known as PET), and Performance teams had a meeting in July 2020; the notes from that conversation are captured at https://www.mediawiki.org/wiki/Parsing/Parser_Unification/Performance#HTML_output_size.

This phab task is to follow up on that conversation.

A priori, we know that Parsoid HTML is going to be bigger than the core parser HTML that is currently being served. Parsoid HTML carries a bunch of additional information relative to the core parser HTML (see https://www.mediawiki.org/wiki/Specs/HTML for Parsoid's output HTML spec; a size-accounting sketch follows the list below):

  1. data-parsoid attribute -- currently stripped and stored separately in storage / cache and not shipped to clients
  2. data-mw attribute -- currently shipped to all clients, but the plan is to strip it from the HTML and store it separately (and for editing clients to fetch it out-of-band on demand). See T78676: Store & load data-mw separately.
  3. rel="mw:.." attributes on links
  4. typeof=".." attributes on template, extension, and media output
  5. id=".." attribute on all DOM nodes (for indexing into a JSON blob that stores data-parsoid & data-mw attributes offline)
  6. <section> wrappers
  7. marker nodes for various rendering-transparent content in wikitext (interlanguage links, category links, magic word directives, etc.)
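
To make the size question concrete, here is a minimal size-accounting sketch in Python. The fragment, IDs, and regexes below are invented for illustration (real Parsoid output is specified at https://www.mediawiki.org/wiki/Specs/HTML); it tallies the bytes the markers above contribute, raw and gzipped:

```python
import gzip
import re

# Hypothetical fragment loosely modeled on the markers listed above;
# real Parsoid output differs.
FRAGMENT = (
    '<section data-mw-section-id="1" id="mwAQ">'
    '<p id="mwAg">Born in <a rel="mw:WikiLink" href="./Hawaii" id="mwAw">'
    'Hawaii</a>.</p>'
    '<span typeof="mw:Transclusion" data-mw=\'{"parts":[]}\' id="mwBA">'
    '1961</span>'
    '<link rel="mw:PageProp/Category" href="./Category:People" id="mwBQ"/>'
    '</section>'
)

# Rough regex accounting of the bytes attributable to each marker category.
patterns = {
    'data-mw': r"data-mw='[^']*'",
    'typeof': r'typeof="[^"]*"',
    'rel="mw:.."': r'rel="mw:[^"]*"',
    'id="mw.."': r'id="mw[^"]*"',
}
for name, pat in patterns.items():
    nbytes = sum(len(m.encode()) for m in re.findall(pat, FRAGMENT))
    print(f'{name:12} {nbytes:4d} bytes')

# Compare raw vs. gzipped size, with and without the markers.
stripped = FRAGMENT
for pat in patterns.values():
    stripped = re.sub(r'\s+' + pat, '', stripped)
for label, html in (('raw', FRAGMENT), ('stripped', stripped)):
    data = html.encode()
    print(f'{label:9} {len(data):4d} bytes raw, '
          f'{len(gzip.compress(data)):4d} bytes gzipped')
```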

We know from usage that editing and various other clients rely on some or all of this information. Parsoid's value proposition is precisely this: semantic information that insulates all MediaWiki content consumers from having to know much about wikitext parsing in order to understand and manipulate wiki content while respecting wikitext semantics.

But, as the above list makes clear, that value for various clients comes at a potential HTML size cost.

This task is to evaluate the performance implications of this with respect to read views. I imagine some or all of the following would be needed:

  • Identify a suitable benchmark set of pages across production wikis (or a repeatable methodology for building such a set, if we want to periodically reassess decisions made now).
  • Given the above benchmark set, assess the HTML size impact of the extra information Parsoid adds for all items in the above list except items 1 & 2. There is value in doing this analysis for items 1 & 2 as well for purposes of HTML storage / cache costs, but it is otherwise tangential to the primary focus of this task. The tests should be done so that we can evaluate the impact on both raw and gzipped (or whatever compression scheme is used for network payloads) output; a measurement sketch follows this list.
  • Come up with anticipated performance impacts given the results from the performance tests above.
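
A minimal measurement harness along these lines might look like the sketch below. The title list is a placeholder, and the REST endpoint shown is (to my knowledge) the public Wikimedia endpoint that currently serves Parsoid HTML; verify before depending on it:

```python
import gzip
import json
import urllib.request

# Placeholder benchmark set; a real set would be sampled across
# production wikis by size and complexity, per the bullet above.
TITLES = ['Earth', 'Chemistry', 'Barack_Obama']

# Public REST endpoint believed to serve Parsoid HTML; swap in the
# core parser's output for side-by-side comparison.
ENDPOINT = 'https://en.wikipedia.org/api/rest_v1/page/html/{title}'

results = []
for title in TITLES:
    req = urllib.request.Request(
        ENDPOINT.format(title=title),
        headers={'User-Agent': 'parsoid-size-benchmark/0.1 (example)'},
    )
    with urllib.request.urlopen(req) as resp:
        html = resp.read()
    results.append({
        'title': title,
        'raw_bytes': len(html),
        'gzip_bytes': len(gzip.compress(html)),
    })

print(json.dumps(results, indent=2))
```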

Based on these results, we will probably have to identify what mitigation actions are needed, if any. This next step is likely going to be an evaluation of performance tradeoffs. As observed earlier, given its utility to various downstream Wikimedia content consumers, Parsoid's raw HTML (what we emit today) is unlikely to change significantly, and the API endpoints might continue to serve that unmodified HTML as well. Given this, Parsoid's raw output would need to be cached (for performance reasons).

So, given the above, the performance tradeoffs likely come down to trading CPU time against storage:

  • Strategy 1: Ship Parsoid's raw HTML as-is without any stripping: This has an impact on network transfer times; the specific numbers depend on the evaluation results from earlier.
  • Strategy 2: Post-process Parsoid's raw HTML before shipping: This would be an HTML2HTML transform (like mobile-html, mobile-sections, language-variants, etc.). If we do this, we can post-process as aggressively as we want. If we go this route, we will then require all other semantic content consumers to not use the read-view HTML but fetch it separately via API requests. There is a tradeoff here around total network transfer across all Parsoid HTML requests (not just read views). In addition, there are two specific sub-strategies available here (a stripping sketch follows this list):
    • Strategy 2a: Cache the post-processed HTML in ParserCache: This has a storage cost but enables fast read views. Note that Varnish / CDN benefits are on top of this.
    • Strategy 2b: Post-process the HTML on demand for all requests: This cuts storage costs in half but adds load on the servers that post-process Parsoid HTML. Note that we still benefit from Varnish / CDN caching.
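
As a rough illustration of the Strategy 2 transform, here is a sketch that strips Parsoid-specific attributes before shipping. The attribute list is my guess from the markers enumerated in the description, not the actual production transform:

```python
import re

# Guessed attribute list based on the task description; a production
# transform would also drop marker nodes and section wrappers.
STRIP_ATTRS = re.compile(
    r'\s+(?:data-parsoid|data-mw|typeof|about)=("[^"]*"|\'[^\']*\')'
    r'|\s+rel="mw:[^"]*"'
    r'|\s+id="mw[^"]*"'
)

def strip_for_read_view(html: str) -> str:
    """Crude HTML2HTML pass: drop Parsoid-specific attributes.

    A real transform would walk the DOM rather than regex over
    serialized HTML.
    """
    return STRIP_ATTRS.sub('', html)

print(strip_for_read_view(
    '<p id="mwAg" about="#mwt1" typeof="mw:Transclusion" '
    "data-mw='{\"parts\":[]}'>Hi</p>"
))  # -> <p>Hi</p>
```

Whether this runs on demand (2b) or once per parse with the result cached (2a) is exactly the CPU-vs-storage tradeoff above.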

Strategy 1 would be the lowest-cost approach from the Wikimedia infrastructure point of view, but it can penalize all downstream consumers that pay for bandwidth / data costs.

So, the end result of this performance evaluation would be a set of recommendations for the best way to serve read views that meets a set of performance goals (something to be established in consultation with different stakeholders).

Related performance tasks / discussions:

Event Timeline

I think the performance tests will need to use HTML from real articles, with properly rendered templates, images, etc., with variants for each scenario. With a set of pages like that (which can be static), we can run this through our synthetic testing platform for in-depth analysis, including running those tests on real underpowered devices in our mobile device lab, since they are more likely than simulated devices to have HTML download and parsing as a bottleneck.

  • Strategy 2: Post-process Parsoid's raw HTML before shipping: […] If we go this route, we will then require all other semantic content consumers to not use the read-view HTML but fetch it separately via API requests.

It sounds like this is meant to imply that, if we serve raw Parsoid HTML on read views, editor software could sometimes or always avoid fetching content from the API. I don't think that's feasible, however, given dynamic state and such. The pristine HTML is irreversibly lost once it's received by a client.

Yes, that was my implication. I understand what you are saying: browser HTML isn't trustworthy since it might have been manipulated. But I vaguely remember that ideas like service workers were considered to work around that. Anyway, I don't have enough information at this point to say whether that is feasible or not. I'll let the Editing Team (@Esanders) or others involved in building editing clients chime in on it so it can be factored into whatever solution we end up with.

ssastry renamed this task from Evaluate Parsoid HTML size from a performance POV for serving read views to Ensure Parsoid HTML served for read views doesn't degrade performance in low-resource contexts.Jan 19 2022, 6:55 PM

FWIW, @Esanders and the editing team are pretty confident they can avoid corruption of HTML once received by a client, but it would involve something like the service worker approach -- basically, as long as the received HTML is not linked into window.document, they can keep it out of the hands of gadgets. A similar approach might be to checksum the "expected" HTML so that well-behaved clients can reuse the read-view HTML for the edit view. For poorly behaved clients the checksum will fail and they'll have to reload the edit-view HTML from the server, but for the vast majority of clients this will still reduce download bandwidth and latency (although current RESTBase responds faster than the action API requests performed at the initiation of editing).
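
A sketch of the checksum idea (all names here are invented; a real mechanism would also need to agree on a canonical serialization before hashing):

```python
import hashlib

def html_checksum(html: str) -> str:
    # Server and client would hash an agreed-upon canonical
    # serialization; raw bytes suffice for this sketch.
    return hashlib.sha256(html.encode('utf-8')).hexdigest()

# Server sends the expected checksum alongside the read-view HTML.
served_html = '<p id="mwAg">Hello</p>'
expected = html_checksum(served_html)

# Client, at edit time: reuse its retained copy if untouched,
# otherwise re-fetch the edit-view HTML from the server.
client_html = served_html  # the client's retained copy
if html_checksum(client_html) == expected:
    print('reuse read-view HTML for editing')
else:
    print('checksum mismatch: re-fetch edit HTML')
```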

That said, I think the primary motivation here is to have a mapping between read/edit/etc. views that is "as simple as possible". The primary use case is: the reader clicks on something in the read view of the page (a section edit link, a reply, a particular thing they want to edit, etc.). We need to be able to take the (WLOG) xpath of the thing they clicked on and map it to the (WLOG) xpath of the corresponding element in the edit view.

So there needs to be a coherent mechanism for what is allowed/disallowed in the transformation so that (a) this mapping can always be reconstructed/stored, and (b) we avoid the case where we have an explosion of subtly-different versions of the HTML. (Putting (b) another way, let's assume we need to store an explicit xpath-to-xpath mapping between every different version of the HTML. If there's just "a read view" and "an edit view", that's only one map we need to generate and store. If we've got "desktop read view", "mobile web read view", "mobile app read view", "edit view", and "mobile app edit view", now we've got 16 different mappings to potentially deal with.)

One strawman version of this is that we add/keep an ID tag on every element. Then you can make pretty arbitrary changes to the HTML including stripping elements, altering structure, etc, but you can just match IDs across versions.
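
To illustrate, a minimal sketch of the ID-matching strawman (Python with ElementTree; the fragments and IDs are invented):

```python
import xml.etree.ElementTree as ET

read_view = ET.fromstring(
    '<body><p id="mwAg">Hi <b id="mwAw">there</b></p></body>')
edit_view = ET.fromstring(
    '<body><section id="mwAQ"><p id="mwAg">Hi '
    '<b id="mwAw" data-x="1">there</b></p></section></body>')

def by_id(root):
    """Index every element carrying an id attribute."""
    return {el.get('id'): el for el in root.iter() if el.get('id')}

# Map a clicked read-view node to its edit-view counterpart by shared
# ID, regardless of how the surrounding structure differs.
clicked = by_id(read_view)['mwAw']
counterpart = by_id(edit_view)[clicked.get('id')]
print(ET.tostring(counterpart, encoding='unicode'))
```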

An alternate strawman says "any changes are permissible but the structure must not change" -- that is, the xpath to a given node must be preserved between versions. Attributes can be freely deleted, but if you delete a node you *must* leave a placeholder node in its place so that the tree structure is unchanged.

There are certainly middle grounds between these extremes. The most flexible is to keep a full map between the xpaths in the different versions, along with a "rehydration document" which specifies how to undo every change that was made. So when you click node X to edit it, you do a query to the server to map the xpath of X to Y and (assuming your read-view HTML still matches the expected checksum) fetch the rehydration document, which will be all the information you need to reconstruct the edit-mode HTML. But even there it would probably be helpful if the entries in the rehydration document were of a limited # of types (which implicitly is then also the set of "things you can do to strip down the edit view HTML").
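
For concreteness, here is a toy rehydration document with a deliberately limited set of entry types (the format, operation names, and fragments are all invented; ElementTree stands in for a real DOM):

```python
import xml.etree.ElementTree as ET

# Invented entry types: 'set-attr' restores a stripped attribute;
# 'replace-placeholder' swaps a placeholder node back for the original.
REHYDRATION = [
    {'xpath': './/p', 'op': 'set-attr',
     'name': 'data-mw', 'value': '{"parts":[]}'},
    {'xpath': './/span', 'op': 'replace-placeholder',
     'html': '<b id="mwAw">bold</b>'},
]

def rehydrate(root, entries):
    """Undo read-view stripping in place, one entry at a time."""
    for entry in entries:
        node = root.find(entry['xpath'])
        if entry['op'] == 'set-attr':
            node.set(entry['name'], entry['value'])
        elif entry['op'] == 'replace-placeholder':
            repl = ET.fromstring(entry['html'])
            node.clear()
            node.tag, node.text = repl.tag, repl.text
            node.attrib.update(repl.attrib)
            node.extend(list(repl))

read_view = ET.fromstring(
    '<body><p id="mwAg">Hi</p><span id="mwAw"/></body>')
rehydrate(read_view, REHYDRATION)
# Prints the body with data-mw restored and the placeholder replaced.
print(ET.tostring(read_view, encoding='unicode'))
```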

Change 762459 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/parsoid@master] Benchmark read view content stripping

https://gerrit.wikimedia.org/r/762459

Change 762459 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Benchmark read view content stripping

https://gerrit.wikimedia.org/r/762459

We ran the following benchmark on rt-testing:

Here are the results from a run (commit hash 0352aa34):

[Attached image: benchmark results, 340×605 px, 20 KB]

Change 791051 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/parsoid@master] Read view benchmark: Add metrics from MW parser

https://gerrit.wikimedia.org/r/791051

Change 791051 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Read view benchmark: Add metrics from MW parser

https://gerrit.wikimedia.org/r/791051

Change 791581 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/parsoid@master] Read view benchmark: Disable profiler report on MW results

https://gerrit.wikimedia.org/r/791581

Change 791581 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Read view benchmark: Disable profiler report on MW results

https://gerrit.wikimedia.org/r/791581

Change 792236 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/vendor@master] Bump parsoid to 0.16.0-a8

https://gerrit.wikimedia.org/r/792236

Change 792236 merged by jenkins-bot:

[mediawiki/vendor@master] Bump parsoid to 0.16.0-a8

https://gerrit.wikimedia.org/r/792236

Here are the results based on the last rt-testing run (as HTML export):

ssastry renamed this task from Ensure Parsoid HTML served for read views doesn't degrade performance in low-resource contexts to Evaluate and recommend strategies for ensuring Parsoid HTML payload doesn't degreate performance in low-resource contexts..May 23 2022, 10:06 PM

I edited the task name to better reflect the description. I think we are close to arriving at a recommendation based on @Jgiannelos's work on this so far. I'll create other related (parent and sibling) tasks to reflect the full scope of work to be done for read views with respect to HTML payload.

ssastry renamed this task from Evaluate and recommend strategies for ensuring Parsoid HTML payload doesn't degreate performance in low-resource contexts. to Evaluate and recommend strategies for ensuring Parsoid HTML payload doesn't degrade performance in low-resource contexts..May 24 2022, 2:27 AM

TODO

  • Log snippets of what we strip to eyeball what's there
  • Figure out why the numbers for data-mw and typeof don't add up