
Evaluate and recommend strategies for ensuring Parsoid HTML payload doesn't degrade performance in low-resource contexts.
Open, Medium, Public

Description

For Parsoid read views, we need to evaluate performance on a few different axes. One of them is HTML size. Parsing, CPT (now known as PET), and Performance Teams had a meeting in July 2020 and the notes of that conversation are captured in https://www.mediawiki.org/wiki/Parsing/Parser_Unification/Performance#HTML_output_size.

This phab task is to follow up on that conversation.

A priori, we know that Parsoid HTML is going to be bigger than the core parser HTML that is currently being served. Parsoid HTML carries a bunch of additional information relative to the core parser HTML (see https://www.mediawiki.org/wiki/Specs/HTML for Parsoid's output HTML spec).

  1. data-parsoid attribute -- currently stripped and stored separately in storage / cache and not shipped to clients
  2. data-mw attribute -- currently shipped to all clients, but the plan is to strip them from the HTML and store them separately (and for editing clients to fetch them out-of-band on demand). See T78676: Store & load data-mw separately.
  3. rel="mw:.." attributes on links
  4. typeof=".." attributes on template, extension, and media output
  5. id=".." attribute on all DOM nodes (for indexing into a JSON blob that stores data-parsoid & data-mw attributes offline)
  6. <section> wrappers
  7. marker nodes for various rendering-transparent content in wikitext (interlanguage links, category links, magic word directives, etc.)
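For illustration, here is a hedged sketch of what several of these annotations look like together in Parsoid output (the ids, template name, and link targets below are invented for this example; see the HTML spec linked above for the real inventory):

```html
<!-- Illustrative only: ids, targets, and data values are made up. -->
<section data-mw-section-id="1" id="mwAQ">
  <p id="mwAg">
    <a rel="mw:WikiLink" href="./Example_page" id="mwAw">Example page</a>
    <span typeof="mw:Transclusion" id="mwBA"
          data-mw='{"parts":[{"template":{"target":{"wt":"Tmpl"}}}]}'>
      rendered template output</span>
  </p>
  <link rel="mw:PageProp/Category" href="./Category:Example" id="mwBQ"/>
</section>
```

Item 1 (data-parsoid) is omitted from the sketch since it is already stripped before shipping.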

We know from usage that editing and various other clients rely on some or all of this information. Parsoid's value proposition is precisely this and other semantic information that insulates all MediaWiki content consumers from having to know much about wikitext parsing to understand and manipulate wiki content while respecting wikitext semantics.

But, as the above list makes clear, that value for various clients comes at a potential HTML size cost.

This task is to evaluate the performance implications of this wrt read views. I imagine some or all of the following would be needed:

  • Identify a suitable benchmark set of pages (or repeatable methodology to build such a set of pages if we want to periodically reassess decisions made now) across production wikis.
  • Given the above benchmark set, assess the HTML size impact of the extra information Parsoid adds for all items in the above list except items 1 & 2. There is value in doing this analysis for items 1 & 2 as well, for purposes of HTML storage / cache costs, but it is otherwise tangential to the primary focus of this task. The tests should be done so that we can evaluate the impact on both raw and gzipped (or whatever compression scheme is used for network payloads) output.
  • Come up with anticipated performance impacts given the results from the performance tests above.
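As a minimal sketch of the kind of raw-vs-gzipped measurement described above (the regex-based stripping is a crude estimator invented here, not a real HTML2HTML transform, and the attribute patterns cover only items 3-5 from the list):

```python
import gzip
import re

# Crude approximation of the Parsoid-specific attributes (items 3-5 above).
# Note this also matches any non-Parsoid id starting with "mw"; good enough
# for a rough size estimate, not for production stripping.
PARSOID_ATTRS = re.compile(
    r'\s+(?:rel="mw:[^"]*"|typeof="[^"]*"|id="mw[^"]*"|data-mw=\'[^\']*\')'
)

def payload_sizes(html: str) -> dict:
    """Return raw and gzipped byte sizes for the HTML as-is and with
    Parsoid-specific attributes crudely stripped."""
    stripped = PARSOID_ATTRS.sub("", html)
    return {
        "raw": len(html.encode("utf-8")),
        "raw_stripped": len(stripped.encode("utf-8")),
        "gzip": len(gzip.compress(html.encode("utf-8"))),
        "gzip_stripped": len(gzip.compress(stripped.encode("utf-8"))),
    }

sample = '<p id="mwAg"><a rel="mw:WikiLink" href="./Foo" id="mwAw">Foo</a></p>'
print(payload_sizes(sample))
```

Run over the benchmark set of pages, this would give the raw and compressed deltas per annotation class.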

Based on these results, we will probably have to identify what mitigation actions are needed, if any. This next step is likely going to be an evaluation of performance tradeoffs. As observed earlier, given its utility to various downstream Wikimedia content consumers, Parsoid's raw HTML (what we emit today) is unlikely to change significantly, and the API endpoints might continue to serve that unmodified HTML as well. Given this, Parsoid's raw output would need to be cached (for performance reasons).

So, given the above, the performance tradeoffs are likely going to involve trading CPU time against storage:

  • Strategy 1: Ship Parsoid's raw HTML as-is without any stripping: This has an impact on network transfer times - the specific numbers depend on the evaluation results from earlier.
  • Strategy 2: Post-process Parsoid's raw HTML before shipping: This would be an HTML2HTML transform (like mobile-html, mobile-sections, language-variants, etc.). If we do this, we can post-process as aggressively as we want. If we go this route, we will then require all other semantic content consumers not to use the read-view HTML but to fetch it separately via API requests. There is a tradeoff here around total network transfer across all Parsoid HTML requests (not just read views). In addition, there are two specific sub-strategies available here:
    • Strategy 2a: Cache the post-processed HTML in ParserCache: This has a storage cost but enables fast read views. Note that Varnish / CDN benefits are on top of this.
    • Strategy 2b: Post-process the HTML on demand for all requests: This cuts storage costs in half but adds load on the servers that post-process Parsoid HTML. Note that we still benefit from Varnish / CDN caching.
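A minimal sketch of what a Strategy 2 strip pass could look like, assuming the strip list (data-mw, typeof, and mw:-prefixed rel values) is the policy chosen; id attributes are deliberately kept so a later read-to-edit mapping remains possible. Python's stdlib parser stands in here for whatever HTML2HTML machinery would actually be used:

```python
from html.parser import HTMLParser

# Attributes this sketch removes; the real strip list is a policy decision.
STRIP = {"data-mw", "data-parsoid", "typeof"}

class StripPass(HTMLParser):
    """Minimal HTML2HTML transform: drop annotation attributes but keep
    the tree structure untouched. (Attribute values are re-emitted as-is;
    a real transform would re-escape quotes properly.)"""
    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        kept = []
        for name, value in attrs:
            if name in STRIP:
                continue
            if name == "rel" and value and value.startswith("mw:"):
                continue  # drop Parsoid link-type annotations only
            kept.append(f' {name}' if value is None else f' {name}="{value}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

def strip_annotations(html: str) -> str:
    parser = StripPass()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Under Strategy 2a this output would go into ParserCache; under 2b it would run per request.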

Strategy 1 would be the lowest-cost approach from the Wikimedia infrastructure point of view but can penalize all downstream consumers that pay for bandwidth / data costs.

So, the end result of this performance evaluation would be a set of recommendations for the best way to serve read views that meet a set of performance goals (something to be established in consultation with different stakeholders).

Related performance tasks / discussions:

Event Timeline

I think that the performance tests will need to use HTML from real articles, with properly rendered templates, images, etc., and with variants for each scenario. With a set of pages like that (which can be static), we can run this through our synthetic testing platform for in-depth analysis, including running those tests on real underpowered devices in our mobile device lab, since they are more likely than simulated devices to have HTML download and parsing as a bottleneck.

  • Strategy 2: Post-process Parsoid's raw HTML before shipping: […] If we go this route, we will then require all other semantic content consumers to not use the read-view HTML but fetch it separately via API requests.

It sounds like this is meant to imply that, if we serve raw Parsoid HTML on read views, that editor software could sometimes or always not have to fetch content from the API. I don't think that's feasible, however, given dynamic state and such. The HTML is irreversibly lost once received by a client.

  • Strategy 2: Post-process Parsoid's raw HTML before shipping: […] If we go this route, we will then require all other semantic content consumers to not use the read-view HTML but fetch it separately via API requests.

It sounds like this is meant to imply that, if we serve raw Parsoid HTML on read views, that editor software could sometimes or always not have to fetch content from the API. I don't think that's feasible, however, given dynamic state and such. The HTML is irreversibly lost once received by a client.

Yes, that was my implication. I understand you are saying that browser HTML isn't trustworthy since it might have been manipulated. But I vaguely remember that ideas like service workers were considered to work around that. Anyway, I don't have enough information at this point to say whether that is feasible or not. I'll let the Editing Team (@Esanders) or others involved in building editing clients chime in so it can be factored into whatever solution we end up with.

ssastry renamed this task from Evaluate Parsoid HTML size from a performance POV for serving read views to Ensure Parsoid HTML served for read views doesn't degrade performance in low-resource contexts.Jan 19 2022, 6:55 PM

FWIW, @Esanders and the editing team are pretty confident they can avoid corruption of HTML once received by a client, but it would involve something like the service worker approach -- basically as long as the received HTML is not linked into window.document, they can keep it out of the hands of gadgets. A similar approach might be to checksum the "expected" HTML so that "well behaved clients" can reuse the read-view HTML for edit view. For poorly-behaved clients the checksum will fail and they'll have to reload the edit view HTML from the server, but this will still reduce download bandwidth and latency (although current RESTbase responds faster than the action API requests performed at the initiation of editing) for the vast majority of clients.
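The checksum idea could be sketched as follows; SHA-256 and the helper names are assumptions for illustration (the comment above only stipulates "a checksum"):

```python
import hashlib

def html_checksum(html: str) -> str:
    """Digest of the HTML exactly as served; the server would publish
    this alongside the page so an editing client can check whether its
    local copy is still pristine before reusing it."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def can_reuse_read_view(local_html: str, expected: str) -> bool:
    # Well-behaved client: digest matches, reuse the read-view HTML.
    # Gadget-modified DOM: digest differs, refetch edit HTML from server.
    return html_checksum(local_html) == expected
```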

That said, I think the primary motivation here is to have a mapping between read/edit/etc views that is "as simple as possible". The primary use case is: the reader clicks on something in the read view of the page (a section edit link, a reply, a particular thing they want to edit, etc). We need to be able to take the (WLOG) xpath to the thing they clicked on and map it to the (WLOG) xpath of the corresponding element in the edit view.

So there needs to be a coherent mechanism for what is allowed/disallowed in the transformation so that (a) this mapping can always be reconstructed/stored, (b) we avoid the case where we have an explosion of subtly-different versions of the HTML. (Putting (b) another way, let's assume we need to store an explicit xpath-to-xpath mapping between every different version of the HTML. If there's just "a read view" and "an edit view", that's only one map we need to generate and store. If we've got "desktop read view", "mobile web read view", "mobile app read view", "edit view", "mobile app edit view" now we've got 16 different mappings to potentially deal with.)

One strawman version of this is that we add/keep an ID tag on every element. Then you can make pretty arbitrary changes to the HTML including stripping elements, altering structure, etc, but you can just match IDs across versions.

An alternate strawman says "any changes are permissible but the structure must not change" -- that is, the xpath to a given node must be preserved between versions. Attributes can be freely deleted, but if you delete a node you *must* leave a placeholder node in its place so that the tree structure is unchanged.

There are certainly middle grounds between these extremes. The most flexible is to keep a full map between the xpaths in the different versions, along with a "rehydration document" which specifies how to undo every change that was made. So when you click node X to edit it, you do a query to the server to map the xpath of X to Y and (assuming your read-view HTML still matches the expected checksum) fetch the rehydration document, which will be all the information you need to reconstruct the edit-mode HTML. But even there it would probably be helpful if the entries in the rehydration document were of a limited # of types (which implicitly is then also the set of "things you can do to strip down the edit view HTML").
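One hypothetical shape for such a rehydration document is a list of typed, invertible edits (the operation names and xpaths below are invented for illustration; the comment only stipulates a limited number of entry types):

```python
# Each entry is one invertible edit: applying all of them to the stripped
# read-view HTML would reconstruct the fully annotated edit-view HTML.
rehydration_document = [
    {"op": "restore-attr",
     "xpath": "/html/body/section[1]/p[1]/a[1]",
     "name": "rel", "value": "mw:WikiLink"},
    {"op": "restore-attr",
     "xpath": "/html/body/section[1]/p[1]/span[1]",
     "name": "data-mw",
     "value": '{"parts":[{"template":{"target":{"wt":"Tmpl"}}}]}'},
    {"op": "replace-placeholder",
     "xpath": "/html/body/section[1]/meta[1]",
     "html": '<link rel="mw:PageProp/Category" href="./Category:Example"/>'},
]

# A client applying this only needs to understand the small, fixed set of
# "op" types -- which is implicitly also the set of transformations the
# strip pass is allowed to perform.
OP_TYPES = {entry["op"] for entry in rehydration_document}
```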

Change 762459 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/parsoid@master] Benchmark read view content stripping

https://gerrit.wikimedia.org/r/762459

Change 762459 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Benchmark read view content stripping

https://gerrit.wikimedia.org/r/762459

We ran the following benchmark on rt-testing:

Here are the results from a run (commit hash 0352aa34):

[Attached image: benchmark results (340×605 px, 20 KB)]

Change 791051 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/parsoid@master] Read view benchmark: Add metrics from MW parser

https://gerrit.wikimedia.org/r/791051

Change 791051 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Read view benchmark: Add metrics from MW parser

https://gerrit.wikimedia.org/r/791051

Change 791581 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/parsoid@master] Read view benchmark: Disable profiler report on MW results

https://gerrit.wikimedia.org/r/791581

Change 791581 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Read view benchmark: Disable profiler report on MW results

https://gerrit.wikimedia.org/r/791581

Change 792236 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/vendor@master] Bump parsoid to 0.16.0-a8

https://gerrit.wikimedia.org/r/792236

Change 792236 merged by jenkins-bot:

[mediawiki/vendor@master] Bump parsoid to 0.16.0-a8

https://gerrit.wikimedia.org/r/792236

Here are the results based on the last rt-testing run (as HTML export):

ssastry renamed this task from Ensure Parsoid HTML served for read views doesn't degrade performance in low-resource contexts to Evaluate and recommend strategies for ensuring Parsoid HTML payload doesn't degreate performance in low-resource contexts..May 23 2022, 10:06 PM

I edited the task name to better reflect the description. I think we are close to arriving at a recommendation based on @Jgiannelos's work on this so far. I'll create other related (parent and sibling) tasks to reflect the full scope of work to be done for read views wrt HTML payload.

ssastry renamed this task from Evaluate and recommend strategies for ensuring Parsoid HTML payload doesn't degreate performance in low-resource contexts. to Evaluate and recommend strategies for ensuring Parsoid HTML payload doesn't degrade performance in low-resource contexts..May 24 2022, 2:27 AM

TODO

  • Log snippets of what we strip to eyeball what's there
  • Figure out why the numbers for data-mw and typeof don't add up

@Jgiannelos: Removing task assignee as this open task has been assigned for more than two years - see the email sent to all task assignees on 2024-04-15.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome! :)
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator. Thanks!

From the task description:
  • data-mw attribute -- currently shipped to all clients, but the plan is to strip them from the HTML and store them separately (and for editing clients to fetch them out-of-band on demand). See T78676: Store & load data-mw separately.
From the task description of T130639:

Google's services should load data-mw from a separate API call to RESTBase

Once the Parsoid and RESTBase APIs start returning data-mw separately, all clients that use Parsoid HTML should fetch the data-mw blob separately (if fetching via RESTBase) or be prepared to handle the separate data-mw blob in the pagebundle API (if fetching via Parsoid).

  • Strategy 2: Post-process Parsoid's raw HTML before shipping: […] If we go this route, we will then require all other semantic content consumers to not use the read-view HTML but fetch it separately via API requests.

It sounds like this is meant to imply that, if we serve raw Parsoid HTML on read views, that editor software could sometimes or always not have to fetch content from the API. I don't think that's feasible, however, given dynamic state and such. The HTML is irreversibly lost once received by a client.

[…] I'll let the Editing Team […] chime in on it so it can be factored into whatever solution we end up with.

Based on the above discussion and tasks like T130639, my understanding/assumptions are:

  • There is a long-term desire for VisualEditor to load faster by not downloading the page content twice.
    • The current thinking is that VisualEditor would one day use the read-view HTML and supplement it with data from another endpoint to essentially reconstruct the annotated Parsoid DOM.
      • When we say "read-view HTML" in this context, we're not talking about an API response serving "the same" read-view HTML that the skin embeds in a real article navigation. Instead, we're talking about using the actual live <div class=mw-parsoid-output> DOM of that skinned in-browser pageview on which one activates VisualEditor. I assume this, because otherwise there would be no bandwidth savings.
  • There is similarly a performance incentive for tools and external consumers to be able to download slimmed-down HTML from our API. While not all pageview transformations will be relevant for the API, most optimisations would probably benefit API consumers, too. At least for use cases that don't need the annotations / don't involve editing. This means besides transforms like "summary" and "mobile apps", and the default "full" (sans data-parsoid), there may come a new "optimised" and/or "pageview" variant, perhaps even as the default API response.
  • There are no use cases for separately downloading the "optimised" HTML from a REST API and immediately fetching the supplementary data to reconstruct the annotated DOM.

FWIW, @Esanders and the editing team are pretty confident they can avoid corruption of HTML once received by a client, […] -- basically as long as the received HTML is not linked into window.document, they can keep it out of the hands of gadgets. […]

I think there may be a miscommunication.

VisualEditor can make a REST API request for Parsoid HTML and keep it uncorrupted. That's how it works today. Whether that API responds with fully-annotated Parsoid HTML, or with an optimised copy annotated via a supplemental request, doesn't change the integrity of local JavaScript variables. It's not impossible to penetrate, but the same threats would exist either way, including today.

I don't understand what it means to keep the read-view HTML out of window.document. This implies we serve a blank web page.

[…]
A similar approach might be to checksum the "expected" HTML so that "well behaved clients" can reuse the read-view HTML for edit view. For poorly-behaved clients the checksum will fail and they'll have to reload the edit view HTML from the server, but this will still reduce download bandwidth and latency […] for the vast majority of clients.

That said, I think the primary motivation here is to have a mapping between read/edit/etc views that is "as simple as possible". […]

I think it is fairly rare for native code (things we don't control, such as browsers, corporate firewalls, and browser extensions) to modify the HTML stream. It does happen with in-app browsers like Facebook's and Telegram's, and with browser extensions like uBlock or Honey that modify all sorts of links and elements on the page, but it is probably rare enough on our sites.

I don't think it is rare for first-party and community-owned scripts to modify the mw-parser-output DOM while reading. This includes first-party software (gallery, collapsible, sortable, TimedMediaHandler, MediaViewer, MobileFrontend), site-wide gadgets, and popular user scripts. Elements are modified, re-ordered, re-parented, inserted, or removed in all sorts of ways. Does this not make it unsuitable for VisualEditor?

As such, does that not make T88623 intractable?

Why I'm asking this now:

  • Performance. On https://de.wiktionary.org/wiki/pleasure the current version of Parsoid read views is enabled. This includes serving <meta property="mw:PageProp/toc" id="mwAg" data-mw="{&quot;autoGenerated&quot;:true}"> and attributes like data-mw='{"parts":[{"template":{"…0}}]}'. This page is about 30% larger than the ?useparsoid=0 equivalent. It appears that the main thing preventing us from shipping improvements that strip this, is an intent to allow for backwards-mapping one day.
  • Consumer complexity. It is simpler to offer multiple formats in our API (and maybe dumps) suitable to the task at hand, rather than require external parties to interact with multiple endpoints and reconstruct what we have internally. I realize T130689 is eight years old, so maybe this is decided differently since, but that task reads like we want to convince (or require) Google (and others?) to do exactly that. I worry that would make our APIs harder to understand and use, and thus impact adoption. Are there reasons for this split-and-reconstruct direction beyond backwards-mapping for on-site editing? Let's assume this can work and VisualEditor gets blazing fast - can we not still serve the full version via the REST API like we do today, and thus not require consumers like Google to do what VE would do?

I don't understand what it means to keep the read-view HTML out of window.document. This implies we serve a blank web page.

I believe what it means is that we serve the read view HTML to the browser, then when the editor launches it "reloads" the exact same window.location but into a document that is kept separate from window.document. This "reload" should in theory come directly from the browser cache and cause no additional network traffic, but will be a "clean" copy of the HTML that is not accessible to toolbar gadgets and other corruption. Then the editor would load the data-mw as a separate network transaction to decorate the clean read view HTML for editing.

This makes some assumptions about browser cache behavior but not unreasonable ones I think.

  • Performance. On https://de.wiktionary.org/wiki/pleasure the current version of Parsoid read views is enabled. This includes serving <meta property="mw:PageProp/toc" id="mwAg" data-mw="{&quot;autoGenerated&quot;:true}"> and attributes like data-mw='{"parts":[{"template":{"…0}}]}'. This page is about 30% larger than the ?useparsoid=0 equivalent. It appears that the main thing preventing us from shipping improvements that strip this, is an intent to allow for backwards-mapping one day.

https://grafana.wikimedia.org/goto/Vgedv1jNR?orgId=1 contains real-time metrics of size bloat between Parsoid and legacy content, and the actual increase in size for the past 14 days as I'm writing this is 16%. This is uncompressed size, BTW; I expect the actual network bloat is even less, as many of the Parsoid attributes lend themselves very well to dictionary-based compression. (This is collected as part of RefreshLinksJob, and /parses/ may not directly relate to /read page views/ either.)

  • Consumer complexity. It is simpler to offer multiple formats in our API (and maybe dumps) suitable to the task at hand, rather than require external parties to interact with multiple endpoints and reconstruct what we have internally. I realize T130689 is eight years old, so maybe this is decided differently since, but that task reads like we want to convince (or require) Google (and others?) to do exactly that. I worry that would make our APIs harder to understand and use, and thus impact adoption. Are there reasons for this split-and-reconstruct direction beyond backwards-mapping for on-site editing? Let's assume this can work and VisualEditor gets blazing fast - can we not still serve the full version via the REST API like we do today, and thus not require consumers like Google to do what VE would do?

<opinion>My preference is to ship a single version of the HTML.</opinion> That maximizes the usefulness of our read-view HTML for editor gadgets and other tools, and, as you mention, reduces API/interface complexity. We've spent a lot of effort minimizing the footprint of Parsoid's additions to the HTML, and there are some additional tools we could deploy there. This isn't unrelated to the question of separating data-mw and data-parsoid content, because encoding those as inline HTML attributes incurs encoding penalties that could be eliminated if they were transferred directly as JSON blobs (although the encoding penalties are probably mitigated by gzip compression, so "looks ugly to a human" might not mean "inefficient in practice").

The most space-efficient format is to include a <script> element in the page with the contents of all "rich" (JSON-valued) attributes as a JSON blob, and to additionally encode HTML embedded within those rich attributes as <template> elements in the <head>, and I expect that's probably the form we'll use internally for the ParserCache eventually.
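A hedged sketch of what that <script> + <template> format might look like (the element ids and payload structure below are invented; no actual ParserCache format is specified in this discussion):

```html
<!-- Sketch only: ids and payload shape are made up for illustration. -->
<head>
  <template id="mw-rich-html-1"><b>HTML embedded in a rich attribute</b></template>
  <script id="mw-data" type="application/json">
    {"data-mw": {"mwBA": {"parts": [{"template": {"target": {"wt": "Tmpl"}}}]}},
     "data-parsoid": {"mwBA": {"dsr": [0, 10, 2, 2]}}}
  </script>
</head>
```

Keyed by element id, this avoids the double-escaping cost of JSON inside HTML attribute values while keeping everything in a single document.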

All that said, I think tooling can be a significant factor here. Even the "inline JSON-encoded HTML attribute" format we use for data-mw is not exactly commonplace. Providing good tools and APIs to obtain/decorate and work with our content ought to be a requirement in any case. If there's a one-line mw.decorate() call to obtain the data-mw from an API and apply it to the current page view I don't see that as a major impediment to gadget authors, although (to grant your point) once that makes its way into a gadget deployed by default on a major wiki it does raise performance issues that could be resolved if we just shipped the fully-decorated HTML in the first place, encoded and compressed as best we can.