
Evaluate Parsoid HTML size from a performance POV for serving read views
Open, Medium, Public

Description

This is a first-pass thought-dump draft. Edit and improve to better organize.

For Parsoid read views, we need to evaluate performance on a few different axes. One of them is HTML size. Parsing, CPT (now known as PET), and Performance Teams had a meeting in July 2020 and the notes of that conversation are captured in https://www.mediawiki.org/wiki/Parsing/Parser_Unification/Performance#HTML_output_size.

This phab task is to follow up on that conversation.

A priori, we know that Parsoid HTML is going to be bigger than the core parser HTML that is currently being served. Parsoid HTML carries a bunch of additional information relative to the core parser HTML (see https://www.mediawiki.org/wiki/Specs/HTML for Parsoid's output HTML spec); a rough sketch for quantifying the per-item byte overhead follows the list below.

  1. data-parsoid attribute -- currently stripped and stored separately in storage / cache and not shipped to clients
  2. data-mw attribute -- currently shipped to all clients, but the plan is to strip it from the HTML and store it separately (and for editing clients to fetch it out-of-band on demand). See T78676: Store & load data-mw separately.
  3. rel="mw:." attributes on links
  4. typeof=".." attributes on template, extension, and media output
  5. id=".." attribute on all DOM nodes (for indexing into a JSON blob that stores data-parsoid & data-mw attributes offline)
  6. <section> wrappers
  7. marker nodes for various rendering-transparent content in wikitext (interlanguage links, category links, magic word directives, etc.)
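
To make that per-item overhead measurable, something like the following rough sketch could be used. It is illustrative only: it assumes lxml is available, and the bucket names, the "mw:" prefix checks, and the small fudge factor for attribute quoting/spacing are my own assumptions rather than part of any existing tooling.

```
from lxml import html as lhtml

def attribute_overhead(parsoid_html):
    """Rough per-category byte counts for Parsoid-specific markup on one page.
    data-parsoid is assumed to have been stripped upstream already."""
    tree = lhtml.fromstring(parsoid_html)
    buckets = {"data-mw": 0, "rel": 0, "typeof": 0, "id": 0, "section-wrappers": 0}

    def attr_bytes(name, value):
        # name + ="value" + a separating space, roughly
        return len(name) + len(value) + 4

    for el in tree.iter():
        if not isinstance(el.tag, str):      # skip comments, processing instructions
            continue
        if el.tag == "section":
            # crude estimate: just the opening and closing tag text
            buckets["section-wrappers"] += len("<section>") + len("</section>")
        for name, value in el.attrib.items():
            if name in ("data-mw", "id"):
                buckets[name] += attr_bytes(name, value)
            elif name in ("rel", "typeof") and value.startswith("mw:"):
                buckets[name] += attr_bytes(name, value)
    return buckets
```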

We know from usage that editing and various other clients rely on some or all of this information. Parsoid's value proposition is precisely this and other semantic information, which insulates MediaWiki content consumers from having to know much about wikitext parsing in order to understand and manipulate wiki content while respecting wikitext semantics.

But, as the above list makes clear, that value for various clients comes at a potential HTML size cost.

This task is to evaluate the performance implications of this with respect to read views. I imagine some or all of the following would be needed:

  • Identify a suitable benchmark set of pages (or repeatable methodology to build such a set of pages if we want to periodically reassess decisions made now) across production wikis.
  • Given the above benchmark set, assess the HTML size impact of the extra information Parsoid adds for all items in the above list except items 1 & 2. There is value in doing this analysis for items 1 & 2 as well for purposes of HTML storage / cache costs, but it is otherwise tangential to the primary focus of this task. The tests should be set up so that we can evaluate the impact on both raw and gzipped (or whatever compression scheme is used for network payloads) output; see the measurement sketch after this list.
  • Come up with anticipated performance impacts given the results from the performance tests above.
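
As a starting point for the measurement itself, a minimal harness along these lines might do. It assumes the public REST endpoint /api/rest_v1/page/html/{title} serves Parsoid HTML and that action=parse returns the current core parser HTML; the title list is just a placeholder for whatever benchmark set we settle on.

```
import gzip
import requests

WIKI = "https://en.wikipedia.org"
TITLES = ["Earth", "Chemistry", "Barack Obama"]   # placeholder benchmark set

def parsoid_html(title):
    # Parsoid HTML as served by RESTBase (data-parsoid already stripped)
    r = requests.get(f"{WIKI}/api/rest_v1/page/html/{title.replace(' ', '_')}")
    r.raise_for_status()
    return r.text

def core_parser_html(title):
    # Current read-view HTML from the core parser via the action API
    r = requests.get(f"{WIKI}/w/api.php", params={
        "action": "parse", "page": title, "format": "json", "formatversion": 2})
    r.raise_for_status()
    return r.json()["parse"]["text"]

def sizes(text):
    raw = text.encode("utf-8")
    return len(raw), len(gzip.compress(raw))

for title in TITLES:
    p_raw, p_gz = sizes(parsoid_html(title))
    c_raw, c_gz = sizes(core_parser_html(title))
    print(f"{title}: parsoid {p_raw}B ({p_gz}B gz) vs core {c_raw}B ({c_gz}B gz)")
```

The same harness could re-measure sizes after stripping individual markup categories (using something like the byte-breakdown sketch above) to attribute the difference to specific items in the list.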

Based on these results, we will probably have to identify what mitigation actions are needed, if any. This next step is likely going to be an evaluation of performance tradeoffs. As observed earlier, given its utility to various downstream Wikimedia content consumers, Parsoid's raw HTML (what we emit today) is unlikely to change significantly, and the API endpoints might continue to serve that unmodified HTML as well. Given this, Parsoid's raw output would need to be cached (for performance reasons).

So, given the above, the performance tradeoffs largely come down to trading CPU time against storage:

  • Strategy 1: Ship Parsoid's raw HTML as is, without any stripping: This has an impact on network transfer times -- the specific numbers depend on the evaluation results from earlier.
  • Strategy 2: Post-process Parsoid's raw HTML before shipping: This would be an HTML2HTML transform (like existing ones such as mobile-html, mobile-sections, language-variants, etc.). If we do this, we can post-process as aggressively as we want. If we go this route, we will then require all other semantic content consumers not to use the read-view HTML but to fetch it separately via API requests. There is a tradeoff here around total network transfer across all Parsoid HTML requests (not just read views). In addition, there are two specific sub-strategies available here (a sketch of such a strip transform follows this list):
    • Strategy 2a: Cache the post-processed HTML in ParserCache/Varnish or only Varnish: This has a storage cost but enables fast read views.
    • Strategy 2b: Post-process the HTML on demand for all requests: This cuts storage costs in half but adds load on the servers that post-process Parsoid HTML.
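
For illustration, the Strategy 2 strip transform could look roughly like the sketch below. Which attributes and wrappers are actually safe to drop from read views is exactly what this task needs to establish, so the specific heuristics here (e.g. treating ids that start with "mw" as Parsoid-generated) are assumptions, not a settled spec.

```
from lxml import html as lhtml

def strip_for_read_views(parsoid_html):
    """Drop Parsoid-specific markup that read views (presumably) don't need."""
    tree = lhtml.fromstring(parsoid_html)
    for el in list(tree.iter()):
        if not isinstance(el.tag, str):
            continue
        el.attrib.pop("data-mw", None)
        el.attrib.pop("data-parsoid", None)            # usually stripped upstream already
        if el.attrib.get("rel", "").startswith("mw:"):
            del el.attrib["rel"]
        if el.attrib.get("typeof", "").startswith("mw:"):
            del el.attrib["typeof"]
        if el.attrib.get("id", "").startswith("mw"):   # Parsoid's auto-generated ids
            del el.attrib["id"]
        if el.tag == "section":
            el.drop_tag()                              # unwrap, keeping children
    return lhtml.tostring(tree, encoding="unicode")
```

Whether the output of such a transform is then cached (2a) or computed on demand (2b) is independent of the transform itself.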

Strategy 1 would be the lowest-cost approach from the Wikimedia infrastructure point of view, but it can penalize all downstream consumers that pay for bandwidth / data costs.

So, the end result of this performance evaluation would be to arrive at a set of recommendations for the best way to serve read views while meeting a set of performance goals (to be established in consultation with different stakeholders).

Related performance tasks / discussions:

Event Timeline

I think the performance tests will need to use HTML from real articles, with properly rendered templates, images, etc., and with variants for each scenario. With a set of pages like that (which can be static), we can run this through our synthetic testing platform for in-depth analysis, including running the tests on real underpowered devices in our mobile device lab, since they are more likely than simulated devices to have HTML download and parsing as a bottleneck.

  • Strategy 2: Post-process Parsoid's raw HTML before shipping: […] If we go this route, we will then require all other semantic content consumers to not use the read-view HTML but fetch it separately via API requests.

It sounds like this is meant to imply that, if we serve raw Parsoid HTML on read views, editing software could sometimes or always avoid fetching content from the API. I don't think that's feasible, however, given dynamic state and such. The original HTML is irreversibly lost once a client has received and rendered it.

Yes, that was my implication. I understand that what you are saying is that browser HTML isn't trustworthy since it might have been manipulated. But I vaguely remember that ideas like service workers were considered to work around that. In any case, I don't have enough information at this point to say whether that is feasible or not. I'll let the Editing Team (@Esanders) or others involved in building editing clients chime in on it so it can be factored into whatever solution we end up with.