This is a first-pass thought-dump draft. Edit and improve to better organize.
For Parsoid read views, we need to evaluate performance on a few different axes. One of them is HTML size. The Parsing, CPT (now known as PET), and Performance teams met in July 2020, and the notes from that conversation are captured at https://www.mediawiki.org/wiki/Parsing/Parser_Unification/Performance#HTML_output_size.
This phab task is to follow up on that conversation.
A priori, we know that Parsoid HTML is going to be bigger than the core parser HTML that is currently being served. Parsoid HTML carries a bunch of additional information relative to the core parser HTML (see https://www.mediawiki.org/wiki/Specs/HTML for Parsoid's output HTML spec):
- data-parsoid attribute -- currently stripped and stored separately in storage / cache and not shipped to clients
- data-mw attribute -- currently shipped to all clients, but the plan is to strip it from the HTML and store it separately (with editing clients fetching it out-of-band, on demand). See T78676: Store & load data-mw separately.
- rel="mw:." attributes on links
- typeof=".." attributes on template, extension, and media output
- id=".." attribute on all DOM nodes (for indexing into a JSON blob that stores data-parsoid & data-mw attributes offline)
- <section> wrappers
- marker nodes for various rendering-transparent content in wikitext (interlanguage links, category links, magic word directives, etc.)
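To get a rough sense of what the markers above cost, a small script can tally the bytes spent on those attributes within a fragment. This is only a sketch: the sample fragment and the attribute list are illustrative assumptions, not actual Parsoid output.

```python
from html.parser import HTMLParser

# Attribute names from the list above (assumed set, for illustration only).
MARKER_ATTRS = {"data-parsoid", "data-mw", "rel", "typeof", "id"}

class MarkerByteCounter(HTMLParser):
    """Roughly tallies bytes spent on Parsoid marker attributes."""
    def __init__(self):
        super().__init__()
        self.marker_bytes = 0

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in MARKER_ATTRS:
                # Approximates ` name="value"`: name + value + space,
                # equals sign, and two quote characters.
                self.marker_bytes += len(name) + len(value or "") + 4

# Hypothetical Parsoid-style fragment, not real parser output.
SAMPLE = ('<section data-mw-section-id="0" id="mwAQ">'
          '<p id="mwAg">Hello '
          '<a rel="mw:WikiLink" href="/wiki/X" id="mwAw" '
          'data-parsoid=\'{"dsr":[6,11,2,2]}\'>X</a></p></section>')

counter = MarkerByteCounter()
counter.feed(SAMPLE)
share = counter.marker_bytes / len(SAMPLE.encode("utf-8"))
```

On real pages this per-fragment share would vary widely with template and media density, which is exactly why a benchmark set (below) is needed.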
We know from usage that editing and various other clients rely on some or all of this information. Parsoid's value proposition is precisely this and other semantic information, which insulates all MediaWiki content consumers from having to know much about wikitext parsing in order to understand and manipulate wiki content while respecting wikitext semantics.
But, as the above list makes clear, that value for various clients comes at a potential HTML size cost.
This task is to evaluate the performance implications of this with respect to read views. I imagine some or all of the following would be needed:
- Identify a suitable benchmark set of pages (or repeatable methodology to build such a set of pages if we want to periodically reassess decisions made now) across production wikis.
- Given the above benchmark set, assess the HTML size impact of the extra information Parsoid adds for all items in the above list except items 1 & 2. There is value in doing this analysis for items 1 & 2 as well, for estimating HTML storage / cache costs, but it is otherwise tangential to the primary focus of this task. The tests should be run so that we can evaluate the impact on both raw and gzipped (or whatever compression scheme is used for network payloads) output.
- Come up with anticipated performance impacts given the results from the performance tests above.
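For the size-impact step above, a minimal measurement harness might compare the same page's HTML from both parsers on the raw and compressed axes. The function names are made up for this sketch, the inputs are assumed to be pre-fetched HTML strings (no real API calls), and gzip stands in for whatever compression is actually used on the wire:

```python
import gzip

def payload_sizes(html: str) -> dict:
    """Raw and gzipped byte sizes for one HTML payload."""
    raw = html.encode("utf-8")
    return {"raw": len(raw), "gzip": len(gzip.compress(raw, compresslevel=6))}

def overhead(core_html: str, parsoid_html: str) -> dict:
    """Relative size of Parsoid HTML vs. core parser HTML for the same
    page, on both the raw and the compressed axis. Callers are assumed
    to have already fetched both renderings (hypothetical harness)."""
    core = payload_sizes(core_html)
    parsoid = payload_sizes(parsoid_html)
    return {k: parsoid[k] / core[k] for k in core}
```

Running this over the benchmark set and reporting the distribution of both ratios (not just the mean) would feed directly into the next step.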
Based on these results, we will probably have to identify what mitigation actions are needed, if any. This next step is likely going to be an evaluation of performance tradeoffs. As observed earlier, given its utility to various downstream Wikimedia content consumers, Parsoid's raw HTML (what we emit today) is unlikely to change significantly, and the API endpoints might continue to serve that unmodified HTML as well. Given this, Parsoid's raw output would need to be cached (for performance reasons).
So, given the above, the performance tradeoffs are likely going to involve trading CPU time against storage:
- Strategy 1: Ship Parsoid's raw HTML as is, without any stripping: This has an impact on network transfer times - the specific numbers depend on the evaluation results from earlier.
- Strategy 2: Post-process Parsoid's raw HTML before shipping: This would be an HTML2HTML transform (like many others: mobile-html, mobile-sections, language-variants, etc.). If we go this route, we can post-process as aggressively as we want, but we would then require all other semantic content consumers to fetch the full HTML separately via API requests instead of reusing the read-view HTML. There is a tradeoff here around total network transfer across all Parsoid HTML requests (not just read views). In addition, there are two specific sub-strategies available here:
- Strategy 2a: Cache the post-processed HTML in ParserCache/Varnish, or only in Varnish: This has a storage cost but enables fast read views.
- Strategy 2b: Post-process the HTML on demand for all requests: This cuts storage costs in half but adds load on the servers that post-process Parsoid HTML.
Strategy 1 would be the lowest-cost approach from the Wikimedia infrastructure point of view, but it can penalize all downstream consumers that pay for bandwidth / data costs.
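To make Strategy 2 concrete, the strip transform could look roughly like the sketch below. This is a naive illustrative serializer, not the real transform pipeline: it only drops data-parsoid / data-mw (an aggressive version would also handle ids, mw:-prefixed rel/typeof values, and section wrappers), and it does not escape quotes inside attribute values.

```python
from html.parser import HTMLParser

# Attributes to drop in this sketch (assumed subset of a real strip list).
STRIP_ATTRS = {"data-parsoid", "data-mw"}

class ParsoidStripper(HTMLParser):
    """Rebuilds HTML while dropping Parsoid-private attributes."""
    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def _emit_tag(self, tag, attrs, close=""):
        kept = [(k, v) for k, v in attrs if k not in STRIP_ATTRS]
        # Naive re-serialization: assumes values contain no double quotes.
        parts = "".join(f' {k}="{v}"' if v is not None else f" {k}"
                        for k, v in kept)
        self.out.append(f"<{tag}{parts}{close}>")

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, close="/")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

def strip_parsoid_attrs(html: str) -> str:
    parser = ParsoidStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Measuring this transform's CPU cost per page on the benchmark set would quantify the 2a-vs-2b tradeoff above.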
So, the end result of this performance evaluation would be a set of recommendations for the best way to serve read views while meeting a set of performance goals (something to be established in consultation with different stakeholders).
Related performance tasks / discussions:
- Impact of CSS / JS resource sizes on client-side rendering time: See T270150: Selectors in content.media.less need improvement in terms of performance and stability. T51097#6690317 and T51097#6690702 are related comments that led to T270150. This isn't strictly determined by HTML size.
- T51097#6690893 has related discussion / considerations that might help evaluate the performance mitigation strategies.