Store & load data-mw separately
Open, Medium, Public


In large, template-heavy articles, the data-mw attribute increases the size of the HTML significantly. See (some old) numbers below for the Barack Obama page.

  972K              # as returned by the mobile site (with some chrome)

  3.5M Barack_Obama.html                     # as returned by Parsoid v1 API
  467K Barack_Obama.html.gz
 1003K Barack_Obama.no_data-mw.html          # without data-parsoid and data-mw

Parsoid v2 API via RESTBase (no data-parsoid, with element ids)
  2.8M Barack_Obama.no_data-parsoid.html     # as returned by restbase
  414K Barack_Obama.no_data-parsoid.html.gz
  1.2M Barack_Obama.no_data-mw.html          # restbase minus data-mw
  214K Barack_Obama.no_data-mw.html.gz

RESTBase data-mw:
  1.6M Barack_Obama.datamw.json
  209K Barack_Obama.datamw.json.gz

This is a lot of overhead for read views which don't need this information.
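
To put that overhead in proportion, the following sketch computes the share of the RESTBase (Parsoid v2) payload that data-mw accounts for, using only the figures listed above. The helper name `overhead_share` is made up for this illustration.

```python
# Sizes copied from the RESTBase rows above (uncompressed in MB, gzipped in KB).
WITH_DATA_MW_MB = 2.8
WITHOUT_DATA_MW_MB = 1.2
WITH_DATA_MW_GZ_KB = 414
WITHOUT_DATA_MW_GZ_KB = 214


def overhead_share(with_attr: float, without_attr: float) -> float:
    """Fraction of the payload that disappears when the attribute is removed."""
    return (with_attr - without_attr) / with_attr


raw_share = overhead_share(WITH_DATA_MW_MB, WITHOUT_DATA_MW_MB)
gz_share = overhead_share(WITH_DATA_MW_GZ_KB, WITHOUT_DATA_MW_GZ_KB)

print(f"uncompressed: {raw_share:.0%}, gzipped: {gz_share:.0%}")
# prints "uncompressed: 57%, gzipped: 48%"
```

On these (old) numbers, roughly half of the bytes sent to a reader are data-mw, whether or not the response is compressed.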

The following needs to be done for a complete switchover to this new format:

  • Add data-mw as a separate json blob in the pagebundle output of Parsoid's API. Note that just like with data-parsoid, Parsoid will emit a version string for this blob. This will also bump the major version number for Parsoid's HTML.
  • Allocate a bucket in RESTBase storage for storing data-mw.
  • Ensure that all Parsoid HTML clients are passing in the Accept: header with the format they are equipped to handle.
  • Ensure that Parsoid HTML clients that use data-mw can handle the new HTML version without inlined data-mw.
  • Implement an HTML2HTML endpoint in Parsoid so that requests for older HTML versions can be honored until all clients have switched over to the data-mw-separated version.
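
As a rough illustration of the first step, the sketch below strips inline data-mw attributes from an HTML string and collects them in a separate blob keyed by element id. It is not Parsoid's implementation: the class and function names are hypothetical, the data-mw payload in the example is simplified, and elements without an id keep their attribute inline because there is nothing to key them by.

```python
import json
from html import escape
from html.parser import HTMLParser


class DataMwSplitter(HTMLParser):
    """Re-serialize HTML without data-mw, collecting removed values by element id."""

    def __init__(self) -> None:
        # convert_charrefs=False passes entities in text through unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.data_mw = {}

    def _emit_tag(self, tag, attrs, closer):
        attr_map = dict(attrs)
        kept = []
        for name, value in attrs:
            if name == "data-mw" and value and attr_map.get("id"):
                # Attribute values arrive already unescaped, so this is plain JSON.
                self.data_mw[attr_map["id"]] = json.loads(value)
                continue
            kept.append(name if value is None else f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join([tag, *kept]) + closer)

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, "/>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def split_data_mw(page_html):
    """Return (html_without_data_mw, {element_id: data_mw_object})."""
    parser = DataMwSplitter()
    parser.feed(page_html)
    parser.close()
    return "".join(parser.out), parser.data_mw


page = (
    '<p id="mw1" typeof="mw:Transclusion" '
    'data-mw=\'{"parts":[{"template":{"target":{"wt":"foo"}}}]}\'>x &amp; y</p>'
)
stripped, blob = split_data_mw(page)
print(stripped)
print(json.dumps(blob))
```

Keying the blob by element id mirrors the "with element ids" note on the RESTBase output above; a client that needs data-mw can re-attach the values by id after fetching both parts.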

In a future iteration, Parsoid will start providing clients with HTML versions of template args (%). It is still unclear whether these HTML template args will be generated as part of the normal parse or on demand. The current thinking is to create a new data-* attribute for this information rather than add it to the data-mw JSON blob. data-mw is currently generated for templates, extensions, and images. Depending on the element, its contents can be wikitext, HTML, strings, or some combination of those. We should resolve whether only the tpl-args-html will go into the separate attribute, or whether there are other ways of splitting up data-mw that would make it most useful for clients.

(%) Clients can already do this right now by passing in the wikitext to the Parsoid wt2html endpoints, but this won't get the benefit of caching.

In terms of timeline and implementation, here is how this work might proceed:

  • Have all known Parsoid clients pass in the accept header as part of their requests.
  • Implement data-mw separation in the Parsoid pagebundle API (along with the version bump for HTML and the initial version for data-mw), but don't turn it on yet.
  • Implement the html2html endpoint in Parsoid.
  • Resolve the question about how to organize the data-mw information into 1 or 2 attributes (and the name of the new attribute).
  • Have RESTBase allocate storage for the new bucket / buckets.
  • Have the most active clients implement support for the split data-mw attribute.
  • Turn on data-mw split in Parsoid and RESTBase.
  • Have clients bump their version numbers to accept the new format (while those that aren't ready will get the old version via Parsoid's html2html endpoint).
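
The Accept-header negotiation that this rollout relies on can be sketched as follows. This is only an illustration under assumptions: the profile URL format in the example, the major-version comparison rule, and the returned action names are all hypothetical, not Parsoid's or RESTBase's actual behavior.

```python
import re
from typing import Optional, Tuple

Version = Tuple[int, int, int]

# Assumed shape: profile="<some spec URL>/<major>.<minor>.<patch>"
_PROFILE_RE = re.compile(r'profile="[^"]*?/(\d+)\.(\d+)\.(\d+)"')


def requested_version(accept: str) -> Optional[Version]:
    """Extract the HTML version a client says it can handle, if any."""
    match = _PROFILE_RE.search(accept)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch))


def plan_response(accept: str, stored: Version) -> str:
    """Decide how to answer a request, given the version held in storage."""
    wanted = requested_version(accept)
    if wanted is None:
        return "no-profile"  # policy for such clients is left to the caller
    if wanted[0] == stored[0]:
        return "serve-stored"
    if wanted[0] < stored[0]:
        return "downgrade-via-html2html"
    return "not-available"


old_client = 'text/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/1.2.1"'
new_client = 'text/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/2.0.0"'

print(plan_response(old_client, (2, 0, 0)))
print(plan_response(new_client, (2, 0, 0)))
```

Under this scheme a client that has not yet bumped its version keeps receiving inlined data-mw through the html2html path, which is why the Accept header has to be in place before the split is turned on.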

Event Timeline

@Arlolra, data-mw-wt would basically be what Parsoid returns right now in production, so no need for clients to change. There is already a provision for encoding multiple formats for each template parameter, but we'd significantly increase the size of the download if we always provided both HTML *and* wikitext.

Etherpad of recent discussion between reading, parsing & services about reducing the HTML size:

Main priorities from the reading side as far as I understood them:

  1. Marking up navbox / reference using a page component like interface (T105845),
  2. providing an API for lazy loading of navboxes / references, and
  3. removing data-mw, slightly simplifying and speeding up processing in the mobile content service & other consumers.

@ssastry: Thanks for updating the task description. I'm broadly in agreement with your outline.

Since our discussion yesterday I have warmed up to your idea of treating the move to HTML as a format evolution of data-mw, rather than a new content-type. If the size penalty is not prohibitive, we could consider serving both HTML *and* wikitext for a while, so it would be backwards-compatible for wikitext-only clients. It seems very possible that the size penalty will be smaller than feared, as the content between HTML and wikitext for most template parameters will be very repetitive. Compression should encode the second copy largely as a pointer to the first one.
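
The compression argument can be checked with a small experiment. The sketch below is a best case, assuming a template argument whose HTML form is textually identical to its wikitext; the `"wt"` / `"html"` key layout and the generated sample text are assumptions made for the illustration.

```python
import json
import random
import zlib

# Deterministic pseudo-random "argument text" so the single copy does not
# compress trivially on its own.
rng = random.Random(0)
letters = "abcdefghijklmnopqrstuvwxyz"
words = [
    "".join(rng.choice(letters) for _ in range(rng.randint(3, 9)))
    for _ in range(300)
]
arg_text = " ".join(words)

wt_only = json.dumps({"wt": arg_text}).encode()
both = json.dumps({"wt": arg_text, "html": arg_text}).encode()

wt_size = len(zlib.compress(wt_only))
both_size = len(zlib.compress(both))

print(f"raw:        {len(wt_only)} -> {len(both)} bytes")
print(f"compressed: {wt_size} -> {both_size} bytes")
```

The raw payload roughly doubles while the compressed payload grows only slightly. Two caveats: real HTML differs from the wikitext wherever markup is involved, and DEFLATE can only point back within a 32 KiB window, so the two copies need to sit close together in the output for the effect to hold.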

As far as the HTML representation of template args is concerned, one issue we discussed is that it would not necessarily match how the arg shows up in the template output. For example, in the transclusion {{foo|*x}}, we cannot know a priori whether the *x will show up as the string "*x" or as a list item <li>x</li>.

One solution to this conundrum is to not worry about the problem at all. We could always parse template args as if they were in SOL (start-of-line) context, while being explicit that this representation exists only for the purpose of editing and might not match how the arg ends up being rendered in the template. In other words, the HTML representation is explicitly a visual editing aid rather than a rendering of how the arg is used in the template. In some cases this might result in surprising WYSIWYG semantics, but that might be an acceptable compromise. @Jdforrester-WMF, FYI.

A more recent study of the effect of a few HTML constructs, including data-mw, is now available at T164033#3479227. It shows that data-mw removal by itself won't result in savings much greater than 12% across a list of 21 popular titles. Difficult-to-process ID attributes are responsible for almost as much size by themselves.

However, the good news is that we can save around 30% of the compressed read HTML size with a few changes that won't remove any content or affect the ability to edit:

  • Move out data-mw.
  • Remove auto-generated mwXX ID attributes. To associate metadata with elements, use path based identifiers (possibly rooted at section ids) instead. Example: ['mw1234',2,5,2], or #mw1234:nth-child(2):nth-child(5):nth-child(2)
  • Remove about attributes from references, and rely on the href instead. Perhaps also consider inlining reference content, and eliminating the href to the reflist as well.
  • Drop rel="mw:Wikilink" and rel="mw:ExtLink". This information is easy to derive from the href attribute (relative vs. fully qualified URL).

There are probably some more opportunities along similar lines, but I think this captures the biggest optimization opportunities.
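
For the path-based identifiers suggested above, one possible reading of ['mw1234',2,5,2] is "start at the element with id mw1234, then take its 2nd child, that child's 5th child, and that one's 2nd child". The sketch below resolves such a path under that assumption; the function name is hypothetical, indices are treated as 1-based like nth-child, and the input is assumed to be well-formed XML so that the standard library parser can be used.

```python
import xml.etree.ElementTree as ET


def resolve_path(root, path):
    """Follow [anchor_id, n1, n2, ...] from the anchor element down its children.

    Returns the matching element, or None if the anchor or any step is missing.
    """
    anchor_id, *steps = path
    node = next((el for el in root.iter() if el.get("id") == anchor_id), None)
    if node is None:
        return None
    for n in steps:
        children = list(node)
        if not 1 <= n <= len(children):
            return None
        node = children[n - 1]  # 1-based, like :nth-child()
    return node


doc = ET.fromstring(
    '<section id="mw1234"><h2>Title</h2><p><b>a</b><i>b</i></p></section>'
)
target = resolve_path(doc, ["mw1234", 2, 2])
print(target.tag, target.text)  # prints "i b"
```

Rooting paths at a section id keeps them short and limits how far an edit elsewhere in the page can invalidate them, which fits the suggestion to anchor at section ids.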