Store & load data-mw separately
Open, Normal, Public, 0 Story Points

Description

In large, template-heavy articles, the data-mw attribute increases the size of the HTML significantly. See (some old) numbers below for the Barack Obama page.

Mobile:
  972K Barack_Obama.mobile.html              # as returned by the mobile site (with some chrome)
  172K Barack_Obama.mobile.html.gz

Parsoid:
  3.5M Barack_Obama.html                     # as returned by Parsoid v1 API
  467K Barack_Obama.html.gz
 1003K Barack_Obama.no_data-mw.html          # without data-parsoid and data-mw

Parsoid v2 API via RESTBase (no data-parsoid, with element ids)
  2.8M Barack_Obama.no_data-parsoid.html     # as returned by restbase
  414K Barack_Obama.no_data-parsoid.html.gz
  1.2M Barack_Obama.no_data-mw.html          # restbase minus data-mw
  214K Barack_Obama.no_data-mw.html.gz

RESTBase data-mw:
  1.6M Barack_Obama.datamw.json
  209K Barack_Obama.datamw.json.gz

This is a lot of overhead for read views which don't need this information.
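To put the overhead in relative terms, the short calculation below uses only the RESTBase figures from the listing above. It assumes that "M" and "K" are the usual `ls -h` units (1M = 1024K); the dictionary keys are illustrative names, not part of any API.

```python
# Sizes in KiB, copied from the RESTBase (Parsoid v2) rows of the listing.
# Assumption: "M" means 1024 KiB, as printed by `ls -h`.
SIZES_KB = {
    "with_data_mw": 2.8 * 1024,      # Barack_Obama.no_data-parsoid.html
    "with_data_mw_gz": 414,          # ... gzipped
    "without_data_mw": 1.2 * 1024,   # Barack_Obama.no_data-mw.html
    "without_data_mw_gz": 214,       # ... gzipped
}


def saving(before, after):
    """Fraction of `before` that disappears when shrinking to `after`."""
    return (before - after) / before


raw_saving = saving(SIZES_KB["with_data_mw"], SIZES_KB["without_data_mw"])
gz_saving = saving(SIZES_KB["with_data_mw_gz"], SIZES_KB["without_data_mw_gz"])
print(f"uncompressed: {raw_saving:.0%}, gzipped: {gz_saving:.0%}")
```

For this one page, dropping data-mw removes roughly 57% of the uncompressed HTML and roughly 48% of the gzipped HTML.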

The following needs to be done for a complete switchover to this new format:

  • Add data-mw as a separate json blob in the pagebundle output of Parsoid's API. Note that just like with data-parsoid, Parsoid will emit a version string for this blob. This will also bump the major version number for Parsoid's HTML.
  • Allocate a bucket in RESTBase storage for storing data-mw.
  • Ensure that all Parsoid HTML clients are passing in the Accept: header with the format they are equipped to handle.
  • Ensure that Parsoid HTML clients that use data-mw can handle the new HTML version without inlined data-mw.
  • Implement an html2html endpoint in Parsoid so that requests for older HTML versions can be honored until all clients have switched over to the version with data-mw separated.
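As a rough illustration of the first step, the sketch below strips data-mw attributes from an HTML string and collects them in a separate dictionary keyed by element id. It is not Parsoid's implementation: the class and function names are hypothetical, the exact layout of the data-mw blob in the pagebundle is an assumption, and the stdlib parser used here drops comments and doctypes.

```python
import json
from html import escape
from html.parser import HTMLParser


class DataMwSplitter(HTMLParser):
    """Re-emit HTML without data-mw, collecting removed values by element id."""

    def __init__(self):
        # Keep entity references as-is so text content round-trips unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.data_mw = {}

    def _emit_tag(self, tag, attrs, closer=">"):
        elem_id = dict(attrs).get("id")
        kept = []
        for name, value in attrs:
            # Only move data-mw out when there is an id to key it by.
            if name == "data-mw" and elem_id is not None:
                self.data_mw[elem_id] = json.loads(value)
                continue
            kept.append(name if value is None else f'{name}="{escape(value)}"')
        self.out.append("<" + " ".join([tag] + kept) + closer)

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, closer="/>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def split_data_mw(html):
    """Return (html_without_data_mw, {element_id: data_mw_object})."""
    parser = DataMwSplitter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out), parser.data_mw


# Illustrative input: one transclusion with inline data-mw.
blob = {"parts": [{"template": {"target": {"wt": "foo", "href": "./Template:Foo"},
                                "params": {"1": {"wt": "x"}}, "i": 0}}]}
SAMPLE = f'<p id="mwAQ" typeof="mw:Transclusion" data-mw="{escape(json.dumps(blob))}">x</p>'

stripped_html, separate_blob = split_data_mw(SAMPLE)
print(stripped_html)
print(json.dumps(separate_blob))
```

Read views would receive only the stripped HTML, while editing clients would fetch the blob as well and look entries up by element id.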

In a future iteration, Parsoid will start providing clients with HTML versions of template args (%). It is still unclear whether these HTML template args will be generated as part of the normal parse or on demand. The current thinking is to create a new data-* attribute for this information rather than add it to the data-mw JSON blob. data-mw is currently generated for templates, extensions, and images. Its content can be wikitext, HTML, strings, or some combination of those, depending on the element. We should resolve whether only the tpl-args-html will go into the separate attribute, or whether there are other ways of splitting up data-mw that make it most useful for clients.

(%) Clients can already do this right now by passing in the wikitext to the Parsoid wt2html endpoints, but this won't get the benefit of caching.

In terms of timeline and implementation, here is how this work might proceed:

  • Have all known Parsoid clients pass in the accept header as part of their requests.
  • Implement data-mw separation in the Parsoid pagebundle API (along with the version bump for the HTML and the initial version for data-mw), but don't turn it on yet.
  • Implement the html2html endpoint in Parsoid.
  • Resolve the question about how to organize the data-mw information into 1 or 2 attributes (and the name of the new attribute).
  • Have RESTBase allocate storage for the new bucket / buckets.
  • Have the most active clients implement support for the split data-mw attribute.
  • Turn on data-mw split in Parsoid and RESTBase.
  • Have clients bump their version numbers to accept the new format (while those that aren't ready will get the old version via Parsoid's html2html endpoint).
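The version negotiation implied by the steps above could look roughly like the following. This is only a sketch: the major version at which data-mw is split out (`SPLIT_MAJOR`), the profile strings in the usage lines, and the function names are assumptions for illustration, not values taken from Parsoid or RESTBase.

```python
import re

# Hypothetical: the first HTML major version that ships data-mw separately.
SPLIT_MAJOR = 2

# Pull a semantic version out of an Accept header's profile parameter,
# e.g. profile="mediawiki.org/specs/html/1.2.0" (example format, assumed).
PROFILE_RE = re.compile(r'profile="[^"]*?/(\d+)\.(\d+)\.(\d+)"')


def requested_version(accept_header):
    """Return the (major, minor, patch) named in the profile, or None."""
    match = PROFILE_RE.search(accept_header or "")
    return tuple(int(part) for part in match.groups()) if match else None


def plan_response(accept_header):
    """Decide whether to serve the split format or fall back to html2html."""
    version = requested_version(accept_header)
    if version is None or version[0] < SPLIT_MAJOR:
        # Client has not opted in: re-inline data-mw via the html2html endpoint.
        return "downgrade"
    return "serve-split"


print(plan_response('text/html; profile="mediawiki.org/specs/html/1.2.0"'))
print(plan_response('text/html; profile="mediawiki.org/specs/html/2.0.0"'))
```

Treating a missing Accept header as a request for the old format keeps clients that have not been updated working, which is why the first step in the list asks all known clients to send the header.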

Arlolra added a subscriber: Arlolra. Mar 5 2015, 9:23 PM

More discussion on moving out data-mw in T54936.

GWicke updated the task description. Mar 13 2015, 3:01 AM
GWicke edited a custom field.
GWicke lowered the priority of this task from High to Normal. Mar 15 2015, 4:29 PM
GWicke moved this task from Backlog to Future on the RESTBase board. Mar 17 2015, 8:24 PM
Arlolra claimed this task. Apr 2 2015, 5:53 PM
GWicke added a comment. Apr 2 2015, 7:03 PM

@Arlolra, should we aim for storing wikitext & html variants separately right from the start?

@GWicke: can you clarify what you mean?

GWicke added a comment. Edited Apr 2 2015, 8:47 PM

@Arlolra, storing data-mw-wt and data-mw-html separately. The 'wt' variant would have all template parameters etc as wikitext, while the 'html' variant would have it as HTML. Motivation is the size of HTML parameters in particular, and the high probability of clients only needing one variant at a time.

Oh, ok, you're talking template parameters. Is it desirable to offer the two varieties in the long run? Or is this change a spec version bump?

GWicke added a comment. Apr 2 2015, 8:54 PM

@Arlolra, data-mw-wt would basically be what Parsoid returns right now in production, so no need for clients to change. There is already a provision for encoding multiple formats for each template parameter, but we'd significantly increase the size of the download if we always provided both HTML *and* wikitext.
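A sketch of what deriving the two variants might look like, assuming each template parameter can carry both a `wt` and an `html` entry as discussed above. The function name and the sample blob are hypothetical, and the `html` key is taken from this discussion rather than from what Parsoid emits today.

```python
import copy


def select_variant(data_mw, keep):
    """Return a copy of a transclusion data-mw blob keeping one param format.

    `keep` is the format key to retain for each parameter, e.g. "wt" or "html".
    """
    result = copy.deepcopy(data_mw)
    for part in result.get("parts", []):
        if not isinstance(part, dict):
            continue  # plain strings between templates carry no params
        template = part.get("template")
        if template is None:
            continue
        template["params"] = {
            name: {fmt: text for fmt, text in value.items() if fmt == keep}
            for name, value in template.get("params", {}).items()
        }
    return result


# Illustrative blob with both formats for one parameter.
both_formats = {"parts": [{"template": {
    "target": {"wt": "foo", "href": "./Template:Foo"},
    "params": {"1": {"wt": "''x''", "html": "<i>x</i>"}},
    "i": 0,
}}]}

data_mw_wt = select_variant(both_formats, "wt")
data_mw_html = select_variant(both_formats, "html")
print(data_mw_wt["parts"][0]["template"]["params"])
print(data_mw_html["parts"][0]["template"]["params"])
```

A client would then request only the variant it needs, which is the size argument made above.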

ssastry moved this task from In Progress to Backlog on the Parsoid board. Jun 16 2015, 9:16 PM
marcoil removed a subscriber: marcoil. Jun 17 2015, 7:12 AM
ssastry moved this task from Backlog to Next Up on the Parsoid board. Dec 17 2015, 5:43 PM
jmadler added a subscriber: jmadler. Jan 6 2016, 5:13 AM
GWicke added a comment. Edited Jan 12 2016, 11:49 PM

Etherpad of recent discussion between reading, parsing & services about reducing the HTML size:

https://etherpad.wikimedia.org/p/htmlsize

Main priorities from the reading side as far as I understood them:

  1. Marking up navbox / reference using a page component like interface (T105845),
  2. providing an API for lazy loading of navboxes / references, and
  3. removing data-mw, slightly simplifying and speeding up processing in the mobile content service & other consumers.
GWicke added a subscriber: phuedx.
ssastry updated the task description. Mar 10 2016, 11:39 PM

@ssastry: Thanks for updating the task description. I'm broadly in agreement with your outline.

Since our discussion yesterday I have warmed up to your idea of treating the move to HTML as a format evolution of data-mw, rather than a new content-type. If the size penalty is not prohibitive, we could consider serving both HTML *and* wikitext for a while, so it would be backwards-compatible for wikitext-only clients. It seems very possible that the size penalty will be smaller than feared, as the content between HTML and wikitext for most template parameters will be very repetitive. Compression should encode the second copy largely as a pointer to the first one.
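The compression argument can be checked informally with synthetic data. The sketch below builds fake template parameters from a fixed random vocabulary, serializes them once with wikitext only and once with wikitext plus a near-identical HTML copy, and compares the deflate-compressed sizes. The data is made up and the HTML is just the wikitext wrapped in `<p>`, so this shows the mechanism only and says nothing about real pages.

```python
import json
import random
import zlib

rng = random.Random(0)  # fixed seed so the synthetic data is reproducible

# A small made-up vocabulary of random lowercase "words".
vocab = [
    "".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(rng.randint(4, 9)))
    for _ in range(200)
]


def fake_param():
    """Return a 20-word synthetic parameter value."""
    return " ".join(rng.choice(vocab) for _ in range(20))


# 50 parameters, first with wikitext only, then with wikitext and HTML.
params_wt = {str(i): {"wt": fake_param()} for i in range(1, 51)}
params_both = {
    name: {"wt": value["wt"], "html": f"<p>{value['wt']}</p>"}
    for name, value in params_wt.items()
}


def gz_size(obj):
    """Size in bytes of the deflate-compressed JSON serialization."""
    return len(zlib.compress(json.dumps(obj).encode("utf-8"), 9))


wt_only = gz_size(params_wt)
both = gz_size(params_both)
print(f"wt only: {wt_only} B, wt+html: {both} B, overhead: {both / wt_only - 1:.0%}")
```

Because each HTML value repeats text that appeared a few bytes earlier, the compressor can encode it as a back-reference, so the compressed size grows far less than the uncompressed size does. Real HTML parameters differ more from their wikitext, so the actual penalty would need to be measured on real pages.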

ssastry updated the task description. Mar 12 2016, 6:41 PM

As far as HTML representation of template args is concerned, one issue we were talking about was that the HTML representation would not be how it might show up in the template output. For example, in the transclusion {{foo|*x}}, we cannot know a priori if the *x will show up as the string "*x" or as a list item <li>x</li>.

One solution to this conundrum is to ignore the problem. We could always parse the template args as if they were in SOL context, while being explicit that this representation exists only for the purpose of editing and might not be how the arg ends up in the template output. That is, the HTML representation is explicitly a visual editing aid rather than a reflection of how the arg is used in the template. In some cases this might result in surprising WYSIWYG semantics, but that might be an acceptable compromise. @Jdforrester-WMF, FYI.

Elitre added a subscriber: Elitre. Mar 23 2016, 2:49 PM
Arlolra moved this task from Next Up to In Progress on the Parsoid board. Apr 8 2016, 5:27 PM
RandomDSdevel rescinded a token.
RandomDSdevel awarded a token.
Arlolra removed Arlolra as the assignee of this task. Apr 27 2016, 5:33 PM
Restricted Application added a project: ContentTranslation. Oct 12 2016, 3:59 PM
GWicke updated the task description. Oct 12 2016, 5:48 PM
Amire80 moved this task from Backlog to Upstream on the ContentTranslation board. Oct 31 2016, 1:50 PM
Arlolra moved this task from In Progress to Backlog on the Parsoid board. Jan 4 2017, 1:16 AM
GWicke moved this task from next to later on the Services board. Jul 12 2017, 11:54 PM
GWicke edited projects, added Services (later); removed Services (next).
GWicke added a comment. Edited Jul 27 2017, 8:24 PM

A more recent study of the effect of a few HTML constructs, including data-mw, is now available at T164033#3479227. It shows that data-mw removal by itself won't result in savings much greater than 12% across a list of 21 popular titles. Difficult-to-process ID attributes are responsible for almost as much size by themselves.

However, the good news is that we can save around 30% compressed read HTML size with a few changes that won't remove any content, or affect the ability to edit:

  • Move out data-mw.
  • Remove auto-generated mwXX ID attributes. To associate metadata with elements, use path based identifiers (possibly rooted at section ids) instead. Example: ['mw1234',2,5,2], or #mw1234:nth-child(2):nth-child(5):nth-child(2)
  • Remove about attributes from references, and rely on the href instead. Perhaps also consider inlining reference content, and eliminating the href to the reflist as well.
  • Drop rel="mw:Wikilink" and rel="mw:ExtLink". This information is easy to derive from the href attribute (relative vs. fully qualified URL).
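The path-based identifiers from the second bullet could be resolved along these lines. This is a sketch under assumptions: it treats each number as a 1-based child index (as `:nth-child()` does), it parses a small well-formed XML fragment with the standard library rather than real HTML5, and the function names are hypothetical. The selector it prints uses child combinators (` > `), which is one reading of the compact notation in the bullet.

```python
import xml.etree.ElementTree as ET


def resolve_path(root, path):
    """Resolve ['anchor-id', n1, n2, ...] to an element, or None.

    Each number is a 1-based child index, mirroring CSS :nth-child().
    """
    anchor_id, *steps = path
    node = next((el for el in root.iter() if el.get("id") == anchor_id), None)
    for step in steps:
        if node is None or not 1 <= step <= len(node):
            return None
        node = node[step - 1]
    return node


def to_selector(path):
    """Render the same path as a CSS selector using child combinators."""
    anchor_id, *steps = path
    return f"#{anchor_id}" + "".join(f" > :nth-child({n})" for n in steps)


# Illustrative fragment: only the section carries an id.
DOC = (
    '<section id="mw1234"><p>intro</p>'
    "<ul><li>a</li><li><b>b1</b><i>b2</i></li></ul></section>"
)
root = ET.fromstring(DOC)

target = resolve_path(root, ["mw1234", 2, 2, 1])
print(target.tag, target.text)
print(to_selector(["mw1234", 2, 5, 2]))
```

Metadata such as data-mw entries could then be keyed by path instead of by a per-element id, at the cost of paths becoming stale whenever the DOM under the anchor changes.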

There are probably some more opportunities along similar lines, but I think this captures the biggest optimization opportunities.
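For the last bullet in the list above, deriving the link type from the href alone might look like the sketch below. It assumes that wiki links are always emitted as relative URLs (for example with a leading `./`) and that anything with a scheme or host is external; the function name and return labels are illustrative.

```python
from urllib.parse import urlsplit


def link_kind(href):
    """Classify a link as 'extlink' or 'wikilink' from its href alone."""
    parts = urlsplit(href)
    # A scheme (https:, mailto:) or a host (//example.org) marks an external link.
    return "extlink" if parts.scheme or parts.netloc else "wikilink"


for href in ["./Barack_Obama", "./Template:Foo", "https://example.org/", "//example.org/x"]:
    print(href, "->", link_kind(href))
```

The leading `./` matters here: without it, a title such as `Template:Foo` could be mistaken for a URL with the scheme `Template`.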