In large, template-heavy articles, the data-mw attribute increases the size of the HTML significantly. See the (somewhat dated) numbers below for the Barack Obama page.
Mobile:
  972K   Barack_Obama.mobile.html               # as returned by the mobile site (with some chrome)
  172K   Barack_Obama.mobile.html.gz

Parsoid:
  3.5M   Barack_Obama.html                      # as returned by Parsoid v1 API
  467K   Barack_Obama.html.gz
  1003K  Barack_Obama.no_data-mw.html           # without data-parsoid and data-mw

Parsoid v2 API via RESTBase (no data-parsoid, with element ids):
  2.8M   Barack_Obama.no_data-parsoid.html      # as returned by restbase
  414K   Barack_Obama.no_data-parsoid.html.gz
  1.2M   Barack_Obama.no_data-mw.html           # restbase minus data-mw
  214K   Barack_Obama.no_data-mw.html.gz

RESTBase data-mw:
  1.6M   Barack_Obama.datamw.json
  209K   Barack_Obama.datamw.json.gz
This is a lot of overhead for read views, which do not need this information.
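To put a number on that overhead, the short sketch below computes the share of the RESTBase payload that is attributable to data-mw, using only the sizes listed above. The helper names are made up for illustration, and `M` is assumed to mean 1024K.

```python
def to_kb(size: str) -> float:
    """Convert a size such as '2.8M' or '414K' to kilobytes (assumes M = 1024K)."""
    value, unit = float(size[:-1]), size[-1]
    return value * 1024 if unit == "M" else value


def data_mw_share(with_data_mw: str, without_data_mw: str) -> float:
    """Fraction of the full payload that disappears when data-mw is stripped."""
    full, stripped = to_kb(with_data_mw), to_kb(without_data_mw)
    return (full - stripped) / full


# RESTBase numbers from the listing: HTML with and without inlined data-mw.
raw = data_mw_share("2.8M", "1.2M")
gz = data_mw_share("414K", "214K")

print(f"uncompressed: {raw:.0%} of the payload is data-mw")  # 57%
print(f"gzipped:      {gz:.0%} of the payload is data-mw")   # 48%
```

By these figures, roughly half of what a read-only client downloads is data-mw, even after gzip.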
The following needs to be done for a complete switchover to this new format:
- Add data-mw as a separate JSON blob in the pagebundle output of Parsoid's API. As with data-parsoid, Parsoid will emit a version string for this blob. This change will also bump the major version number of Parsoid's HTML.
- Allocate a bucket in RESTBase storage for storing data-mw.
- Ensure that all Parsoid HTML clients pass an Accept: header naming the format they are equipped to handle.
- Ensure that Parsoid HTML clients that use data-mw can handle the new HTML version without inlined data-mw.
- Implement an html2html endpoint in Parsoid so that requests for older HTML versions can still be honored until all clients have switched over to the version with data-mw separated out.
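As an illustration of what the separation itself amounts to, here is a minimal sketch that strips inline data-mw attributes from an HTML string and collects them in a separate object keyed by element id. It is not Parsoid's implementation. It assumes that every element carrying data-mw also has an id, it drops comments and doctype declarations for brevity, and the shape of the resulting blob (a flat id-to-object map) is an assumption made for this example.

```python
import json
from html import escape
from html.parser import HTMLParser


class DataMwSplitter(HTMLParser):
    """Rewrite HTML without data-mw, collecting the removed values by element id."""

    def __init__(self):
        # Keep character references as written instead of decoding them in text.
        super().__init__(convert_charrefs=False)
        self.out = []       # pieces of the rewritten HTML
        self.data_mw = {}   # element id -> parsed data-mw object

    def _open_tag(self, tag, attrs, closer):
        attr_map = dict(attrs)
        kept = []
        for name, value in attrs:
            if name == "data-mw":
                # Assumption: an element with data-mw always has an id.
                self.data_mw[attr_map["id"]] = json.loads(value)
            elif value is None:
                kept.append(f" {name}")
            else:
                kept.append(f' {name}="{escape(value)}"')
        self.out.append(f"<{tag}{''.join(kept)}{closer}")

    def handle_starttag(self, tag, attrs):
        self._open_tag(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._open_tag(tag, attrs, "/>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def split_data_mw(html):
    """Return (html_without_data_mw, data_mw_by_id)."""
    parser = DataMwSplitter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out), parser.data_mw


page = (
    '<p id="mwAg">Hi <span id="mwAw" typeof="mw:Transclusion" '
    'data-mw=\'{"parts":[{"template":{"target":{"wt":"echo"}}}]}\'>x</span></p>'
)
stripped_html, blob = split_data_mw(page)
print(stripped_html)
print(json.dumps(blob))
```

A read-only client would fetch only the stripped HTML, while an editing client would also fetch the blob and look entries up by element id.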
In a future iteration, Parsoid will start providing clients with HTML versions of template args (%). It is still unclear whether these HTML template args will be generated as part of a normal parse or on demand. The current thinking, however, is to create a new data-* attribute for this information rather than add it to the data-mw JSON blob. data-mw is currently generated for templates, extensions, and images. Depending on the element, this information can be wikitext, HTML, strings, or some combination of those. We should resolve whether only the tpl-args-html will go into the separate attribute, or whether there are other ways of splitting up data-mw that make it more useful for clients.
(%) Clients can already do this right now by passing in the wikitext to the Parsoid wt2html endpoints, but this won't get the benefit of caching.
In terms of timeline and implementation, here is how this work might proceed:
- Have all known Parsoid clients pass in the Accept header as part of their requests.
- Implement data-mw separation in the Parsoid pagebundle API (along with the version bump for the HTML and the initial version for data-mw), but don't turn it on yet.
- Implement the html2html endpoint in Parsoid.
- Resolve the question of how to organize the data-mw information into one or two attributes (and the name of the new attribute).
- Have RESTBase allocate storage for the new bucket / buckets.
- Have the most active clients implement support for the split data-mw attribute.
- Turn on data-mw split in Parsoid and RESTBase.
- Have clients bump their version numbers to accept the new format (while those that aren't ready will get the old version via Parsoid's html2html endpoint).
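The last two steps depend on the server deciding, per request, whether a client can take the split format or needs the old one. The sketch below shows one way that decision could look. The profile URL, the version numbers, the `SPLIT_MAJOR` constant, and the policy for a missing header are all hypothetical and chosen only for this example.

```python
import re

# Hypothetical: the first major HTML version in which data-mw is split out.
SPLIT_MAJOR = 2

# Assumption: the Accept header carries a profile parameter ending in x.y.z.
PROFILE_VERSION = re.compile(r'profile="[^"]*?(\d+)\.(\d+)\.(\d+)"')


def requested_version(accept):
    """Extract (major, minor, patch) from an Accept header, or None if absent."""
    match = PROFILE_VERSION.search(accept or "")
    return tuple(int(part) for part in match.groups()) if match else None


def plan_response(accept):
    """Decide how to answer a request, given what is stored is the split format."""
    version = requested_version(accept)
    if version is None:
        # Assumption for this sketch: no stated version means "latest".
        return "serve-split"
    if version[0] >= SPLIT_MAJOR:
        return "serve-split"
    # Older clients get inlined data-mw back via the html2html conversion.
    return "downgrade-via-html2html"


old_client = 'text/html; charset=utf-8; profile="https://example.org/specs/html/1.2.1"'
new_client = 'text/html; charset=utf-8; profile="https://example.org/specs/html/2.0.0"'

print(plan_response(old_client))  # downgrade-via-html2html
print(plan_response(new_client))  # serve-split
```

Keying the decision on the major version alone matches the plan to bump the major version when data-mw is split out; minor and patch numbers are parsed but not used here.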