In most pages with little templated content, `data-parsoid` removal is proving to be very effective. In a dump of enwiki, v2 HTML uses a bit over 100G in Cassandra (lz4 compression), while data-parsoid uses a bit over 60G.
In large, template-heavy articles, however, the `data-mw` attribute still increases the size of the HTML significantly. See the (somewhat old) numbers below for the Barack Obama page:
```
972K  Barack_Obama.mobile.html           # as returned by the mobile site (with some chrome)
3.5M  Barack_Obama.html                  # as returned by Parsoid v1 API
2.7M  Barack_Obama.no_data-parsoid.html  # without data-parsoid
1003K Barack_Obama.no_data-mw.html       # without data-parsoid and data-mw
994K  Barack_Obama.no_pageprops.html     # without data-parsoid, data-mw and pageprops
946K  Barack_Obama.no_linkrel.html       # without data-parsoid, data-mw, pageprops & link rel

Parsoid v2 API via RESTBase (no data-parsoid, with element ids):
2.8M  Barack_Obama.no_data-parsoid.html  # as returned by restbase
1.2M  Barack_Obama.no_data-mw.html       # restbase minus data-mw
1.1M  Barack_Obama.no_pageprops.html     # restbase minus data-mw and pageprops
212K  Barack_Obama.no_pageprops.html.gz
1.1M  Barack_Obama.no_linkrel.html       # restbase minus data-mw, pageprops & link rel
210K  Barack_Obama.no_linkrel.html.gz

RESTBase data-mw:
1.6M  Barack_Obama.datamw.json
```
This is a lot of overhead for read views, which don't need this information.
The following things need to be done for a complete switchover to this new format (HTML without inlined data-mw):
* Add data-mw as a separate JSON blob in the pagebundle output of Parsoid's API. Just like with data-parsoid, Parsoid will emit a version string for this blob. This will also bump the major version number for Parsoid's HTML.
* Allocate a bucket in RESTBase storage for storing data-mw.
* Ensure that all Parsoid HTML clients pass in the `Accept` header with the format they are equipped to handle.
* Ensure that Parsoid HTML clients that use data-mw are able to handle the new HTML version without inlined data-mw.
* Implement an HTML2HTML endpoint in Parsoid so that requests for older HTML versions can be honored until all clients have switched over to the data-mw-separated version.
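The first requirement above, moving data-mw out of the HTML into a separate JSON blob, can be illustrated with a minimal sketch. It is not Parsoid's implementation: the `DataMwSplitter` class, the `split_data_mw` helper, and the blob layout (a plain mapping from element id to the parsed data-mw object) are assumptions made for illustration, using only Python's standard library.

```python
import json
from html import escape
from html.parser import HTMLParser


class DataMwSplitter(HTMLParser):
    """Copy HTML through, moving data-mw attributes into a side dictionary
    keyed by element id (hypothetical layout, for illustration only)."""

    def __init__(self):
        # Keep character references as-is so text passes through verbatim.
        super().__init__(convert_charrefs=False)
        self.out = []       # pieces of the stripped HTML
        self.data_mw = {}   # element id -> parsed data-mw object

    def _emit_tag(self, tag, attrs, closing=""):
        attr_map = dict(attrs)
        kept = []
        for name, value in attrs:
            if name == "data-mw" and "id" in attr_map:
                # Assumption: elements carrying data-mw also have an id.
                # Without an id, the attribute is simply left inline.
                self.data_mw[attr_map["id"]] = json.loads(value)
                continue
            kept.append(name if value is None
                        else f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join([tag] + kept) + closing + ">")

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, closing="/")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def split_data_mw(html):
    """Return (html_without_data_mw, data_mw_by_id)."""
    splitter = DataMwSplitter()
    splitter.feed(html)
    splitter.close()
    return "".join(splitter.out), splitter.data_mw
```

A read view would then be served only the stripped HTML, while an editing client such as VE would fetch the blob as well and re-associate entries with elements by id. Serializing the blob with `json.dumps` and comparing its size against the stripped HTML gives the same kind of breakdown as the numbers above.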
On the Barack Obama page, apart from the large number of transclusions, a contributing factor to the size of data-mw is that data-mw values still contain a lot of embedded data-parsoid attributes (which should be removed). Without data-parsoid, the sizes for data-mw are:
```
1.2M Barack_Obama.datamw.no_dataparsoid.json
178K Barack_Obama.datamw.no_dataparsoid.json.gz
```
This means that we need to extract `data-mw` as well before we can use Parsoid HTML for page views in production. Much of the infrastructure for that is already in place (RESTBase has a bucket and API for it already), but there is still a good amount of front-end work to be done for Parsoid-HTML-based views. VE will need to load data-mw separately, and will need to send it back separately.
Data-mw removal is also a precondition for providing parsed (HTML) template parameters, as these will increase the size of data-mw.
Another interesting fact is that Parsoid HTML can be slightly smaller than PHP output (both compressed and uncompressed) if we leave out element ids: 210k / 1.1M with ids vs. 170k / 946k without ids. It might be worth looking into only setting element ids on top-level elements and possibly using some other technique like XPaths to associate metadata with nested elements. This is further discussed in T87556.
In a future iteration, Parsoid will start providing clients with HTML versions of template args (%). It is still unclear whether these HTML template args will be generated as part of the normal parse or on demand. The current thinking, however, is to create a new data-* attribute to provide this information rather than add it to the data-mw JSON blob. data-mw is currently generated for templates, extensions, and images. This information can be wikitext, HTML, strings, or some combination of those, depending on the element. One thing that would be useful to resolve is whether only the tpl-args-html will be part of the separate attribute, or whether there are other ways of splitting up data-mw that make it more useful for clients.
(%) Clients can already do this right now by passing the wikitext to the Parsoid wt2html endpoints, but this won't get the benefit of caching.
In terms of timeline and implementation, here is how this work might proceed:
* Have all known Parsoid clients pass in the Accept header as part of their requests.
* Implement data-mw separation in the Parsoid pagebundle API (along with the version bump for HTML and the initial version for data-mw), but don't turn it on yet.
* Implement the html2html endpoint in Parsoid.
* Resolve the question of how to organize the data-mw information into 1 or 2 attributes (and the name of the new attribute).
* Have RESTBase allocate storage for the new bucket / buckets.
* Have the most active clients implement support for the split data-mw attribute.
* Turn on the data-mw split in Parsoid and RESTBase.
* Have clients bump their version numbers to accept the new format (those that aren't ready will get the old version via Parsoid's html2html endpoint).
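The last two steps amount to version negotiation on the Accept header. The sketch below shows one way that decision could look. Everything in it is an assumption for illustration: the `profile="…/MAJOR.MINOR.PATCH"` parameter format, the major version number `3` standing for "data-mw split out", the function names, and the choice to treat clients that state no version as old-format clients.

```python
import re

# Hypothetical: major version 3 is the first HTML version without inlined data-mw.
SPLIT_MAJOR = 3

# Assumed header shape: ...; profile="<spec URL ending in /MAJOR.MINOR.PATCH>"
_PROFILE_VERSION = re.compile(r'profile="[^"]*/(\d+)\.(\d+)\.(\d+)"')


def requested_major(accept_header):
    """Return the major HTML version named in the Accept header's profile
    parameter, or None if the client did not state one."""
    match = _PROFILE_VERSION.search(accept_header or "")
    return int(match.group(1)) if match else None


def plan_response(accept_header, split_enabled=True):
    """Decide how to answer a request: 'inline', 'split', or 'html2html-downgrade'."""
    if not split_enabled:
        # The split is implemented but not turned on yet.
        return "inline"
    major = requested_major(accept_header)
    if major is None or major < SPLIT_MAJOR:
        # Client has not bumped its version: convert stored HTML back to
        # the older, data-mw-inlined version via html2html.
        return "html2html-downgrade"
    return "split"
```

With the split turned on, a client sending `profile=".../3.0.0"` would get the split format, while a client still sending `profile=".../2.1.0"` (or no profile at all) would be routed through the html2html downgrade. This is why having every client send an Accept header is listed as the first step.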