CAVEAT: For talk pages, you may have to strip the reply links from the core/legacy HTML before gathering stats since there isn't a way to suppress them in output.
With that caveat, and looking at the current results at T272331#7934893, a few things pop out:
* From the second chart: the mean of Parsoid-HTML-size / legacy-HTML-size is 1.2x and, eyeballing the chart, about 1.4x covers the 95th percentile of Parsoid page size bloat.
* From the third chart: eyeballing it, p75 of Parsoid-stripped-HTML-size / legacy-HTML-size appears to be about 1. So, for roughly a quarter of pages the Parsoid HTML is larger even after stripping, and for a small fraction the bloat is over 1.2x.
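The numbers above are read off the charts. If the raw per-page sizes are available, the mean and percentiles could be computed directly. This is a minimal sketch, assuming the sizes come as (parsoid_bytes, legacy_bytes) pairs; the function name and input shape are hypothetical and not part of any existing tooling.

```python
from statistics import mean, quantiles


def ratio_summary(sizes):
    """Summarize Parsoid/legacy size ratios.

    `sizes` is an iterable of (parsoid_bytes, legacy_bytes) pairs.
    This input shape is an assumption made for illustration.
    """
    # Skip pages with an empty legacy render to avoid dividing by zero.
    ratios = [parsoid / legacy for parsoid, legacy in sizes if legacy > 0]
    # quantiles(..., n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = quantiles(ratios, n=100, method="inclusive")
    return {"mean": mean(ratios), "p75": cuts[74], "p95": cuts[94]}
```

Passing the stripped Parsoid sizes as the first element of each pair gives the same summary for the third chart.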
So, it would be useful to analyze this a bit more:
* Generate a set of pages where Parsoid-HTML-size / legacy-HTML-size > 1.4x so we can understand what in the Parsoid output is causing this and whether there is something we can do about it.
* Similarly, generate a set of pages where Parsoid-stripped-HTML-size / legacy-HTML-size > 1.1x so we can understand what in Parsoid is causing this and whether there is something we can do about it. The 1.1x threshold is semi-arbitrary; it assumes a 10% size penalty might be acceptable for now. We can revisit this in the future, if necessary.
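Both page sets come down to the same filter with a different numerator and threshold. This is a rough sketch, assuming each page's sizes have already been collected into a record; the `PageSizes` type, its field names, and `pages_over` are hypothetical names for illustration.

```python
from typing import Iterable, List, NamedTuple, Tuple


class PageSizes(NamedTuple):
    # Hypothetical record; all sizes are in bytes.
    title: str
    legacy: int
    parsoid: int
    parsoid_stripped: int


def pages_over(pages: Iterable[PageSizes], threshold: float,
               stripped: bool = False) -> List[Tuple[str, float]]:
    """Return (title, ratio) for pages whose ratio exceeds `threshold`, worst first."""
    hits = []
    for page in pages:
        if page.legacy <= 0:
            continue  # no meaningful ratio without a legacy size
        size = page.parsoid_stripped if stripped else page.parsoid
        ratio = size / page.legacy
        if ratio > threshold:
            hits.append((page.title, ratio))
    # Sort so the most bloated pages are inspected first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

With this, `pages_over(pages, 1.4)` would give the first set and `pages_over(pages, 1.1, stripped=True)` the second.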
Expected outcome of this task: recommendations (tasks filed) as to what needs additional investigation / fixing. That could also include recommendations as to what else to strip on some of those bloaty pages, which might bring Parsoid-stripped-HTML-size / legacy-HTML-size to <= 1 on a larger fraction of pages.