
Analyze rt-testing data and identify pages whose Parsoid HTML sizes are "outside limits"
Closed, Resolved (Public)

Description

CAVEAT: For talk pages, you may have to strip the reply links from the core/legacy HTML before gathering stats since there isn't a way to suppress them in output.

With that caveat above, and looking at the current results at T272331#7934893 , a few things pop out:

  • From the second chart: the mean of Parsoid-HTML-size / legacy-HTML-size is 1.2x and, eyeballing the chart, about 1.4x covers the 95th percentile of Parsoid page size bloat.
  • From the third chart: eyeballing it, p75 of Parsoid-stripped-HTML-size / legacy-HTML-size appears to be about 1. So, for a quartile of pages the Parsoid HTML is larger even after stripping, and for a small fraction the bloat is over 1.2x.
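Rather than eyeballing the charts, the same numbers could be computed directly from the size data. The following is a minimal sketch, assuming the rt-testing results can be exported as (Parsoid size, legacy size) pairs in bytes; the function name `ratio_summary` and the input shape are hypothetical and not part of the rt-testing tooling.

```python
import statistics


def ratio_summary(sizes):
    """Summarize Parsoid/legacy size ratios.

    `sizes` is an iterable of (parsoid_bytes, legacy_bytes) pairs (an assumed
    input shape). Pairs with a non-positive legacy size are skipped. At least
    two usable pairs are required, otherwise statistics.quantiles raises.
    """
    ratios = sorted(p / l for p, l in sizes if l > 0)
    # n=100 yields 99 cut points; index 74 is p75 and index 94 is p95.
    cuts = statistics.quantiles(ratios, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(ratios),
        "p75": cuts[74],
        "p95": cuts[94],
    }
```

Calling it once with the unstripped Parsoid sizes and once with the stripped sizes would give figures comparable to the second and third charts respectively.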

So, it would be useful to analyze this a bit more:

  • Generate a set of pages where Parsoid-HTML-size / legacy-HTML-size > 1.4x so we can understand what in Parsoid output is causing this and if there is something we can do here.
  • Similarly, generate a set of pages where Parsoid-stripped-HTML-size / legacy-HTML-size > 1.1x so we can understand what in Parsoid is causing this and if there is something we can do here. The 1.1x is semi-arbitrary, assuming a 10% penalty might be acceptable for now. We can revisit this in the future, if necessary.
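The two page sets above could be generated with a simple filter over per-page size records. This is only a sketch under assumptions: the `PageSizes` record, its field names, and the `find_outliers` function are hypothetical, and the thresholds default to the 1.4x and 1.1x values proposed in this task.

```python
from dataclasses import dataclass


@dataclass
class PageSizes:
    # Hypothetical record; all sizes are in bytes.
    title: str
    legacy: int
    parsoid: int
    parsoid_stripped: int


def find_outliers(pages, raw_limit=1.4, stripped_limit=1.1):
    """Return (raw_outliers, stripped_outliers), each sorted worst first."""
    raw, stripped = [], []
    for page in pages:
        if page.legacy <= 0:
            continue  # no meaningful ratio without a legacy size
        if page.parsoid / page.legacy > raw_limit:
            raw.append(page)
        if page.parsoid_stripped / page.legacy > stripped_limit:
            stripped.append(page)
    raw.sort(key=lambda p: p.parsoid / p.legacy, reverse=True)
    stripped.sort(key=lambda p: p.parsoid_stripped / p.legacy, reverse=True)
    return raw, stripped
```

Sorting worst first means the pages at the top of each list are the most useful ones to inspect by hand.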

Expected outcome of this task: Recommendations (tasks filed) as to what needs additional investigation / fixing. That could also include recommendations as to what else to strip on some of those bloaty pages that might make the Parsoid-stripped-HTML-size / legacy-HTML-size <= 1 on a larger fraction of pages.

Event Timeline

ssastry renamed this task from Analyze rt-testing data and identify pages whose Parsoid HTML size are "outside limits" to Analyze rt-testing data and identify pages whose Parsoid HTML sizes are "outside limits". May 24 2022, 2:28 AM
ssastry updated the task description.
Arlolra triaged this task as Medium priority. Jun 2 2022, 5:55 PM
Arlolra moved this task from Needs Triage to Performance on the Parsoid board.

TODO

  • Log snippets of what we strip to eyeball what's there
  • Figure out why the numbers of data-mw and typeof don't add up
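One way to start on the second item is to count, per element, which of the two attributes are present, so that elements carrying only one of them show up as separate buckets. The sketch below uses Python's standard `html.parser`; the class and function names are hypothetical, and it makes no claim about why the counts differ.

```python
from collections import Counter
from html.parser import HTMLParser


class AttrCounter(HTMLParser):
    """Counts elements by which of `typeof` / `data-mw` they carry."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names are lowercased.
        names = {name for name, _ in attrs}
        has_typeof = "typeof" in names
        has_data_mw = "data-mw" in names
        if has_typeof and has_data_mw:
            self.counts["both"] += 1
        elif has_typeof:
            self.counts["typeof_only"] += 1
        elif has_data_mw:
            self.counts["data_mw_only"] += 1


def count_attrs(html):
    parser = AttrCounter()
    parser.feed(html)
    parser.close()
    return parser.counts
```

Logging the tag name and `typeof` value for the `typeof_only` and `data_mw_only` buckets would be a natural next step.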