Test size of "Reading stripped" HTML vs non-stripped HTML
Closed, Resolved (Public)

Description

Test the output size of the new HTML endpoint:

  1. Without stripping unneeded tags and content
  2. After stripping unneeded tags and content

This is to verify that the optimizations being made by RI have a meaningful impact.

Two restrictions:

  1. Both sets of HTML should have all the new markup added by the HTML API
  2. The size should be after gzip compression

Event Timeline

Change 361466 had a related patch set uploaded (by BearND; owner: BearND):
[mediawiki/services/mobileapps@master] Add scripts to measure payloads

https://gerrit.wikimedia.org/r/361466

Here are some preliminary results (before stripping of references is implemented) with a small sample of test pages (some were taken from the most-read results from 6/22/2017):
If I don't strip any HTML and only add some of the markers and other changes needed, MCS read-html adds around 1.9% to the gzipped payload.
If stripping of unneeded tags is included, MCS read-html reduces the gzipped payload by 23% to 47% (avg. around 37%).

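Sizes are in bytes; % (gz) is read-html-gz as a percentage of parsoid-gz.

| title | parsoid | parsoid-gz | read-html (no stripping) | read-html-gz (no stripping) | % (gz) (no stripping) | read-html (with stripping) | read-html-gz (with stripping) | % (gz) (with stripping) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Barack_Obama | 1756357 | 296667 | 1951413 | 302902 | 102.10% | 836415 | 170284 | 57.40% |
| San_Francisco | 1129002 | 186815 | 1245967 | 190644 | 102.05% | 608535 | 120864 | 64.70% |
| Earth | 1009182 | 184601 | 1102470 | 187190 | 101.40% | 486411 | 98758 | 53.50% |
| Albert_Einstein | 718942 | 134236 | 807682 | 137983 | 102.79% | 398709 | 90833 | 67.67% |
| Wonder_Woman_(2017_film) | 622614 | 101675 | 702077 | 103736 | 102.03% | 275642 | 59207 | 58.23% |
| History_of_science | 578812 | 115748 | 628118 | 117743 | 101.72% | 363261 | 85091 | 73.51% |
| Oprah_Winfrey | 569838 | 105697 | 631820 | 107577 | 101.78% | 298886 | 69552 | 65.80% |
| 2017_FIFA_Confederations_Cup | 532768 | 58328 | 579347 | 59478 | 101.97% | 337750 | 38523 | 66.05% |
| Deaths_in_2017 | 415456 | 75298 | 446385 | 75825 | 100.70% | 256902 | 58443 | 77.62% |
| Kyoto | 371852 | 65191 | 413469 | 66840 | 102.53% | 233143 | 44709 | 68.58% |
| Transformers: The Last Knight | 369805 | 59445 | 405663 | 60390 | 101.59% | 186759 | 37381 | 62.88% |
| Daniel_Day-Lewis | 350701 | 58550 | 372764 | 59181 | 101.08% | 230637 | 42079 | 71.87% |
| Karen_Handel | 283557 | 48199 | 314738 | 49150 | 101.97% | 150275 | 32246 | 66.90% |
| Shooting_of_Philando_Castile | 266892 | 46624 | 299378 | 47577 | 102.04% | 132118 | 30112 | 64.58% |
| Travis_Kalanick | 198806 | 33048 | 227356 | 33753 | 102.13% | 76139 | 18069 | 54.68% |
| Gal_Gadot | 166602 | 30182 | 185120 | 30675 | 101.63% | 84820 | 18199 | 60.30% |
| Jon_Ossoff | 150303 | 26271 | 170960 | 26968 | 102.65% | 63610 | 14994 | 57.07% |
| Otto_Warmbier | 140019 | 24901 | 159111 | 25490 | 102.37% | 61983 | 14720 | 59.11% |
| Prodigy_(rapper) | 110267 | 23652 | 121802 | 24006 | 101.50% | 61401 | 16206 | 68.52% |
| Summer_solstice | 69071 | 13086 | 72811 | 13187 | 100.77% | 45073 | 9816 | 75.01% |
| Mobb_Deep | 66576 | 15210 | 71933 | 15385 | 101.15% | 40615 | 11060 | 72.72% |
| SUM | 9877422 | 1703424 | 10910384 | 1735680 | 101.89% | 5229084 | 1081146 | 63.47% |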
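
For readers who want to re-derive the % (gz) columns, here is a minimal sketch of the arithmetic, using the Barack_Obama sizes from the table. The helper name `pct_of` is made up for this illustration and is not part of the measurement scripts in the patch.

```shell
#!/bin/bash
# pct_of <part> <whole>: print <part> as a percentage of <whole>, two decimals.
# (Hypothetical helper, for illustration only.)
pct_of() {
    awk -v p="$1" -v w="$2" 'BEGIN { printf "%.2f\n", 100 * p / w }'
}

# Barack_Obama: read-html-gz relative to parsoid-gz.
pct_of 302902 296667   # no stripping transformations
pct_of 170284 296667   # with stripping transformations
```

The saving quoted in the prose is 100 minus the second figure.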

Very interesting. Seems in line with previous research 👍

The bulk of the space-saving transformations comes from https://github.com/wikimedia/mediawiki-services-mobileapps/blob/master/lib/transforms.js#L193-L268.

To add to this we want to

  • get the top 1000 page views of the month from the Pageviews API automatically and
  • run this against zhwiki, too.
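
A rough sketch of how the title list could be pulled automatically, assuming the public Wikimedia Pageviews REST API "top" endpoint with `all-days` for a monthly ranking. The function names are invented for this example, and the title extraction is deliberately naive: it assumes compact JSON and titles without escaped double quotes. The list will also contain non-article entries such as Main_Page, which would need filtering.

```shell
#!/bin/bash
# Extract article titles from a Pageviews "top" JSON response on stdin.
# Naive text matching; assumes compact JSON and no escaped quotes in titles.
extract_titles() {
    grep -o '"article":"[^"]*"' | sed -e 's/^"article":"//' -e 's/"$//'
}

# top_titles <project> <year> <month>: most-viewed titles for one month.
# Needs network access. Example projects: en.wikipedia, zh.wikipedia.
top_titles() {
    local project="$1" year="$2" month="$3"
    curl -s "https://wikimedia.org/api/rest_v1/metrics/pageviews/top/$project/all-access/$year/$month/all-days" \
        | extract_titles
}

# Usage (not run here):
#   top_titles en.wikipedia 2017 06 > titles-enwiki.txt
#   top_titles zh.wikipedia 2017 06 > titles-zhwiki.txt
```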

Here's a spreadsheet with the results against the top monthly titles in enwiki and zhwiki.

Here's a summary of the payload sizes I measured. The spreadsheet compares gzipped (gzip -6) payloads of plain Parsoid HTML against several MCS/PCS read-html variants: no stripping, stripping of unneeded markup only, stripping of reference lists only, and stripping of both.

enwiki:

  • no stripping: adds about 2% to the payload
  • stripping of unneeded markup: saves around 36%
  • stripping of reference lists: saves around 36%
  • stripping of both unneeded markup and reference lists: saves around 53%

The savings from stripping reflists vary more than the savings from stripping unneeded markup. Adding reflist stripping on top of stripping unneeded markup saves only about 17 percentage points of the original payload.

zhwiki:

  • no stripping: adds about 1.2% to the payload
  • stripping of unneeded markup: saves around 36%
  • stripping of reference lists: saves around 15%
  • stripping of both unneeded markup and reference lists: saves around 42%

Adding reflist stripping on top of stripping unneeded markup saves just under 7 percentage points of the original payload.

This leads to the question: is ref stripping really worth the hassle?

Screenshot of the averages (attachment: Screen Shot 2017-07-17 at 10.48.33 AM.png)

@bearND looking it over, it seems that stripping reflists and removing unneeded markup each give significant savings (both about 36% on en). But as you noted, the combined effect isn't as drastic as one might expect (about 53% on en). Having said that, an extra 16% savings isn't insignificant.

zh definitely seems to be less heavily referenced. This may be a cultural thing.

As far as next steps are concerned, I think we have enough data to proceed with the parsoid markup stripping in the PCS.

For the references, let's keep pushing on the structured data exploration and see how that turns out, then make decisions once we determine whether we can reliably parse that data.

Sound like a plan to you?

Yes, sounds like a plan. I was just a bit surprised to see that the savings don't get close to adding up. My explanation for this is that the transforms for stripping the unneeded markup heavily strip the reference list content. There's still one included which I'll have to take out: stripping of ref back links. For the Android app we remove them but for the web case I think we would want to preserve them.
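
To illustrate why the two savings overlap rather than add up, here is a toy sketch. The sample HTML is made up, both `sed` transforms are simplified stand-ins for the real MCS transforms, and it counts uncompressed bytes (not gzip sizes) so the arithmetic stays easy to follow. One of the two data-mw attributes sits inside the reference list, so both transforms remove it.

```shell
#!/bin/bash
# Hypothetical one-line sample: two data-mw attributes, one inside the reflist.
sample='<p data-mw="{a}">Text<sup>[1]</sup></p><ol class="references"><li data-mw="{b}">Ref</li></ol>'

# Simplified stand-ins for the two kinds of transforms.
strip_markup()  { sed -e 's/ data-mw="[^"]*"//g'; }
strip_reflist() { sed -e 's/<ol class="references">.*<\/ol>//'; }
size()          { wc -c | tr -d ' '; }

raw=$(printf '%s\n' "$sample" | size)
a=$(printf '%s\n' "$sample" | strip_markup | size)
b=$(printf '%s\n' "$sample" | strip_reflist | size)
both=$(printf '%s\n' "$sample" | strip_markup | strip_reflist | size)

echo "markup only saves:  $((raw - a)) bytes"
echo "reflist only saves: $((raw - b)) bytes"
echo "sum of the two:     $(( (raw - a) + (raw - b) )) bytes"
echo "both together save: $((raw - both)) bytes"
```

The gap between the sum and the combined saving is exactly the markup that lives inside the reference list.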

I looked a bit into the effect of stripping specific elements, focusing on those that aren't essential for editing. We plan to move data-mw out in any case (see T78676), so I started with that. All ratios are relative to the original Parsoid HTML, all using gzip -6.

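| title | data-mw | data-mw, ids | data-mw, ids, about | data-mw, ids, about, rel |
| --- | --- | --- | --- | --- |
| Barack_Obama | .84 | .74 | .71 | .70 |
| San_Francisco | .88 | .78 | .75 | .74 |
| Earth | .76 | .66 | .63 | .62 |
| Albert_Einstein | .88 | .79 | .76 | .75 |
| Wonder_Woman_(2017_film) | .87 | .75 | .71 | .70 |
| History_of_science | .92 | .82 | .80 | .79 |
| Oprah_Winfrey | .89 | .78 | .75 | .74 |
| 2017_FIFA_Confederations_Cup | .90 | .78 | .75 | .74 |
| Deaths_in_2017 | .95 | .82 | .77 | .75 |
| Kyoto | .85 | .78 | .76 | .74 |
| Transformers:_The_Last_Knight | .88 | .76 | .73 | .72 |
| Daniel_Day-Lewis | .91 | .81 | .79 | .77 |
| Karen_Handel | .89 | .80 | .77 | .76 |
| Shooting_of_Philando_Castile | .89 | .78 | .75 | .74 |
| Travis_Kalanick | .87 | .75 | .70 | .70 |
| Gal_Gadot | .87 | .74 | .71 | .70 |
| Jon_Ossoff | .87 | .74 | .70 | .70 |
| Otto_Warmbier | .84 | .73 | .68 | .68 |
| Prodigy_(rapper) | .87 | .75 | .71 | .71 |
| Summer_solstice | .94 | .84 | .82 | .81 |
| Mobb_Deep | .91 | .82 | .80 | .79 |
| avg | 0.88 | 0.77 | 0.74 | 0.73 |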

So I think it is quite clear that moving out data-mw & reducing the number of ids would already realize a significant portion of the gains you saw with the more aggressive removal. IDs are currently used to anchor metadata, but we could replace this with a lighter weight scheme using path selectors (nth-child etc).

About attributes are heavily used to associate references with their reflist entry. They are somewhat redundant with the href. Both about & href could be removed if we inlined reference content.

The vast majority of rel attributes are mw:WikiLink or mw:ExtLink. This information is already implicit in the link target, so it could be removed without loss of information or editing support. However, this does not make a huge difference, as these attributes mostly compress away.
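
A quick way to check the distribution of rel values on a given page is a simple histogram over the attribute text. This is only a sketch: `rel_histogram` is a name invented here, and the regular expression only matches double-quoted attributes.

```shell
#!/bin/bash
# Count how often each rel="..." value occurs in HTML read from stdin,
# most frequent first. Only double-quoted attributes are matched.
rel_histogram() {
    grep -o ' rel="[^"]*"' | sort | uniq -c | sort -rn
}

# Usage (needs network access, not run here):
#   curl -s "https://en.wikipedia.org/api/rest_v1/page/html/Earth" | rel_histogram | head
```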

Quick & dirty script used to generate the numbers above:

#!/bin/bash
# Prints one table row of gzip -6 size ratios (stripped / original Parsoid HTML)
# for a single enwiki title. Relies on GNU sed extensions (\+ and \|).

if [ -z "$1" ]; then
    echo "Usage: $0 <title>"
    exit 1
fi

# Note: despite the .html suffix, this file holds the gzipped response.
file="$1.html"
curl -s "https://en.wikipedia.org/api/rest_v1/page/html/$1" | gzip -6 > "$file"
size_raw=$(wc -c < "$file")

# Strip data-mw attributes (double- and single-quoted).
size_data_mw=$(zcat "$file" | sed -e 's/ data-mw="[^"]\+"//g' -e "s/ data-mw='[^']\+'//g" | gzip -6 | wc -c)
ratio_data_mw=$(bc <<< "scale=2; $size_data_mw / $size_raw")

# Additionally strip id attributes.
size_data_mw_ids=$(zcat "$file" | sed -e 's/ data-mw="[^"]\+"//g' -e "s/ data-mw='[^']\+'//g" -e 's/ id="[^"]\+"//g' | gzip -6 | wc -c)
ratio_data_mw_ids=$(bc <<< "scale=2; $size_data_mw_ids / $size_raw")

# Additionally strip about attributes.
size_data_mw_ids_about=$(zcat "$file" | sed -e 's/ data-mw="[^"]\+"//g' -e "s/ data-mw='[^']\+'//g" -e 's/ \(id\|about\)="[^"]\+"//g' | gzip -6 | wc -c)
ratio_data_mw_ids_about=$(bc <<< "scale=2; $size_data_mw_ids_about / $size_raw")

# Additionally strip rel attributes.
size_data_mw_ids_about_rel=$(zcat "$file" | sed -e 's/ data-mw="[^"]\+"//g' -e "s/ data-mw='[^']\+'//g" -e 's/ \(id\|about\|rel\)="[^"]\+"//g' | gzip -6 | wc -c)
ratio_data_mw_ids_about_rel=$(bc <<< "scale=2; $size_data_mw_ids_about_rel / $size_raw")

echo "|$1|$ratio_data_mw|$ratio_data_mw_ids|$ratio_data_mw_ids_about|$ratio_data_mw_ids_about_rel|"

I called this on a list of titles using

while read -r t; do ./checkSizes.sh "$t"; done < titles.txt
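
The avg row of the table can be reproduced from the script's output with a small awk filter. This is a sketch under the assumption that every row has the `|title|r1|r2|r3|r4|` shape printed by the script and that titles contain no `|` character; `column_averages` is a name chosen for this example.

```shell
#!/bin/bash
# Read rows of the form |title|r1|r2|r3|r4| on stdin and print an avg row.
# With '|' as separator, field 2 is the title and fields 3-6 are the ratios.
column_averages() {
    awk -F'|' '
        NF >= 6 { for (i = 3; i <= 6; i++) sum[i] += $i; n++ }
        END {
            if (n) printf "|avg|%.2f|%.2f|%.2f|%.2f|\n",
                          sum[3] / n, sum[4] / n, sum[5] / n, sum[6] / n
        }'
}

# Usage (not run here): save the rows first, then average them.
#   while read -r t; do ./checkSizes.sh "$t"; done < titles.txt > rows.txt
#   column_averages < rows.txt
```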

Thank you for this. For some of the pages I measured, this comes quite close to the more aggressive markup stripping MCS is doing (usually within 15 percentage points). I agree that the rel attributes don't make much difference.

Yes, sounds like a plan. I was just a bit surprised to see that the savings don't get close to adding up. My explanation for this is that the transforms for stripping the unneeded markup heavily strip the reference list content.

I noticed the same when doing research in the past, and I had the same hypothesis. When I tried to make the pie charts add up to 100%, I realized it doesn't work because some transforms overlap with others.

GWicke triaged this task as Medium priority. Aug 8 2017, 9:32 PM

Change 361466 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Add scripts to measure payloads

https://gerrit.wikimedia.org/r/361466

Mentioned in SAL (#wikimedia-operations) [2017-09-06T22:06:32Z] <bsitzmann@tin> Started deploy [mobileapps/deploy@507a479]: Update mobileapps to 2cb6281 (T168848 T169277 T169274 T162179 T164033 T167921 T174698 T168848 T174808)

Mentioned in SAL (#wikimedia-operations) [2017-09-06T22:11:25Z] <bsitzmann@tin> Finished deploy [mobileapps/deploy@507a479]: Update mobileapps to 2cb6281 (T168848 T169277 T169274 T162179 T164033 T167921 T174698 T168848 T174808) (duration: 04m 53s)