Test size of "Reading stripped" HTML vs non-stripped HTML
Closed, Resolved | Public

Description

Test the output of the new HTML endpoint:

Without stripping unneeded tags and content
After stripping unneeded tags and content

This is to verify that the optimizations being made by RI have a meaningful impact.

Two restrictions:

Both sets of HTML should have all the new markup added by the HTML API
The size should be after gzip compression
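
As a minimal sketch of the second restriction, the helper below reports the gzip-compressed size of whatever is piped into it. The function name `gz_size` and the file names in the usage comment are hypothetical, and level `-6` is chosen only because it is the gzip default and the level used later in this task.

```shell
#!/bin/bash
# Print the gzip-compressed size, in bytes, of whatever arrives on stdin.
# -6 is gzip's default level; it is spelled out to make the setting explicit.
gz_size() {
    gzip -6 -c | wc -c | awk '{print $1}'
}

# Hypothetical usage, assuming both variants were saved to local files:
#   gz_size < parsoid.html
#   gz_size < read-html.html
```

Comparing the two numbers for the same title gives the per-page ratio; comparing uncompressed sizes instead would overstate the effect of highly repetitive markup.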

Details

	Subject	Repo	Branch	Lines +/-
	Add scripts to measure payloads	mediawiki/services/mobileapps	master	+2 K -0


Related Objects

Status	Assigned	Task
Resolved	None	T169242 Develop Page Content Service for Reading Clients
Resolved	• Jhernandez	T177424 Develop Compatibility Layer of PCS
Resolved	• bearND	T162179 Extract HTML Compatibility Layer from MCS Mobile Sections API
Resolved	• bearND	T164033 Test size of "Reading stripped" HTML vs non-stripped HTML

Event Timeline

• Fjalapeno created this task. Apr 27 2017, 8:30 PM

• Fjalapeno edited projects, added Mobile-Content-Service; removed Mobile-Content-Service (Kanban). Apr 28 2017, 8:02 PM

• bearND moved this task from To Do to Doing on the Product-Infrastructure-Team-Backlog-Deprecated (Kanban) board. Jun 20 2017, 5:33 PM

Change 361466 had a related patch set uploaded (by BearND; owner: BearND):
[mediawiki/services/mobileapps@master] Add scripts to measure payloads

https://gerrit.wikimedia.org/r/361466

gerritbot added a project: Patch-For-Review. Jun 26 2017, 3:13 PM

• bearND moved this task from Doing to Code Review on the Product-Infrastructure-Team-Backlog-Deprecated (Kanban) board. Jun 26 2017, 3:13 PM

Here are some preliminary results (before stripping of references is implemented) with a small sample of test pages (some were taken from the most-read results from 6/22/2017):
If I don't strip any HTML and just add some of the markers and other needed changes, MCS read-html adds around 1.9% to the gzipped payload.
If stripping of unneeded tags is included, MCS read-html reduces the gzipped payload by 23% to 47% (avg. around 37%).

	Parsoid		No stripping transformations			With stripping transformations
title	parsoid	parsoid-gz	read-html	read-html-gz	% (gz)	read-html	read-html-gz	% (gz)
Barack_Obama	1756357	296667	1951413	302902	102.10%	836415	170284	57.40%
San_Francisco	1129002	186815	1245967	190644	102.05%	608535	120864	64.70%
Earth	1009182	184601	1102470	187190	101.40%	486411	98758	53.50%
Albert_Einstein	718942	134236	807682	137983	102.79%	398709	90833	67.67%
Wonder_Woman_(2017_film)	622614	101675	702077	103736	102.03%	275642	59207	58.23%
History_of_science	578812	115748	628118	117743	101.72%	363261	85091	73.51%
Oprah_Winfrey	569838	105697	631820	107577	101.78%	298886	69552	65.80%
2017_FIFA_Confederations_Cup	532768	58328	579347	59478	101.97%	337750	38523	66.05%
Deaths_in_2017	415456	75298	446385	75825	100.70%	256902	58443	77.62%
Kyoto	371852	65191	413469	66840	102.53%	233143	44709	68.58%
Transformers:_The_Last_Knight	369805	59445	405663	60390	101.59%	186759	37381	62.88%
Daniel_Day-Lewis	350701	58550	372764	59181	101.08%	230637	42079	71.87%
Karen_Handel	283557	48199	314738	49150	101.97%	150275	32246	66.90%
Shooting_of_Philando_Castile	266892	46624	299378	47577	102.04%	132118	30112	64.58%
Travis_Kalanick	198806	33048	227356	33753	102.13%	76139	18069	54.68%
Gal_Gadot	166602	30182	185120	30675	101.63%	84820	18199	60.30%
Jon_Ossoff	150303	26271	170960	26968	102.65%	63610	14994	57.07%
Otto_Warmbier	140019	24901	159111	25490	102.37%	61983	14720	59.11%
Prodigy_(rapper)	110267	23652	121802	24006	101.50%	61401	16206	68.52%
Summer_solstice	69071	13086	72811	13187	100.77%	45073	9816	75.01%
Mobb_Deep	66576	15210	71933	15385	101.15%	40615	11060	72.72%
SUM	9877422	1703424	10910384	1735680	101.89%	5229084	1081146	63.47%
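
For readers who want to check the "% (gz)" columns, here is a small sketch of the arithmetic: the gzipped read-html size divided by the gzipped Parsoid size. The helper name `pct` is hypothetical; the input numbers are copied from the table above.

```shell
#!/bin/bash
# Print part/whole as a percentage with two decimals.
pct() {
    awk -v part="$1" -v whole="$2" 'BEGIN { printf "%.2f%%\n", 100 * part / whole }'
}

pct 302902 296667     # Barack_Obama, no stripping   -> 102.10%
pct 170284 296667     # Barack_Obama, with stripping -> 57.40%
pct 1081146 1703424   # SUM row, with stripping      -> 63.47%
```

The SUM row's 63.47% corresponds to the roughly 37% average reduction quoted above (100% minus 63.47%).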

Very interesting. Seems in line with previous research 👍

• Fjalapeno added a project: Page Content Service. Jul 11 2017, 4:18 PM

The bulk of the space-saving transformations comes from https://github.com/wikimedia/mediawiki-services-mobileapps/blob/master/lib/transforms.js#L193-L268.

To add to this we want to

get the top 1000 page views of the month from the Pageviews API automatically and
run this against zhwiki, too.
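
A rough sketch of how the title list could be fetched is below. It assumes the public Wikimedia Pageviews REST endpoint for monthly top articles and that `jq` is installed. The function names `top_url` and `top_titles` are hypothetical and are not taken from the merged scripts.

```shell
#!/bin/bash
# Build the Pageviews "top" URL for a project and month.
# Assumption: "all-days" as the day component selects the monthly aggregate.
top_url() {
    local project="$1" year="$2" month="$3"
    echo "https://wikimedia.org/api/rest_v1/metrics/pageviews/top/${project}/all-access/${year}/${month}/all-days"
}

# Print one title per line. Assumption: the response nests titles under
# items[0].articles[].article.
top_titles() {
    curl -s "$(top_url "$1" "$2" "$3")" | jq -r '.items[0].articles[].article'
}

# Hypothetical usage:
#   top_titles en.wikipedia 2017 06 > titles-en.txt
#   top_titles zh.wikipedia 2017 06 > titles-zh.txt
```

The returned list may include non-article entries such as the main page or special pages, so some filtering is probably needed before measuring.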

Here's a spreadsheet with the results against the top monthly titles in enwiki and zhwiki.

Here's a summary of the payload sizes I measured. The spreadsheet compares the gzipped (-6) payloads: plain Parsoid against various MCS/PCS read-html variants: no stripping, just stripping of unneeded markup, just stripping of reference lists, stripping both.

enwiki:

no stripping: adds about 2% to the payload
stripping of unneeded markup: saves around 36%
stripping of reference lists: saves around 36%
stripping of both unneeded markup and reference lists: saves around 53%

The savings from stripping reflists vary more than the savings from stripping unneeded markup. Adding reflist stripping on top of stripping unneeded markup saves only about 17 percentage points of the original payload.

zhwiki:

no stripping: adds about 1.2% to the payload
stripping of unneeded markup: saves around 36%
stripping of reference lists: saves around 15%
stripping of both unneeded markup and reference lists: saves around 42%

Adding reflist stripping on top of stripping unneeded markup saves only almost 7 percentage points of the original payload.

This leads to the question: is ref stripping really worth the hassle?

Screen shot of the averages…

@bearND looking over it, it seems that stripping reflists and removing unneeded markup each give significant savings (both about 36% on en). But as you noted, the combined effect isn't as drastic as you might have expected (about 53% on en). Having said that, an extra 16% savings isn't insignificant.

zh seems to be not as heavily referenced for sure. This may be a cultural thing.

As far as next steps are concerned, I think we have enough data to proceed with the parsoid markup stripping in the PCS.

For the references, let's keep pushing on the structured data exploration and see how that turns out, and then make decisions after we determine if we can reliably parse that data.

Sound like a plan to you?

Yes, sounds like a plan. I was just a bit surprised to see that the savings don't get close to adding up. My explanation for this is that the transforms for stripping the unneeded markup heavily strip the reference list content. There's still one included which I'll have to take out: stripping of ref back links. For the Android app we remove them but for the web case I think we would want to preserve them.
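
The overlap can be illustrated on a toy page. The sketch below is entirely synthetic: the two-line sample and the two `sed` transforms are stand-ins for the real MCS transforms. It also measures raw bytes rather than gzipped bytes so that the arithmetic is exact.

```shell
#!/bin/bash
# Synthetic page: one paragraph plus a reference list, both carrying data-mw.
sample_page() {
    cat <<'EOF'
<p id="a" data-mw='{"x":1}'>Text<sup>[1]</sup></p>
<ol class="references"><li id="r1" data-mw='{"y":2}'>Ref one</li></ol>
EOF
}

# Stand-in transforms (assumptions, not the real MCS code).
strip_markup()  { sed -e "s/ data-mw='[^']*'//g"; }   # drops every data-mw attribute
strip_reflist() { sed -e '/class="references"/d'; }   # drops the reference list line
size()          { wc -c | awk '{print $1}'; }

orig=$(sample_page | size)
save_markup=$(( orig - $(sample_page | strip_markup | size) ))
save_reflist=$(( orig - $(sample_page | strip_reflist | size) ))
save_both=$(( orig - $(sample_page | strip_markup | strip_reflist | size) ))

echo "markup=$save_markup reflist=$save_reflist both=$save_both sum=$(( save_markup + save_reflist ))"
```

The combined saving comes out smaller than the sum of the individual savings because the data-mw attribute inside the reference list is counted by both transforms when each is measured alone.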

phuedx subscribed. Jul 18 2017, 5:09 PM

I looked a bit into the effect of stripping specific elements, focusing on those that aren't essential for editing. We plan to move data-mw out in any case (see T78676), so I started with that. All ratios are relative to the original Parsoid HTML, all using gzip -6.

title	data-mw	data-mw, ids	data-mw, ids, about	data-mw, ids, about, rel
Barack_Obama	.84	.74	.71	.70
San_Francisco	.88	.78	.75	.74
Earth	.76	.66	.63	.62
Albert_Einstein	.88	.79	.76	.75
Wonder_Woman_(2017_film)	.87	.75	.71	.70
History_of_science	.92	.82	.80	.79
Oprah_Winfrey	.89	.78	.75	.74
2017_FIFA_Confederations_Cup	.90	.78	.75	.74
Deaths_in_2017	.95	.82	.77	.75
Kyoto	.85	.78	.76	.74
Transformers:_The_Last_Knight	.88	.76	.73	.72
Daniel_Day-Lewis	.91	.81	.79	.77
Karen_Handel	.89	.80	.77	.76
Shooting_of_Philando_Castile	.89	.78	.75	.74
Travis_Kalanick	.87	.75	.70	.70
Gal_Gadot	.87	.74	.71	.70
Jon_Ossoff	.87	.74	.70	.70
Otto_Warmbier	.84	.73	.68	.68
Prodigy_(rapper)	.87	.75	.71	.71
Summer_solstice	.94	.84	.82	.81
Mobb_Deep	.91	.82	.80	.79
avg	0.88	0.77	0.74	0.73

So I think it is quite clear that moving out data-mw & reducing the number of ids would already realize a significant portion of the gains you saw with the more aggressive removal. IDs are currently used to anchor metadata, but we could replace this with a lighter weight scheme using path selectors (nth-child etc).

About attributes are heavily used to associate references with their reflist entry. They are somewhat redundant with the href. Both about & href could be removed if we inlined reference content.

The vast majority of rel attributes are mw:Wikilink or Extlink. This information is already implicit in the link target, so could just be removed without loss of information or editing support. However, this does not make a huge difference as they compress away mostly.
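
One way to check the distribution of rel values on a given page is a quick histogram. This is only a sketch: the function name `rel_histogram` is hypothetical, and the exact value strings in real Parsoid output may differ in capitalization from the ones named above.

```shell
#!/bin/bash
# Count each distinct rel="..." attribute on stdin, most frequent first.
rel_histogram() {
    grep -o ' rel="[^"]*"' | sort | uniq -c | sort -rn
}

# Hypothetical usage against the REST API used elsewhere in this task:
#   curl -s "https://en.wikipedia.org/api/rest_v1/page/html/Earth" | rel_histogram | head
```

If a handful of values dominate the list, that matches the observation that these attributes compress away almost entirely.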

Quick & dirty script used to generate the numbers above:

#!/bin/bash
# Prints, for one page title, the gzip -6 size ratios relative to the original
# Parsoid HTML after cumulatively stripping data-mw, id, about, and rel attributes.
# The \+ and \| operators in the sed patterns are GNU sed extensions.

if [ -z "$1" ]; then
    echo "Usage: $0 <title>" >&2
    exit 1
fi

# Despite the .html name, this file holds the gzipped response body.
file="$1.html"
curl -s "https://en.wikipedia.org/api/rest_v1/page/html/$1" | gzip -6 > "$file"
size_raw=$(wc -c < "$file")

# Strip data-mw (double- or single-quoted values).
size_data_mw=$(zcat "$file" | sed -e 's/ data-mw="[^"]\+"//g' -e "s/ data-mw='[^']\+'//g" | gzip -6 | wc -c)
ratio_data_mw=$(bc <<< "scale=2; $size_data_mw / $size_raw")

# Additionally strip id attributes.
size_data_mw_ids=$(zcat "$file" | sed -e 's/ data-mw="[^"]\+"//g' -e "s/ data-mw='[^']\+'//g" -e 's/ id="[^"]\+"//g' | gzip -6 | wc -c)
ratio_data_mw_ids=$(bc <<< "scale=2; $size_data_mw_ids / $size_raw")

# Additionally strip about attributes.
size_data_mw_ids_about=$(zcat "$file" | sed -e 's/ data-mw="[^"]\+"//g' -e "s/ data-mw='[^']\+'//g" -e 's/ \(id\|about\)="[^"]\+"//g' | gzip -6 | wc -c)
ratio_data_mw_ids_about=$(bc <<< "scale=2; $size_data_mw_ids_about / $size_raw")

# Additionally strip rel attributes.
size_data_mw_ids_about_rel=$(zcat "$file" | sed -e 's/ data-mw="[^"]\+"//g' -e "s/ data-mw='[^']\+'//g" -e 's/ \(id\|about\|rel\)="[^"]\+"//g' | gzip -6 | wc -c)
ratio_data_mw_ids_about_rel=$(bc <<< "scale=2; $size_data_mw_ids_about_rel / $size_raw")

# One table row per title.
echo "|$1|$ratio_data_mw|$ratio_data_mw_ids|$ratio_data_mw_ids_about|$ratio_data_mw_ids_about_rel|"

I called this on a list of titles using

while IFS= read -r t; do ./checkSizes.sh "$t"; done < titles.txt

• GWicke mentioned this in T78676: Store & load data-mw separately. Jul 27 2017, 8:24 PM

Thank you for this. For some of the pages I measured, this looks quite close to the more aggressive markup stripping MCS is doing (usually less than 15 percentage points apart). I agree that the rel attributes don't make much difference.

In T164033#3444391, @bearND wrote:

Yes, sounds like a plan. I was just a bit surprised to see that the savings don't get close to adding up. My explanation for this is that the transforms for stripping the unneeded markup heavily strip the reference list content.

I noticed the same when doing research in the past, and my hypothesis was the same. When I tried to make the pie charts add up to 100%, it didn't work because some transforms overlap with others.

• GWicke triaged this task as Medium priority. Aug 8 2017, 9:32 PM

Change 361466 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Add scripts to measure payloads

https://gerrit.wikimedia.org/r/361466

Mentioned in SAL (#wikimedia-releng) [2017-09-06T21:32:14Z] <bearND> Update mobileapps to 2cb6281 (T168848 T169277 T169274 T162179 T164033 T167921 T174698 T168848 T174808)

Stashbot mentioned this in T167921: Support Lazy loading of page content not needed for first paint. Sep 6 2017, 9:32 PM

Stashbot mentioned this in T174698: Parenthetical stripping is too aggressive.

Stashbot mentioned this in T174808: Add swagger spec for summary endpoint.

Mentioned in SAL (#wikimedia-operations) [2017-09-06T22:06:32Z] <bsitzmann@tin> Started deploy [mobileapps/deploy@507a479]: Update mobileapps to 2cb6281 (T168848 T169277 T169274 T162179 T164033 T167921 T174698 T168848 T174808)

Mentioned in SAL (#wikimedia-operations) [2017-09-06T22:11:25Z] <bsitzmann@tin> Finished deploy [mobileapps/deploy@507a479]: Update mobileapps to 2cb6281 (T168848 T169277 T169274 T162179 T164033 T167921 T174698 T168848 T174808) (duration: 04m 53s)

• bearND moved this task from Code Review to Sign off on the Product-Infrastructure-Team-Backlog-Deprecated (Kanban) board. Sep 7 2017, 4:38 AM

• bearND closed this task as Resolved. Sep 12 2017, 10:07 PM

	F8788700: Screen Shot 2017-07-17 at 10.48.33 AM.png
	Jul 17 2017, 2:56 PM
