LOOT: Collect data on article composition
Closed, Resolved · Public · 8 Story Points

Description

Back during the offsite, Joaquin got some numbers around article composition, e.g. how much serving just the lead section reduces the HTML. What does this mean in terms of time to load?

It's important we get these recorded, at the very least for Barack Obama, but preferably for a wide variety of articles (I'd suggest taking the last 40 featured articles of the day, as these represent our best content).

Specifically:

  • What % of total HTML is reference HTML?
    • How does removing this impact first paint / first interactive / total load time?
  • What % of total HTML is lead section HTML?
    • How does removing this impact first paint / first interactive / total load time?
  • How does removing images impact first paint / first interactive / total load time?
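
For reference, the percentage questions above boil down to a byte-size ratio. A minimal sketch, assuming the reference or lead-section HTML has already been extracted as a string (the helper name `html_share` and the toy markup are illustrative only, not part of any existing script):

```python
def html_share(full_html: str, part_html: str) -> float:
    """Return the size of part_html as a percentage of full_html (UTF-8 bytes)."""
    total = len(full_html.encode("utf-8"))
    if total == 0:
        raise ValueError("full_html is empty")
    return 100.0 * len(part_html.encode("utf-8")) / total


# Hypothetical toy page: a lead paragraph, one body section, a reference list.
lead = "<p>Lead section.</p>"
refs = '<ol class="references"><li>Ref 1</li></ol>'
page = lead + "<h2>History</h2><p>Body text.</p>" + refs

print(f"lead: {html_share(page, lead):.1f}%")
print(f"references: {html_share(page, refs):.1f}%")
```

Measuring bytes rather than characters matters here because transfer size, not character count, is what affects load time.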

Note we have this data but as I understand it, it only relates to slim vs original and for the Barack Obama article. It would be useful to have similar data for other articles / less optimised Parsoid (one common question I get is why not just remove/downgrade quality of images for example).

I think this data is vital for the dev summit when having a discussion around defer loading content.
Feel free to divide and conquer this card.


Research notes in mediawiki.org: https://www.mediawiki.org/wiki/Reading/Web/Projects/A_frontend_powered_by_Parsoid/HTML_content_research

  • Define approach
  • Create endpoints (@phuedx)
  • Create data gathering script (@Jhernandez)
  • Gather data with initial list of articles (see appendix A for list)
  • Visualize data to gather knowledge and insights

Appendix A

A company https://en.m.wikipedia.org/wiki/Nike,_Inc.
A movie https://en.m.wikipedia.org/wiki/Star_Wars:_The_Force_Awakens
A person https://en.m.wikipedia.org/wiki/Barack_Obama
A TV show https://en.wikipedia.org/wiki/Doctor_Who
A science article https://en.m.wikipedia.org/wiki/Geastrum_quadrifidum
An event https://en.m.wikipedia.org/wiki/Syrian_Civil_War
A city https://en.m.wikipedia.org/wiki/Oakland
A stub https://en.wikipedia.org/wiki/Campus_Honeymoon
A country https://en.wikipedia.org/wiki/Brazil

Jdlrobson created this task. Dec 5 2015, 1:32 AM
Jdlrobson updated the task description.
Jdlrobson raised the priority of this task to Needs Triage.
Jdlrobson added a subscriber: Jdlrobson.
Restricted Application added subscribers: StudiesWorld, Aklapper. Dec 5 2015, 1:32 AM
Jdlrobson set Security to None. Dec 5 2015, 1:32 AM
Jdlrobson edited a custom field.

@Jdlrobson, I don't think featured articles represent a wide variety, as they will all be wordy, well referenced and use pictures. I would recommend including stubs, etc.

Jdlrobson assigned this task to Jhernandez.

Cleared this with Joaquin. We decided to allow the API to take parameters that change its behaviour and we're going to run some tests comparing against api.php to get to the bottom of an article's composition. He's gonna run some tests.

Adding @Tbayer for visibility and bi-directional consult as needed.

Jhernandez updated the task description. Dec 16 2015, 10:22 AM
Jhernandez added a subscriber: phuedx.
Jhernandez reassigned this task from Jhernandez to phuedx. Dec 16 2015, 10:25 AM
Jhernandez added a subscriber: Jhernandez.

Assigning to @phuedx for the time being, we're working in parallel on this one.

@Jdlrobson, @dr0ptp4kt, @JKatzWMF It would be great if you could come up with lists of pages to test. They can have different focuses (most visited, random, manually chosen). They don't need to be definitive lists; we can iterate on them. We'll set up the pipeline so that it's easy to re-run the analysis with a new set of pages.

Jdlrobson added a comment. Edited Dec 16 2015, 11:04 PM

My feeling here is that we should cover a wide variety of /types/ of article (unless @dr0ptp4kt, @JKatzWMF have better suggestions).
Here are some manual ones to start with:

  1. A company https://en.m.wikipedia.org/wiki/Nike,_Inc.
  2. A movie https://en.m.wikipedia.org/wiki/Star_Wars:_The_Force_Awakens
  3. A person https://en.m.wikipedia.org/wiki/Barack_Obama
  4. A TV show https://en.wikipedia.org/wiki/Doctor_Who
  5. A science article https://en.m.wikipedia.org/wiki/Geastrum_quadrifidum
  6. An event https://en.m.wikipedia.org/wiki/Syrian_Civil_War

I think there are 2 stories to tell here:

  1. Shifting from PHP Parser output to Parsoid (is Parsoid bigger/smaller? If so why?)
  2. The makeup of a Parsoid article - where does the weight for these articles come from? How much do they shrink by when we optimise them?

Right now it seems people are not sold that our HTML is a problem (they think images are the main problem) and this is what we should look to sell in the dev summit.

Thanks @Jhernandez for opening this up and @Jdlrobson for starting the list

A company https://en.m.wikipedia.org/wiki/Facebook

This is a confusing example because we say we improved the speed of facebook by x%. Let's just do Nike or some other brand.

A movie https://en.m.wikipedia.org/wiki/Star_Wars:_The_Force_Awakens
A person https://en.m.wikipedia.org/wiki/Barack_Obama
A TV show https://en.wikipedia.org/wiki/Doctor_Who
A science article https://en.m.wikipedia.org/wiki/Geastrum_quadrifidum
An event https://en.m.wikipedia.org/wiki/Syrian_Civil_War

I think we should add a place, since I believe there is some high % of pages that are places: https://en.m.wikipedia.org/wiki/Oakland

We still don't know what % of our total pageviews come from the top 10,000 (long articles) out of 5M. It could be 10%, it could be 80%. I just ran the numbers between top pages (http://top.hatnote.com/) and vitalsigns (https://vital-signs.wmflabs.org/#projects=enwiki/metrics=Pageviews) and found that the top 10 articles represent 0.8%, but with an obvious dropoff. So I recommend at least one other short-ass page.

Random gave me this: https://en.wikipedia.org/wiki/Campus_Honeymoon

Do we want/need to look at other languages?

Jdlrobson updated the task description. Dec 17 2015, 1:13 AM
Jdlrobson updated the task description. Dec 17 2015, 1:19 AM

Thanks @Jhernandez for opening this up and @Jdlrobson for starting the list

A company https://en.m.wikipedia.org/wiki/Facebook

This is a confusing example because we say we improved the speed of facebook by x%. Let's just do Nike or some other brand.

Sounds good. Looks like a similar size.

A movie https://en.m.wikipedia.org/wiki/Star_Wars:_The_Force_Awakens
A person https://en.m.wikipedia.org/wiki/Barack_Obama
A TV show https://en.wikipedia.org/wiki/Doctor_Who
A science article https://en.m.wikipedia.org/wiki/Geastrum_quadrifidum
An event https://en.m.wikipedia.org/wiki/Syrian_Civil_War

I think we should add a place, since I believe there is some high % of pages that are places: https://en.m.wikipedia.org/wiki/Oakland

Yup and also a country.

We still don't know what % of our total pageviews come from the top 10,000 (long articles) out of 5M. It could be 10%, it could be 80%. I just ran the numbers between top pages (http://top.hatnote.com/) and vitalsigns (https://vital-signs.wmflabs.org/#projects=enwiki/metrics=Pageviews) and found that the top 10 articles represent 0.8%, but with an obvious dropoff. So I recommend at least one other short-ass page.

Random gave me this: https://en.wikipedia.org/wiki/Campus_Honeymoon

Good call.

Do we want/need to look at other languages?

No. HTML size would be similar for all these types of articles, plus the app doesn't support other languages right now. Have added an appendix A to the description.

There's a little tidy up to do on the new benchmark routes but @Jhernandez has started gathering data.

Here's a version of the html report with the list of pages on the description.

http://chimeces.com/loot-content-analysis/

I'll explain better later what the endpoints mean and why there are 2 pie charts.

It's pretty enlightening.

phuedx reassigned this task from phuedx to Jhernandez. Dec 18 2015, 3:00 PM

Cool!

I will pull together a query to identify a broader representative sampled set of candidate articles based on examination of the pageviews-per-article curve.

I've updated it with a description and a glossary of terms.

http://chimeces.com/loot-content-analysis/

We should talk about this and next steps.

phuedx updated the task description. Dec 21 2015, 9:28 AM

FYI https://en.wikipedia.org/wiki/List_of_law_clerks_of_the_Supreme_Court_of_the_United_States is the longest article on Wikipedia according to https://en.wikipedia.org/w/index.php?title=Special:LongPages&redirect=no (in fact lots are list pages)

Might be interesting to cover this just because there are not so many images and it makes a case for staggered loading.

Jdlrobson added a comment. Edited Dec 22 2015, 11:10 PM

Seems we have all we need now. I've set up http://reading-web-transforms.wmflabs.org/benchmarks/loot/Barack Obama?transforms=noimages and run some tests (see https://phabricator.wikimedia.org/T119797#1899560)! Thanks Joaquin!
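
For anyone scripting against that endpoint, a minimal sketch of building the URL with the title percent-encoded. The base URL and the `noimages` value come from the link above; the helper name `benchmark_url` is hypothetical, and any other transform names would be assumptions:

```python
from urllib.parse import quote, urlencode

# Base path taken from the link in the comment above.
BASE = "http://reading-web-transforms.wmflabs.org/benchmarks/loot/"


def benchmark_url(title, transform=None):
    """Build a benchmark URL for `title`, optionally with one transform."""
    url = BASE + quote(title, safe="")  # encode spaces, slashes, etc.
    if transform:
        url += "?" + urlencode({"transforms": transform})
    return url


print(benchmark_url("Barack Obama", "noimages"))
# http://reading-web-transforms.wmflabs.org/benchmarks/loot/Barack%20Obama?transforms=noimages
```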

Jdlrobson closed this task as Resolved. Dec 22 2015, 11:10 PM
dr0ptp4kt added a comment. Edited Dec 23 2015, 5:45 AM

@Jhernandez, @Jdlrobson, @phuedx (CC @JKatzWMF) - candidate articles at the bottom of this comment.

The basic approach was to find the top pages on English mobile web Wikipedia that constitute about 60% of pageviews, then sample about 30 pages from that, in addition to just getting the top 10 pages or so. This was done for the month of November 2015.

This is a fairly simple and crude way to do it (e.g., it doesn't take into account some of the nuance around the page_id field and, as you'll see later, Special:MobileMenu and probably some other possibly not-strictly-content is included), but should suffice for our purposes.

Top 0.5% articles

from (
select page_title, sum(view_count) ct
from wmf.pageview_hourly
where year = 2015 and month = 11
and access_method = 'mobile web'
and agent_type = 'user'
and project = 'en.wikipedia'
and page_title <> '-' and page_title <> 'Special:Search'
group by page_title) t
select percentile(cast(t.ct as BIGINT), 0.995);

result: 6944.50500000082

Pageviews for pages with counts in the top 0.5% articles

from (
select page_title, sum(view_count) ct
from wmf.pageview_hourly
where year = 2015 and month = 11
and access_method = 'mobile web'
and agent_type = 'user'
and project = 'en.wikipedia'
and page_title <> '-' and page_title <> 'Special:Search'
group by page_title) t
select sum(t.ct) where t.ct > 6944;

result: 1855245871

Sum of all pageviews

select sum(view_count)
from wmf.pageview_hourly
where year = 2015 and month = 11
and access_method = 'mobile web'
and agent_type = 'user'
and project = 'en.wikipedia'
and page_title <> '-' and page_title <> 'Special:Search';

result: 3090416456

In other words, the stuff at the 99.5th percentile and above accounts for about 60% of total pageviews.
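
The 60% figure can be checked directly from the two query results above. A small sketch (the variable names are mine):

```python
top_views = 1_855_245_871  # pageviews for titles with more than 6944 views
all_views = 3_090_416_456  # all matching mobile web pageviews, Nov 2015

share = top_views / all_views
print(f"{share:.1%}")  # 60.0%
```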

How many rows are there of those top 0.5%? We'll want this number to define buckets for later Hive sampling.

from (
select page_title, sum(view_count) ct
from wmf.pageview_hourly
where year = 2015 and month = 11
and access_method = 'mobile web'
and agent_type = 'user'
and project = 'en.wikipedia'
and page_title <> '-' and page_title <> 'Special:Search'
group by page_title) t
select count(1) where t.ct > 6944;

result: 68872

Let's figure out our bucket size:

68872 / 30 = 2295.7(3-repetend)
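
A quick check of the integer part of that division, which is the number that matters for the bucket count (a sketch; the variable names are mine):

```python
rows = 68872        # titles above the 99.5th percentile
target_sample = 30  # roughly how many pages we want to sample

buckets, remainder = divmod(rows, target_sample)
print(buckets, remainder)  # 2295 22
```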

Create a table with 2,295 buckets, so that we can sample from it

-- create database, use database

create external table toppages(`title` string, `ct` bigint)
clustered by(title) into 2295 buckets
location '<path>';

insert overwrite table toppages
select page_title, ct from (
select page_title, sum(view_count) ct
from wmf.pageview_hourly
where year = 2015 and month = 11
and access_method = 'mobile web'
and agent_type = 'user'
and project = 'en.wikipedia'
and page_title <> '-' and page_title <> 'Special:Search'
group by page_title) t
where t.ct > 6944 order by ct desc limit 1000000;

Get the top 11

select * from toppages order by ct desc limit 11;

Main_Page	205688614
Lady_Colin_Campbell	3568743
Islamic_State_of_Iraq_and_the_Levant	2500377
Special:MobileMenu	2226962
Jessica_Jones	2202863
Spectre_(2015_film)	2066743
Adele	1924355
Prem_Ratan_Dhan_Payo	1852385
Ronda_Rousey	1554774
Jessica_Jones_(TV_series)	1518772
November_2015_Paris_attacks	1406227

This has Special:MobileMenu, which is probably being requested by low/no-JS UAs. This may be an interesting thing to think about - for example if Home and Random are immediately available to low/no-JS UAs, how much more likely are people to have the opportunity of discovery (and at what relative bandwidth addition)? For the purposes of page loading aspects in this analysis, it may not be that interesting, though.

And now for a semirandom sampling across the buckets.

select * from toppages tablesample (bucket 1 out of 2295 on rand());

Mandana_Karimi	219359
List_of_The_Big_Bang_Theory_characters	41128
Gran_Torino	37829
Coywolf	36999
Color_blindness	34057
Blackface	30434
Piaget's_theory_of_cognitive_development	29705
Prateik_Babbar	25765
Squid	25013
Robert_Trujillo	18642
Tuppence_Middleton	17606
Canada_men's_national_soccer_team	14988
Kevin_Keegan	13659
Paul_Schneider_(actor)	13287
Bran_Castle	12501
Ja'net_Dubois	11549
Snellen_chart	11349
Hawker_Hurricane	9056
A_Horse_with_No_Name	8960
Biltong	8517
Marinara_sauce	7933
IJustine	7202
Devdas	7168
Big_Brother_(TV_series)	7040
Situation,_Task,_Action,_Result	7003
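
As I understand Hive's sampling, `tablesample (bucket 1 out of 2295 on rand())` assigns every row to a random bucket and keeps one bucket, so each row is kept with probability 1/2295 and the sample size varies around 68872 / 2295, i.e. about 30. A rough Python analogue (a sketch of the idea, not of Hive's implementation; `sample_one_bucket` and the toy rows are hypothetical):

```python
import random


def sample_one_bucket(rows, n_buckets, seed=None):
    """Keep the rows that land in bucket 0 of n_buckets random buckets."""
    rng = random.Random(seed)
    # Each row independently draws a bucket; only bucket 0 is kept.
    return [row for row in rows if rng.randrange(n_buckets) == 0]


# Toy stand-in for the toppages table: (title, view count) pairs.
toy_rows = [(f"Page_{i}", 10_000 - i) for i in range(3000)]
sample = sample_one_bucket(toy_rows, n_buckets=100, seed=2015)
print(len(sample), "rows sampled out of", len(toy_rows))
```

Because each row is kept independently, the sample size is not fixed, which is consistent with the query returning a number of rows close to, but not exactly, 30.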

For the curious, here's what was roughly observed for different percentiles:

10% of articles comprise about 95% of pageviews
5% of articles comprise about 90% of pageviews
2% of articles comprise about 80% of pageviews
1% of articles comprise about 70% of pageviews
0.5% of articles comprise about 60% of pageviews
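
For anyone wanting to reproduce this kind of table outside Hive, a minimal sketch of the calculation, assuming the per-title view counts have been exported to a list (the function name `top_share` and the toy counts are hypothetical, not the real data):

```python
import math


def top_share(view_counts, fraction):
    """Share of all views that goes to the top `fraction` of pages by views."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(view_counts, reverse=True)
    k = max(1, math.ceil(len(ranked) * fraction))  # number of top pages
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


# Toy long-tail distribution, not the real pageview data.
toy_counts = [1000, 200, 100, 50, 50, 25, 25, 25, 15, 10]
for fraction in (0.5, 0.2, 0.1):
    print(f"top {fraction:.0%} of pages: {top_share(toy_counts, fraction):.0%} of views")
```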

Nice data @dr0ptp4kt. I can run and expose other reports if it would be helpful under a different url.

@phuedx and I added extraneous markup to the report and sent an email. See http://chimeces.com/loot-content-analysis/ (additional ~12% of crap we should get rid of)

@Jhernandez - yes, it would be cool if you can scriptomatically run reports on these others.

For the purpose of the summit, if it's too much work to scriptomatically cover them all and you have to go manual, I would say grab the top two articles (from the top 11), then also find a medium and a small article out of the distributed sample (that is, out of the second list of 29 articles) and run reports on those. I think we "know" what the absolute outcome is in the extreme case of Loot hyperoptimization, but it's a matter of stacking that absolute outcome against the absolute outcome of the non-optimized experience on a broader corpus.

@dr0ptp4kt this is awesome and thanks for publishing the query and description to make it so easily replicated. I have been curious about this for a long time (and now even more curious to see how it differs on desktop v. apps).