Strip certain HTML markup from the DOM e.g. navboxes
Closed, DeclinedPublic

Description

A while back we introduced display: none [1] to hide content in the DOM rather than stripping it from the HTML output.
We of course never measured the impact of this.
I'd like to propose we explore this avenue again in a controlled manner.
Given the navbox element can account for around 12% [1] this could be causing issues.

Hypothesis: Reducing the unnecessary HTML we send down the wire has more benefits than the extra cost on parsing and stripping newly saved content.
To measure:

  • When a cookie is set / query string parameter is passed / for a given page strip any selector in this selector from the DOM before sending. Run tests on different connection speeds and report the difference in time to first paint under these two conditions.

There is evidence here: https://phabricator.wikimedia.org/T119797#1899560

In an ideal world we would want to allow lazy loading of stripped elements e.g. leave a link in their place to Special:MobileFragment/<page title>/<sha id>

[1] https://github.com/wikimedia/mediawiki-extensions-MobileFrontend/blob/master/resources/skins.minerva.content.styles/hacks.less#L17

Jdlrobson updated the task description. (Show Details)
Jdlrobson raised the priority of this task from to High.
Jdlrobson added subscribers: AndyRussG, Jhernandez, Gilles and 13 others.
Jdlrobson renamed this task from Use MobileFormatter to strip certain HTML markup from the DOM to Strip certain HTML markup from the DOM.Sep 23 2015, 6:48 PM

Joaquin, It seems the work you are doing at the moment is basically this. Adding you in case it helps you (feel free to add to sprint if necessary)

Volker_E updated the task description. (Show Details)Nov 4 2015, 12:19 AM
Jdlrobson moved this task from Backlog to Tasks on the MobileFrontend board.Nov 4 2015, 5:21 PM
Jdlrobson updated the task description. (Show Details)Dec 22 2015, 11:11 PM
Jdlrobson removed Jhernandez as the assignee of this task.
Jdlrobson renamed this task from Strip certain HTML markup from the DOM to Strip certain HTML markup from the DOM e.g. navboxes.Jan 11 2016, 9:28 PM
Jdlrobson updated the task description. (Show Details)
Jdlrobson updated the task description. (Show Details)Jan 12 2016, 12:07 AM

I created a simple script to measure the impact of making use of wgMFRemovableClasses
https://github.com/jdlrobson/mediawiki-extensions-MobileFrontend/commit/802dae33c97d956b0018a5b9df1b9e2897e231ce
I ran this in PHP 5.5.27 so in HHVM it might perform a lot better.

I created some static files, and then tried the formatter with different values of wgMFRemovableClasses
Results are given in seconds. As can be seen in the extreme cases 0.3s (120% increase in time taken) could be added to the processing time of an article when we remove bits of HTML from the parser.

As the number of selectors increases time taken also increases... eep.

It also seems we do not cache the result of MobileFormatter (which explains why pages with images disabled load slower).

Summary: With better caching use of MobileFormatter may be acceptable provided the number of selectors is minimal.

Results:
Elapsed: samples/blank.html #noremovals 0.000005
Elapsed: samples/blank.html #nonavbox 0.000056
Elapsed: samples/blank.html #nonavboxnoreferences 0.000068

Elapsed: samples/barack.html #noremovals 0.000011
Elapsed: samples/barack.html #nonavbox 0.135939
Elapsed: samples/barack.html #nonavboxnoreferences 0.196194

Elapsed: samples/nike.html #noremovals 0.000230
Elapsed: samples/nike.html #nonavbox 0.023470
Elapsed: samples/nike.html #nonavboxnoreferences 0.027838

Elapsed: samples/starwars.html #noremovals 0.000058
Elapsed: samples/starwars.html #nonavbox 0.083880
Elapsed: samples/starwars.html #nonavboxnoreferences 0.096470

Elapsed: samples/drwho.html #noremovals 0.000142
Elapsed: samples/drwho.html #nonavbox 0.055787
Elapsed: samples/drwho.html #nonavboxnoreferences 0.067384

Elapsed: samples/geastrum.html #noremovals 0.000102
Elapsed: samples/geastrum.html #nonavbox 0.007934
Elapsed: samples/geastrum.html #nonavboxnoreferences 0.011701

Elapsed: samples/oakland.html #noremovals 0.000021
Elapsed: samples/oakland.html #nonavbox 0.061142
Elapsed: samples/oakland.html #nonavboxnoreferences 0.092299

Elapsed: samples/campus.html #noremovals 0.000269
Elapsed: samples/campus.html #nonavbox 0.003011
Elapsed: samples/campus.html #nonavboxnoreferences 0.003754

Elapsed: samples/syrianwar.html #noremovals 0.000033
Elapsed: samples/syrianwar.html #nonavbox 0.191538
Elapsed: samples/syrianwar.html #nonavboxnoreferences 0.270370

Elapsed: samples/brazil.html #noremovals 0.000322
Elapsed: samples/brazil.html #nonavbox 0.102064
Elapsed: samples/brazil.html #nonavboxnoreferences 0.135670

@Jdlrobson Good results, numbers are pretty high which justifies Max disabling it on the past.

In any case, to really asses the validity of the numbers we should run the same tests on an environment as close as we can to production. The server's processing power and hhvm may change completely the numbers.

If that is not possible, (and probably in any case) we should focus in deploying an experiment to see the impact with real numbers.

  • Can we measure server response time per mobile mode? (stable vs beta)
phuedx added a subscriber: phuedx.Jan 20 2016, 10:55 AM

A good start @Jdlrobson.

It's important to remember that this is class-based element matching, which is already known to be more inefficient than matching by ID, and yet more inefficient than matching by tag name.

I created a simple script to measure the impact of making use of wgMFRemovableClasses
https://github.com/jdlrobson/mediawiki-extensions-MobileFrontend/commit/802dae33c97d956b0018a5b9df1b9e2897e231ce
I ran this in PHP 5.5.27 so in HHVM it might perform a lot better.

I created some static files, and then tried the formatter with different values of wgMFRemovableClasses
Results are given in seconds. As can be seen in the extreme cases 0.3s (120% increase in time taken) could be added to the processing time of an article when we remove bits of HTML from the parser.

I've found the results quite hard to parse initially. Could you:

  • tabulate the data
  • give the times in milliseconds
  • display the relative increase in processing time between the three conditions
  • display the size of the input (again, with reasonable units)
  • highlight the extreme cases

As the number of selectors increases time taken also increases... eep.

I'd expect this to be true. The selectors are processed serially.

It also seems we do not cache the result of MobileFormatter (which explains why pages with images disabled load slower).

Summary: With better caching use of MobileFormatter may be acceptable provided the number of selectors is minimal.

We've spoken about this before, but unfortunately never written it down anywhere (?). In particular we spoke about storing the output of MobileFormatter when a page is parsed in order to speed up output on MFβ – and perhaps not rely so heavily on Varnish for MFStable.

However, when you look at it, MobileFormatter is actually quite toothless: it does a lot of work for Main_Page and mostly <section>s up all other pages. Perhaps we could investigate getting the latter into the core parser?

@phuedx Here's a link. https://phabricator.wikimedia.org/T114072

Seems like they are looking into sectioning all levels, not just top level headings like mobile formatter does.

Maybe chime in there?

@Jdlrobson Good results, numbers are pretty high which justifies Max disabling it on the past.

The numbers are so small that if we could cache transformed output and do once, I'm not sure they are problematic. I think disabling it was not the right thing to do imo.

In any case, to really asses the validity of the numbers we should run the same tests on an environment as close as we can to production. The server's processing power and hhvm may change completely the numbers.

Agreed. I'm not quite sure how to do this. I could surface the profiling tool as a SpecialPage we could enable only on beta labs. Any thoughts around that?

If that is not possible, (and probably in any case) we should focus in deploying an experiment to see the impact with real numbers.

We could set this in beta and then measure impact on server response time... if we can measure that.

  • Can we measure server response time per mobile mode? (stable vs beta)

Not sure if we do that anywhere but that would make life a lot easier. @Peter do you know if we measure this anywhere?

It's important to remember that this is class-based element matching, which is already known to be more inefficient than matching by ID, and yet more inefficient than matching by tag name.

Sure. Given that classes are likely to be what we need to use this seems to be the way to go. (templates shouldn't be using IDs and tags without using custom tags are not plausible)

I've found the results quite hard to parse initially.

I've done as you suggested: https://docs.google.com/spreadsheets/d/1UUH1OeE2yh8VbYgjd4DzC5j1_6RZoNowAaLaG_JJpNI/edit?usp=sharing
Let me know if I can make this any clearer.

I'd expect this to be true. The selectors are processed serially.

Using one selector (nomobile) would actually be a better approach I suspect.

We've spoken about this before, but unfortunately never written it down anywhere (?). In particular we spoke about storing the output of MobileFormatter when a page is parsed in order to speed up output on MFβ – and perhaps not rely so heavily on Varnish for MFStable.

Yeh I think this is what we need to explore. If we can start caching the results of MobileFormatter it seems a no brainer to me that we start using this again.

However, when you look at it, MobileFormatter is actually quite toothless: it does a lot of work for Main_Page and mostly <section>s up all other pages. Perhaps we could investigate getting the latter into the core parser?

The idea here is to use MobileFormatter for removing HTML again - this will not get into core any time soon, so I'm reluctant to go down this path right now and be blocked by it (note my tests so far are only looking at the removals and I've not profiled the section wrapping at all). Support in Parsoid will clear the way for this but it seems there is little value in us focusing attention on this right now imho.

Using one selector (nomobile) would actually be a better approach I suspect.

This would certainly be faster than running N XPath queries against. A non-ideal optimisation would be to run one XPath query containing multiple selectors. As has been mentioned before, having a specific tag name to retrieve the element(s) with, e.g. DOMDocument#getElementsByTagName('nav-box'), would be ideal.

Jdlrobson lowered the priority of this task from High to Low.Feb 18 2016, 6:51 PM

Change 276809 had a related patch set uploaded (by Jdlrobson):
Strip rather than hide HTML

https://gerrit.wikimedia.org/r/276809

Change 276809 merged by jenkins-bot:
Strip rather than hide HTML

https://gerrit.wikimedia.org/r/276809