Page MenuHomePhabricator

[Bug] Extreme latency due to JavaScript parsing on DOM-heavy pages
Closed, ResolvedPublic

Description

Steps to reproduce

  1. Go to https://en.m.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units on a mobile device (also reproduced on https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units)
  2. Attempt to scroll or tap the menu

Expected results

menu opens, page scrolls, etc

Observed results

Page is unresponsive

Environments observed

  • Browser version: Chrome 73.0 , Firefox 66.0.2
  • OS version: Android 9
  • Device model: reproduced on Moto g7 plus, unsure of device of original report

Notes:

  • originally reported via email
  • original reporters noticed a javascript task running for a really long time (~7s)
  • It appears this is in large part due to the pageIssues.js code in Minerva and is most noticeable when a page has a lot of nodes (the reported pages on this ticket each have over 18,000 nodes). As seen from the screenshot of the profiler, the initPageIssues function takes around 3 - 5 seconds to complete and blocks the main thread. During this time, many this.$el.find( HEADING_SELECTOR ) calls from Page.js are made each taking around 50ms. If that call is cached as done in https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/MobileFrontend/+/503130/, this blocking from the pageIssues.js code is greatly reduced (from ~3 -5s to 78ms.
  • It also appears that the the execution of Toggler.js takes 700ms. It appears it is causing lots of layout thrashing which contributes to its slow execution.
  • In summary, the cause of this jank is multifactorial and there are multiple performance optimizations we can make in the MF code

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptApr 11 2019, 6:20 PM
ovasileva triaged this task as High priority.Apr 11 2019, 6:20 PM
ovasileva updated the task description. (Show Details)Apr 11 2019, 6:36 PM
ovasileva updated the task description. (Show Details)Apr 11 2019, 6:48 PM
Isarra added a subscriber: Isarra.Apr 11 2019, 7:05 PM

At a glance, I'd say there's a good chance this has something to do with minerva's special js-rendering of the table of contents, especially as combined with the very long tables of contents on these pages.

nray updated the task description. (Show Details)Apr 11 2019, 8:05 PM
nray updated the task description. (Show Details)

Change 503130 had a related patch set uploaded (by Nray; owner: Nray):
[mediawiki/extensions/MobileFrontend@master] Memoize Page.js headings

https://gerrit.wikimedia.org/r/503130

nray updated the task description. (Show Details)Apr 11 2019, 11:01 PM
nray updated the task description. (Show Details)Apr 11 2019, 11:04 PM

Change 503130 merged by jenkins-bot:
[mediawiki/extensions/MobileFrontend@master] Store Page.js headings in instance variable

https://gerrit.wikimedia.org/r/503130

ovasileva added a subscriber: nray.

pulling this into the sprint as it's actively being worked on. @nray - should we put it in QA?

nray updated the task description. (Show Details)Apr 15 2019, 5:27 PM
nray updated the task description. (Show Details)Apr 15 2019, 6:28 PM
nray updated the task description. (Show Details)
nray updated the task description. (Show Details)Apr 15 2019, 7:05 PM
nray updated the task description. (Show Details)Apr 15 2019, 9:12 PM
nray added a comment.Apr 16 2019, 6:50 AM

pulling this into the sprint as it's actively being worked on. @nray - should we put it in QA?

As we discussed some in standup, the patch I put up (that's now merged) tries to focus on reducing one cause (the page issues code) of the freezing that is happening. That patch greatly reduces it but doesn't completely eliminate it as there are other contributing factors that weren't addressed by my patch such as the Toggler.js code.

The patch was pretty small (2 lines) and shouldn't have changed the overall behavior of the page issues code. I took a look at the following pages to test if the page issues was still working:

These appeared okay to me. Given this and if it's okay with you, I would be content with holding off on QA for now and sending this ticket back into the backlog so that the remaining work can be estimated.

ovasileva set the point value for this task to 5.

moving to blocked on others so we can test once it's on production

phuedx added a subscriber: phuedx.Apr 25 2019, 1:40 PM

I'd meant to follow up on this:

Notes:

  • originally reported via email

The original reporter emailed to say that they consider this resolved on their side.

Per T220751#5113858, I'll move this task into Needs Analysis in the backlog and assign it to @nray so that they can add necessary detail about the toggler, for example.

phuedx reassigned this task from ovasileva to nray.Apr 25 2019, 1:40 PM
phuedx lowered the priority of this task from High to Normal.
phuedx moved this task from Upcoming to Needs Prioritization on the Readers-Web-Backlog board.
Jdlrobson added a subscriber: Jdlrobson.

I would be content with holding off on QA for now and sending this ticket back into the backlog so that the remaining work can be estimated.

It looks like this was estimated?

phuedx removed the point value for this task.

Per T220751#5137468, above.

Peter added a subscriber: Peter.
nray added a comment.Apr 26 2019, 4:31 PM

fyi, I am aiming to to do more analysis on this next Monday, April 29

Peter added a comment.EditedApr 29 2019, 12:57 PM

Hi @nray can we do this in two steps? Since we missed it, I think we should try out first to instrument it. We could use https://developer.mozilla.org/en-US/docs/Web/API/User_Timing_API (we already use that for the startup module)? If we can make sure that we measure it, and push it first, then we get some data from real users, before you push the fix. That way we can make sure it is a real issue today, fix it and enjoy the win of the fix and maybe blog about it :)

Let me know if you need any help! I started today to test out how we can find the issue in our automatic performance tests on mobile, that will be there when we have home for out mobile performance lab T197847.

kchapman moved this task from Inbox to Radar on the Performance-Team board.Apr 29 2019, 8:00 PM
kchapman edited projects, added Performance-Team (Radar); removed Performance-Team.
nray added a subscriber: pmiazga.May 1 2019, 2:48 PM

Hi @nray can we do this in two steps? Since we missed it, I think we should try out first to instrument it. We could use https://developer.mozilla.org/en-US/docs/Web/API/User_Timing_API (we already use that for the startup module)? If we can make sure that we measure it, and push it first, then we get some data from real users, before you push the fix. That way we can make sure it is a real issue today, fix it and enjoy the win of the fix and maybe blog about it :)
Let me know if you need any help! I started today to test out how we can find the issue in our automatic performance tests on mobile, that will be there when we have home for out mobile performance lab T197847.

Hi @Peter, I'm sorry for the delayed response. I'm unexpectedly OOO right now, but will look into this when I get back. We actually deployed a quick fix in If4f0bca79c4531c7be20ae17ad0ebe0b15cc75d0 to alleviate the biggest source of the main thread blockage, but I think there is other code (our section Toggler/Lazy image Loader code) that contributes to this problem and could use improvement. Instrumentation sounds good to me and might coincide with a ticket that @pmiazga created yesterday at https://phabricator.wikimedia.org/T222163

Jdlrobson renamed this task from [Bug] Extreme latency on some pages to [Bug] Extreme latency due to JavaScript parsing on DOM-heavy pages.May 1 2019, 6:15 PM
nray added a comment.May 24 2019, 1:41 AM

@Peter when you have a chance, I could use some help in interpreting what the following two long grey tasks each blocking the main thread for around 3 seconds represent. I've marked the tasks in question with question marks.

This is a profile of the https://en.m.wikipedia.org/wiki/Barack_Obama page on my phone (Pixel 2) with multiple sections open. Here are the steps I took during the profile.

  1. Start the Chrome performance profile
  2. Go to https://en.m.wikipedia.org/wiki/Barack_Obama on my phone after having opened multiple (top 3 - 4 sections) previously (those sections are saved to local storage)
  3. Shortly after first paint, scroll down fast until scrolling freezes (at least on my android phone)
  4. Stop the profile when the scrolling unfreezes and the page successfully paints

Here is also a video of the problem I'm seeing: https://practical-pasteur-ded885.netlify.com/#/16

I've uploaded the json of the profile to https://gofile.io/?c=Xsh7FA so you can import it if you'd like.

I'm confused what these tasks are and what causes them because they don't show any JS executing beneath them; they only show a few "Recalculate Styles" and "Update layer tree" rendering steps towards the start of each task, but then seemingly don't do anything for seconds.

Thank you for any additional insight you can provide!

nray added a comment.Jun 7 2019, 6:22 PM

I know this task has been lingering and so I wanted to provide an update on this:

  • I see https://phabricator.wikimedia.org/T206354 as part of our jank problems and think we should prioritize that one as it is probably the lowest hanging fruit. With that one, I believe that we could at least free up some memory by removing the .mw-mf-page-center composite layer and may be able to improve the menu animation performance as well.
  • There is still layout thrashing involved with the Toggler.js code that surfaces when the sections are open (i.e. when the screen size is large enough to default to all sections open). I'll create a new task for this.
  • As mentioned in https://phabricator.wikimedia.org/T220751#5209545, I'm still not sure what causes some of the long grey tasks (with no JS under them) I'm noticing when I profile certain pages. I think it might be correlated with DOM size and render tree size, but that's just a guess. Either way, I'm at my depth of knowledge with that and don't have any answers for those at this time. If anyone has any insight into what those are, I would love to hear it and that would help me determine a direction to go.
Peter claimed this task.Jun 12 2019, 6:37 PM

Oh @nray sorry I missed this, let me have a look first thing tomorrow.

Peter added a comment.EditedJun 13 2019, 8:46 AM

I've had a go today to first try to automate it, so we can find these kind of things in the future when we have a performance device lab. First @nray for the original, that has been solved right?

I had a go running on my Moto G5 using Browsertime (but you can probably use whatever tool you want like puppeteer or Selenium just to get the data).

I've made a script that navigates to the one of the problem pages and then test to scroll:
scroll.js

module.exports = async function (context, commands) {
 await commands.measure.start('Units');
 await commands.navigate('https://en.m.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units');

 for (var i = 0; i <= 500; i++) {
  await commands.js.run('window.scrollBy(0,100)');
  await commands.wait.byTime(2);
 }
 await commands.wait.byTime(1000);
 return commands.measure.stop();
};

On my phone I have Chrome 74, but on Browsertime we have Chromedriver 75, so I downloaded 74 and used that:

browsertime --android --cpu scroll.js --multi -n 1 --chrome.chromedriverPath ./chromedriver

That catches of long tasks nothing like 8 seconds. Longest one being 500+ ms.

Or should that long task still be there?

For the Obama page: Yep I get the same when I scroll super fast on my device, and I think you are right it Chrome try to render but isn't showing it correct in the trace log. We could maybe open a upstream task but I wonder is this scenario that people actually do? :)

I also did test with scrolling on Obama so we can automate that too it the future:
obama.js

module.exports = async function(context, commands) {
  await commands.measure.start('Obama');
  await commands.navigate('https://en.m.wikipedia.org/wiki/Barack_Obama');

  // Open a couple of sections
  await commands.click.bySelector('#mw-content-text > div > h2:nth-child(4) > div');
  await commands.click.bySelector('#mw-content-text > div > h2:nth-child(16) > div');
  await commands.click.bySelector('#mw-content-text > div > h2:nth-child(8) > div');

 //  Scroll
  for (var i = 0; i <= 200; i++) {
   await commands.js.run('window.scrollBy(0,300)');
  }
  await commands.wait.byTime(1000);
  return commands.measure.stop();
};

And running it like:

browsertime --android --cpu obama.js --multi -n 1 --chrome.chromedriverPath ./chromedriver

But I cannot reproduce that scrolling. The difference is that when I test manually I start to scroll directly before the loadEventEnd has happend. Let me try to emulate that too.

Peter added a comment.Jun 13 2019, 9:40 AM

At least in theory you can tell the webdriver that you wanna have the control back asap the navigation has started (pageLoadStrategy on the webdriver world).

I tested a couple variants of the script, trying to scroll without open any more content, wait a bit:

module.exports = async function(context, commands) {
  await commands.measure.start('Obama');
  await commands.navigate('https://en.m.wikipedia.org/wiki/Barack_Obama');

  await commands.wait.byTime(400);

  await commands.click.bySelector(
    '#mw-content-text > div > h2:nth-child(4) > div'
  );
  await commands.click.bySelector(
    '#mw-content-text > div > h2:nth-child(16) > div'
  );
  await commands.click.bySelector(
    '#mw-content-text > div > h2:nth-child(8) > div'
  );

  for (var i = 0; i <= 200; i++) {
    await commands.js.run('window.scrollBy(0,1000)');
    // await commands.wait.byTime(2);
  }
  await commands.wait.byTime(1000);
  return commands.measure.stop();
};

And then run it like this:

browsertime --android --cpu barack.js --multi -n 1 --chrome.chromedriverPath ./chromedriver --pageLoadStrategy "none" --pageCompleteCheck "return true;"

(I document it so I myself will remember).

But I cannot get the same behavior as when I try it manually (the HTML actually renders on the phone). There's one long task (1s+ but that is rendering the content). Also the timings DCL/L gets a little bit confused in the trace.

Personally I wouldn't spend any time on the layout that doesn't appear when crazy-scrolling, or do you see it on other cases @nray ?

For long tasks, I really wanna automate that so we can automatically alert on that. Scrolling/clicking on navigation could probably find a couple of cases if we choose pages wisely.

Peter reassigned this task from Peter to nray.Jun 13 2019, 9:41 AM

Signing over to you Nick to answer the questions.

One last thing: It's good to turn on screenshots, that makes easier to understand what happens (--chrome.enableTraceScreenshots). Also I tried to add annotations to the timeline console.timeStamp("Test"); but I cannot get them to be visible in the gui (I see them in the trace log though).

What's interesting is that we can see the cost of opening large sections. Here are the three clicks and one of them take 1 seconds to redraw the content:

But that's another task :)

nray added a comment.Jun 18 2019, 12:06 AM

@Peter Thanks so much for your detailed analysis! The browsertime tool looks cool.

I've had a go today to first try to automate it, so we can find these kind of things in the future when we have a performance device lab. First @nray for the original, that has been solved right?

Yes, the original problem described by this ticket (~7s latency on https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units et al.) was alleviated in https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/MobileFrontend/+/503130/ and has been in production for awhile now.

That catches of long tasks nothing like 8 seconds. Longest one being 500+ ms. Or should that long task still be there?

Nope, it shouldn't be there anymore so I'm glad you are not noticing that!

But I cannot get the same behavior as when I try it manually (the HTML actually renders on the phone). There's one long task (1s+ but that is rendering the content). Also the timings DCL/L gets a little bit confused in the trace.

I could only reproduce on my mobile device as well. I was not able to reproduce on my laptop either. That's interesting to see two DCL's! I've never seen that before :)

Personally I wouldn't spend any time on the layout that doesn't appear when crazy-scrolling, or do you see it on other cases @nray ?

I'll keep my eye out for other instances, but so far I think I've only been able to replicate it when crazy-scrolling. Sounds good to me though, I think between T225946 and https://phabricator.wikimedia.org/T206354#5209465, we have a good amount of performance related improvements to start with.

What's interesting is that we can see the cost of opening large sections. Here are the three clicks and one of them take 1 seconds to redraw the content:

Yeah the section toggling code is a problem. I noticed/added that as a bullet point in the Notes section of the task description when I was first analyzing it. I'm really glad to see that T225946 was created today!

nray added a comment.Jun 18 2019, 12:08 AM

@ovasileva Would you mind if I closed this task out? It looks like the original problem that the task was created for has been resolved, and we have more granular tasks in T225946 and T206354 to deal with the remaining performance issues.

nray closed this task as Resolved.Jun 24 2019, 7:09 PM

After discussing with @ovasileva, I'm closing this task out as the original problem has been resolved (and has been in production for awhile now). Furthermore, the additional toggler/lazy image load performance problems that I listed in the task description are captured well in T225946.