MobileFormatter has quadratic performance
Closed, ResolvedPublic3 Estimated Story Points

Assigned To
Authored By
matmarex
Sep 6 2022, 2:36 AM

Description

MobileFormatter has quadratic performance: the time it takes grows with the square of the size of the input, so doubling the input roughly quadruples the processing time.

This is the root cause of timeouts noted at T204606#5484240, which led to the transforms being disabled on large pages (T248796), which is very sad because it's exactly those pages that would benefit the most from them.

I'd like to fix this.

Acceptance criteria

QA

  • Check the lead paragraph transform is still running
  • Check images are still transformed for lazy loading
  • Check sections are still collapsible

Event Timeline

Count    Headings before   Headings after   Images before   Images after
1000     1.237 s           0.092 s          0.131 s         0.030 s
2000     3.154 s           0.125 s          0.300 s         0.045 s
3000     6.266 s           0.182 s          0.475 s         0.063 s
4000     8.798 s           0.240 s          0.752 s         0.075 s
5000     19.884 s          0.298 s          1.043 s         0.095 s
6000     29.949 s          0.355 s          1.733 s         0.103 s
7000     46.532 s          0.411 s          2.007 s         0.113 s
8000     (timed out)       0.476 s          2.669 s         0.134 s
9000     (timed out)       0.497 s          3.516 s         0.150 s
10000    (timed out)       0.576 s          4.812 s         0.163 s
[Charts: headings timings before and after, and images timings before and after, plotted against element count]
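To sanity-check the "quadratic" claim against the table above, one can fit a power law t ≈ c·nᵏ to the timings: the least-squares slope of log t versus log n estimates the exponent k. A quick sketch, using the headings numbers from the table (the script itself is illustrative and not part of the task):

```python
import math

# (element count, seconds) pairs from the "headings" columns above;
# the timed-out rows are excluded from the "before" series.
before = [(1000, 1.237), (2000, 3.154), (3000, 6.266), (4000, 8.798),
          (5000, 19.884), (6000, 29.949), (7000, 46.532)]
after = [(1000, 0.092), (2000, 0.125), (3000, 0.182), (4000, 0.240),
         (5000, 0.298), (6000, 0.355), (7000, 0.411), (8000, 0.476),
         (9000, 0.497), (10000, 0.576)]

def power_law_exponent(samples):
    """Least-squares slope of log(t) vs log(n), i.e. k in t ~ c * n**k."""
    xs = [math.log(n) for n, _ in samples]
    ys = [math.log(t) for _, t in samples]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

print(round(power_law_exponent(before), 2))  # close to 2: quadratic
print(round(power_law_exponent(after), 2))   # close to 1: roughly linear
```

The "before" exponent comes out near 2 and the "after" exponent near 1, matching the quadratic-vs-linear picture in the charts.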
Legoktm added a subscriber: Legoktm.

This would be a good candidate for Performance-Team-publish I think!

Change 829880 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/MobileFrontend@master] Replace getElementsByTagName() with DOMCompat::querySelectorAll() for performance

https://gerrit.wikimedia.org/r/829880

Change 829881 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/MobileFrontend@master] Set maxImages and maxHeadings to very high values

https://gerrit.wikimedia.org/r/829881

Change 829882 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/MobileFrontend@master] Report MobileFormatter timings in a HTML comment

https://gerrit.wikimedia.org/r/829882

@Legoktm wondered why I didn't use the same scale for the "before" and "after" charts. Well, there are certain issues with that presentation :D

[Charts: images and headings timings shown before, after, and side by side on a shared "comparison" scale]

@Legoktm wondered why I didn't use the same scale for the "before" and "after" charts. Well, there are certain issues with that presentation :D

I think it illustrates the impact of your change perfectly :D

Pages with 1000-10000 headings/images, for testing (import this to your local wiki):

Truly impressive results! I'm just making sure (via async communication with Bartosz cc'ed) we have capacity to push through and monitor this change before merging.

It took me a while to find this, but the whole issue is caused by using getElementsByTagName(). You'd think it's just a simple linear tree walk, but apparently it takes quadratic time. I couldn't find what makes the implementation so bad, but I'd be curious if anyone else tracked it down.

Replacing all calls to getElementsByTagName with DOMCompat::querySelectorAll() (provided by Parsoid) instantly fixes the problem.

DOMCompat::querySelectorAll() is based on Zest, which replaced PHP DOM's getElementsByTagName() with its own XPath-based implementation in https://gerrit.wikimedia.org/r/c/mediawiki/libs/Zest/+/494559, written by @cscott to fix performance problems that @Tgr discovered in T215000. This suggests we should do a code search for uses of getElementsByTagName() and replace them with Zest::find(), or with Parsoid's DOMCompat::querySelectorAll() or DOMCompat::getElementsByTagName(), both of which are wrappers around Zest.
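One plausible mechanism for the quadratic behavior (an assumption; I have not verified this against PHP DOM's actual implementation) is a lazily evaluated node list, where every indexed access re-walks the tree from the start. Iterating such a list by index then touches O(n²) nodes even though each individual access looks cheap. A toy Python model of that pattern:

```python
class LazyNodeList:
    """Toy model of a lazily evaluated node list: every indexed access
    re-scans the underlying node sequence from the beginning.
    (Hypothetical illustration only, not PHP DOM's real code.)"""

    def __init__(self, nodes):
        self._nodes = nodes
        self.visits = 0  # total nodes touched, to expose the cost

    def __len__(self):
        return len(self._nodes)

    def __getitem__(self, i):
        # Re-walk from the start on every access, as a cursor with no
        # caching would.
        for _ in range(i + 1):
            self.visits += 1
        return self._nodes[i]

n = 1000
lazy = LazyNodeList(list(range(n)))
for i in range(len(lazy)):
    _ = lazy[i]
print(lazy.visits)  # n*(n+1)//2 = 500500 node touches for n = 1000
```

A snapshot-based implementation (like an XPath query result) touches each node once, which is why swapping in Zest makes the walk linear.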

Flow, DonationsInterface, Translate, GWToolset, and TextExtracts seem to be the other extensions that could potentially be updated.

This suggests we should do a code search for uses of getElementsByTagName() and replace them with Zest::find(), or with Parsoid's DOMCompat::querySelectorAll() or DOMCompat::getElementsByTagName(), both of which are wrappers around Zest.

And also a lint rule to prevent it being used in the future.
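The code search suggested above can be scripted. Here is a hypothetical helper (the directory layout and the `find_call_sites` name are assumptions for illustration) that reports getElementsByTagName() call sites in a tree of PHP files:

```python
import re
from pathlib import Path

# Matches method calls like $doc->getElementsByTagName( 'img' ).
PATTERN = re.compile(r'->\s*getElementsByTagName\s*\(')

def find_call_sites(root):
    """Yield (path, line_number, line) for each call site under root."""
    for path in Path(root).rglob('*.php'):
        text = path.read_text(errors='replace')
        for lineno, line in enumerate(text.splitlines(), 1):
            if PATTERN.search(line):
                yield path, lineno, line.strip()

if __name__ == '__main__':
    # 'extensions' is a placeholder path; point this at a real checkout.
    for path, lineno, line in find_call_sites('extensions'):
        print(f'{path}:{lineno}: {line}')
```

In practice Wikimedia's hosted code search or a phpcs sniff would be the durable way to enforce this; the script is just a quick local pass.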

Change 829881 abandoned by Bartosz Dziewoński:

[mediawiki/extensions/MobileFrontend@master] Set maxImages and maxHeadings to very high values

Reason:

Squashed into https://gerrit.wikimedia.org/r/829880

https://gerrit.wikimedia.org/r/829881

Change 829882 merged by jenkins-bot:

[mediawiki/extensions/MobileFrontend@master] Report MobileFormatter timings in a HTML comment

https://gerrit.wikimedia.org/r/829882

[…]

[Quoted charts: images and headings timings shown before, after, and on a shared "comparison" scale]

We also have the Grafana dashboard: Backend Pageview Timing which may be useful to monitor as this rolls out.

The status quo, as it has been for nearly a decade now, is that we incur a significant "mobilefrontend tax" on pageviews, nearly doubling the latency of pageviews for logged-in users on mobile compared to, e.g., the same device/network using the canonical domain.

Screenshot 2022-09-09 at 21.28.32.png (615×2 px, 158 KB)

@thiemowmde pointed out in code review that we'll need to change the limits in WMF config (https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/6d4325b36599a782c8943965da37aa319908d5a3/wmf-config/InitialiseSettings.php#18547) to actually allow transforms on large Wikipedia pages, which was my real goal here.

The status quo, as it has been for nearly a decade now, is that we incur a significant "mobilefrontend tax" on pageviews, nearly doubling the latency of pageviews for logged-in users on mobile compared to, e.g., the same device/network using the canonical domain.

Part of this is that all MF transformations are applied on every page request. In DiscussionTools we apply our complex page transformations before the content is placed in the parser cache, and only a light set of substitutions is done on page request (accounting for things like user language and preferences).

It should be very possible to get the mobile site running at the same speed as desktop, or very close.
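The split described above can be sketched as a two-phase pipeline. All names here are hypothetical and only illustrate the idea (they are not DiscussionTools' or MobileFrontend's actual code): expensive, user-independent transforms run once and are cached, while cheap per-user substitutions run on every request.

```python
def precache_transforms(html):
    """Expensive, user-independent work: runs once, result is cached.
    A stand-in for section wrapping, lazy-image rewriting, etc."""
    return html.replace('<img ', '<img loading="lazy" ')

PARSER_CACHE = {}  # toy stand-in for MediaWiki's parser cache

def get_cached(page_id, raw_html):
    if page_id not in PARSER_CACHE:
        PARSER_CACHE[page_id] = precache_transforms(raw_html)
    return PARSER_CACHE[page_id]

def per_request_substitutions(html, user_lang):
    """Cheap per-user work: runs on every request."""
    return html.replace('{{USER_LANG}}', user_lang)

def render(page_id, raw_html, user_lang):
    return per_request_substitutions(get_cached(page_id, raw_html), user_lang)
```

The point of the design is that the O(page size) work is paid once per parse rather than once per pageview, so large pages stop taxing every request.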

Change 829880 merged by jenkins-bot:

[mediawiki/extensions/MobileFrontend@master] Replace getElementsByTagName() with DOMCompat::querySelectorAll() for performance

https://gerrit.wikimedia.org/r/829880

ovasileva lowered the priority of this task from High to Medium. Sep 13 2022, 5:16 PM

Moved to blocked since this is rolling out on the train and the next step involves reviewing real-world metrics.

The change has rolled out during the week of 13-15 September. I looked at the dashboard linked by @Krinkle and there doesn't seem to be any change around that time: https://grafana.wikimedia.org/d/QLtC93rMz/backend-pageview-timing?orgId=1&from=1662336000000&to=1664150399000. This is expected, because we still have the limits in place that disable the transforms on large pages, and there is not much difference in performance on small pages.

I think we can relax/remove the limits now, and then confirm that this doesn't worsen the timings (…much), before closing this task.

Change 835245 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[operations/mediawiki-config@master] wgMFMobileFormatterOptions: Set maxImages and maxHeadings to very high values

https://gerrit.wikimedia.org/r/835245

Jdlrobson added a subscriber: nray.

Hi sorry for dropping the ball on this. I too am not seeing any change in backend performance.

One thing I am concerned about, however, is frontend performance. On a few older test devices, with the mobile transform limits relaxed, the device was crashing due to over-use of memory on https://en.wikipedia.org/wiki/User:Jane023/Paintings_in_the_Metropolitan_Museum_of_Art. This didn't happen before, as the lazy-loading code previously didn't run on these pages; I think what is happening is that many images are requested at the same time in JavaScript.

Do we log crash reports on these pages? If not could we add that as a prerequisite of removing/relaxing the limit?

Also paging @nray for any guidance around how to measure/monitor frontend performance on these complex pages. Perhaps it makes sense to have a specific webpagetest for these extreme cases or mark them in navigation timing data?

I do not think that is a blocker. If you're going to look into it, file a task.

Change 835245 merged by jenkins-bot:

[operations/mediawiki-config@master] wgMFMobileFormatterOptions: Set maxImages and maxHeadings to very high values

https://gerrit.wikimedia.org/r/835245

Mentioned in SAL (#wikimedia-operations) [2022-09-26T20:03:49Z] <samtar@deploy1002> Started scap: Backport for [[gerrit:835245|wgMFMobileFormatterOptions: Set maxImages and maxHeadings to very high values (T317070)]]

Mentioned in SAL (#wikimedia-operations) [2022-09-26T20:04:11Z] <samtar@deploy1002> samtar and matmarex: Backport for [[gerrit:835245|wgMFMobileFormatterOptions: Set maxImages and maxHeadings to very high values (T317070)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-09-26T20:10:03Z] <samtar@deploy1002> Finished scap: Backport for [[gerrit:835245|wgMFMobileFormatterOptions: Set maxImages and maxHeadings to very high values (T317070)]] (duration: 06m 13s)

I see collapsible sections and lazy-loaded images on https://en.m.wikipedia.org/wiki/List_of_works_by_Vincent_van_Gogh now. The debug comment with timing states <!-- MobileFormatter took 0.236 seconds --> (it varies between 0.2 and 0.25 on repeat page loads).
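The debug comment mentioned above can be scraped for ad-hoc monitoring of these pages. A small sketch (the comment format comes from the patch described earlier in this task; everything else, including the function name, is assumed):

```python
import re

TIMING_RE = re.compile(r'<!-- MobileFormatter took ([0-9.]+) seconds -->')

def extract_formatter_timing(html):
    """Return MobileFormatter's self-reported duration in seconds,
    or None if the page carries no timing comment."""
    m = TIMING_RE.search(html)
    return float(m.group(1)) if m else None

page = '...article HTML... <!-- MobileFormatter took 0.236 seconds -->'
print(extract_formatter_timing(page))  # 0.236
```

Fetching a handful of large pages and extracting this number on a schedule would give a cheap trend line while the Grafana dashboard covers the aggregate picture.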

Change 835265 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/MobileFrontend@master] MobileFormatter: Remove maxImages and maxHeadings limits

https://gerrit.wikimedia.org/r/835265

The fact that devices are crashing now that were not crashing before is concerning here to the web team. We would recommend rolling back these changes until we understand the impact better. To do this, it sounds like we need to add some instrumentation. I propose we discuss this further and decide on what metrics we need to add and consult.

To document the current state:

Before: these pages were viewable (but without section collapsing or lazy-loaded images).
Now: the page crashes on old devices and causes severe frontend lag on modern devices.

Here is a screengrab I took on a Samsung Galaxy A53, note the inability to scroll after I click the heading around the 2s mark:

ezgif-1-e6ab3314a3.gif (1×600 px, 2 MB)

The fact that devices are crashing now that were not crashing before is concerning here to the web team. We would recommend rolling back these changes until we understand the impact better. To do this, it sounds like we need to add some instrumentation. I propose we discuss this further and decide on what metrics we need to add and consult.

I would like to know the details about the devices. I did not experience such problems when testing on a 2016 iPhone SE. Unfortunately my old Android phone has bricked itself recently.

As far as I know, there is no way to instrument a genuine browser crash (or browser tab crash), because the instrumentation code itself would have to be a part of the tab that just crashed. Unless you mean something else by "crash".

Can you confirm that you tested after the config change to enable this on large pages yesterday, and not before? Since we're rendering less content now, I would expect the performance on slow devices to be better, rather than worse.

To document the current state:

Before: these pages were viewable (but without section collapsing or lazy-loaded images).
Now: the page crashes on old devices and causes severe frontend lag on modern devices.

I would also add:
Before: The pages took multiple seconds to display
After: The pages load instantly (if the sections are collapsed)

Here is a screengrab I took on a Samsung Galaxy A53, note the inability to scroll after I click the heading around the 2s mark:
F35536443

This seems normal to me. It has not crashed, it is busy rendering the newly expanded content (which is extremely huge on this page, so it takes a while). You can see the content appear after a few seconds at the end of your video. I see similar behavior on the 2016 iPhone SE, and on my laptop as well.

Previously, all of this work would have happened on page load, so this behavior seems like an improvement to me.

I haven't had a chance to profile the state of things before the merge, but I just profiled a click on the "painting" section (pictured below) on https://en.m.wikipedia.org/wiki/User:Jane023/Paintings_in_the_Metropolitan_Museum_of_Art and it looks like most of the time (4s!) is indeed spent in rendering steps (recalc style, layout) rather than in the JS execution which only accounted for 10ms. Those rendering steps are taking a long time because of the unusually large number of elements in the DOM which leads to a huge rendering tree.

My guess (although I haven't verified it) is that a similar amount of time was being spent rendering the page on page load prior to this change (when the sections were uncollapsed), but that time has now shifted to clicks on sections, as that is when the browser has to do the bulk of the rendering work.

Screen Shot 2022-09-27 at 12.13.10 PM.png (828×3 px, 516 KB)

I should also note that this was on my MacBook Pro M1 with 6x CPU throttling. I'd love to profile this on a lower-end device, which would be more representative of most devices, but I'm having issues getting the profiler to work on my phone (my phone is dying).

Here's a full profile I just took of the page loading and then a click on the section. As mentioned in the previous post, the bulk of the rendering work happens after the click on the section, but something I didn't catch before is that during page load the initMediaViewer function and its iteration over thumbnails is also blocking the main thread for a long time (1.38 s). Do we know if this change had any influence on that function's execution time? Either way, we should probably break that function up into smaller chunks of work, e.g. with setTimeout or requestIdleCallback, to give the main thread more time to breathe and respond to user input.
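Chunking a long loop so control returns to the caller between batches is a generic pattern; in the browser the "run later" hook would be setTimeout or requestIdleCallback. The shape of it can be sketched in Python (the function name, batch size, and `schedule` parameter are arbitrary illustrations, not MobileFrontend code):

```python
from collections import deque

def process_in_chunks(items, handle, chunk_size=50, schedule=None):
    """Handle items in fixed-size chunks, handing control back between
    chunks. `schedule` stands in for setTimeout/requestIdleCallback:
    it receives a continuation to run later instead of blocking."""
    queue = deque(items)

    def run_chunk():
        for _ in range(min(chunk_size, len(queue))):
            handle(queue.popleft())
        if queue:
            # With no scheduler, run the next chunk immediately; in a
            # browser this call would be deferred to an idle period.
            (schedule or (lambda cont: cont()))(run_chunk)

    run_chunk()

processed = []
process_in_chunks(range(120), processed.append, chunk_size=50)
print(len(processed))  # 120
```

With a real idle-callback scheduler plugged in, the thumbnail loop would block the main thread for at most one chunk at a time instead of 1.38 s in one stretch.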

Screen Shot 2022-09-27 at 1.26.45 PM.png (1×4 px, 384 KB)

My guess (although I haven't verified it) is that a similar amount of time was being spent rendering the page on page load prior to this change (when the sections were uncollapsed), but that time has now shifted to clicks on sections, as that is when the browser has to do the bulk of the rendering work.

Yes, that's consistent with what I was seeing. Before, the page would take a long time to load (but it would load); now it loads (freezing in some cases) and freezes on section collapsing. And I think it's not just the DOM tree: for lazy-loading images we observe every img tag and load them as they scroll into view (https://github.com/wikimedia/mediawiki-extensions-MobileFrontend/blob/2746dc7b41a74f4f5eba21db858608b8cfbcffc5/src/mobile.init/lazyLoadedImages.js#L26).

This seems normal to me. It has not crashed, it is busy rendering the newly expanded content

For reference, I said "The page crashes on old devices and causes severe frontend lag on modern devices."

This is a modern device. The gif demonstrates severe frontend lag. I have been unable to record an older device, but as an example a Moto G 2nd Generation on BrowserStack seems to crash. This wasn't happening in my testing prior to the change; the page was slow to load but usable (you could read all the content with minimal lag).

I think at minimum we should understand the scale of this problem better (is this an edge case or quite common?) and improve the frontend code to handle this. However, now is really not a good time for the web team to do that given the current English Wikipedia RFC and ideally we would like to roll this back for now.

I don't think the crashes you're seeing are new in the new version of the code (at the time you wrote T317070#8262164, it was not enabled on that page). Even if they are, I don't think avoiding that issue on a single ridiculous page is worth the usability problem on all other non-ridiculously large pages.

However, this is not really that important to me, so do what you want. Please reopen the other tasks about the problem if you revert the fix.

We talked about this in the web/editing team sync today and we will revert the configuration change and open a new ticket for web backlog for dealing with improving the client side for these pages. I'll then resolve this task!

Jdlrobson closed this task as Resolved.EditedSep 30 2022, 1:24 AM
Jdlrobson updated the task description. (Show Details)

I've created T318991 for the follow-up work for web team on the frontend side of things.
From the backend perspective, this is a great change and we are very grateful for it happening!

(Note the config change was removed as part of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/835246, so MobileFrontend master is now the source of truth for the limits.)