
[Spike, 8hrs] Some requests time out after 60 seconds in MobileFrontend transforms. What to do?
Closed, Resolved · Public · PRODUCTION ERROR

Description

Up until recently, PHP requests had no timeout limit. Tim Starling fixed this. The new limit exposes a few problems with certain pages rendering on mobile.

LazyImageTransform times out on complicated pages. Our error log says:

Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.32.0-wmf.20/extensions/MobileFrontend/includes/transforms/LazyImageTransform.php on line 177

or

PHP Fatal Error from line 112 of /srv/mediawiki/php-1.32.0-wmf.20/extensions/MobileFrontend/includes/MobileContext.php: entire web request took longer than 60 seconds and timed out

Example stacktrace:

#0 /srv/mediawiki/php-1.32.0-wmf.20/extensions/MobileFrontend/includes/MobileContext.php(112): NO_FUNCTION_GIVEN()
#1 /srv/mediawiki/php-1.32.0-wmf.20/extensions/MobileFrontend/includes/MobileFrontend.hooks.php(513): MobileContext::singleton()
#2 /srv/mediawiki/php-1.32.0-wmf.20/extensions/MobileFrontend/includes/MobileFrontend.hooks.php(541): MobileFrontendHooks::shouldMobileFormatSpecialPages(User)
#3 /srv/mediawiki/php-1.32.0-wmf.20/includes/Hooks.php(174): MobileFrontendHooks::onSpecialPageInitList(array)
#4 /srv/mediawiki/php-1.32.0-wmf.20/includes/Hooks.php(202): Hooks::callHook(string, array, array, NULL)
#5 /srv/mediawiki/php-1.32.0-wmf.20/includes/specialpage/SpecialPageFactory.php(281): Hooks::run(string, array)
#6 /srv/mediawiki/php-1.32.0-wmf.20/includes/specialpage/SpecialPageFactory.php(297): MediaWiki\Special\SpecialPageFactory->getPageList()
#7 /srv/mediawiki/php-1.32.0-wmf.20/includes/specialpage/SpecialPageFactory.php(662): MediaWiki\Special\SpecialPageFactory->getAliasList()
#8 /srv/mediawiki/php-1.32.0-wmf.20/includes/specialpage/SpecialPage.php(99): MediaWiki\Special\SpecialPageFactory->getLocalNameFor(string, boolean)
#9 /srv/mediawiki/php-1.32.0-wmf.20/includes/specialpage/SpecialPage.php(85): SpecialPage::getTitleValueFor(string, boolean, string)
#10 /srv/mediawiki/php-1.32.0-wmf.20/skins/MinervaNeue/includes/skins/SkinMinerva.php(620): SpecialPage::getTitleFor(string)
#11 /srv/mediawiki/php-1.32.0-wmf.20/skins/MinervaNeue/includes/skins/SkinMinerva.php(1248): SkinMinerva->getDiscoveryTools()
#12 /srv/mediawiki/php-1.32.0-wmf.20/skins/MinervaNeue/includes/skins/SkinMinerva.php(1269): SkinMinerva->getMenuData()
#13 /srv/mediawiki/php-1.32.0-wmf.20/skins/MinervaNeue/includes/skins/SkinMinerva.php(342): SkinMinerva->getSkinConfigVariables()
#14 /srv/mediawiki/php-1.32.0-wmf.20/includes/skins/SkinTemplate.php(224): SkinMinerva->initPage(OutputPage)
#15 /srv/mediawiki/php-1.32.0-wmf.20/includes/OutputPage.php(2402): SkinTemplate->outputPage()
#16 /srv/mediawiki/php-1.32.0-wmf.20/includes/MediaWiki.php(875): OutputPage->output(boolean)
#17 /srv/mediawiki/php-1.32.0-wmf.20/includes/MediaWiki.php(887): Closure$MediaWiki::main()
#18 /srv/mediawiki/php-1.32.0-wmf.20/includes/MediaWiki.php(525): MediaWiki->main()
#19 /srv/mediawiki/php-1.32.0-wmf.20/index.php(42): MediaWiki->run()
#20 /srv/mediawiki/w/index.php(3): include(string)
#21 {main}

Between September 10th and September 17th, there were 42 occurrences of this timeout.

Developer notes

Options

Drop support for certain namespaces

We could skip the LazyImageTransform on pages with this many images.
Given this is only impacting user pages, and the lazy image transform was predominantly written for articles in the main namespace, I think it's acceptable to limit this transform to NS_MAIN or to blacklist the NS_USER namespace (see the sketch after the acceptance criteria).

Trade-offs: the transform is working fine in all of those other namespaces.

Acceptance criteria

  • The LazyImageTransform only runs on namespaces it has been configured to run on.
  • It does not run on NS_USER
  • It runs on NS_MAIN (main namespace)
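
As a rough sketch of the gating, assuming a hypothetical config variable MFLazyImageTransformNamespaces ($config, $title, $transforms and $skipSmallImages stand in for the existing wiring; the real setting name would be decided in the follow-up task):

// Hypothetical sketch: gate the transform on a configurable namespace list.
// 'MFLazyImageTransformNamespaces' is an invented name, not an existing setting.
$allowedNamespaces = $config->get( 'MFLazyImageTransformNamespaces' ); // e.g. [ NS_MAIN ]
if ( in_array( $title->getNamespace(), $allowedNamespaces, true ) ) {
    // ...register LazyImageTransform exactly as today...
    $transforms[] = new LazyImageTransform( $skipSmallImages );
}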

Limit number of images

A risky change, but we could count the images on a page and skip the transform above some threshold. This would help this one case, but there may be other reasons the MobileFormatter (which is expensive) doesn't finish.

Category/title blacklist

We could skip the formatter on pages that have a special (hidden) category marking them as unsuitable for mobile formatting. This would give us a way to exempt pages with performance issues like this one.
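
A rough sketch of that check (the category name is invented; Title::getParentCategories() returns prefixed category names as array keys):

// Sketch: skip the formatter when the page is in a hidden opt-out category.
// The category name below is hypothetical.
$categories = $title->getParentCategories(); // [ 'Category:Name' => page, ... ]
if ( isset( $categories['Category:Pages_unsuitable_for_mobile_formatting'] ) ) {
    return $html; // serve the unformatted parser output
}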

Spike questions

Output: a comment in this task. As part of sign-off, a concrete, actionable task should be created.

Using logstash

  • For the pages that exhibit the problem, is there anything common to them, e.g. namespace, number of images, size of document?
  • How many pages are we talking about?

Techniques for disabling formatting

  • Explore techniques for identifying and disabling the MobileFormatter on large pages.
    • Can we count number of images? How? Regex? Any page properties we can use?
  • Is there any way we can defer the page formatting to the job queue?

Event Timeline


@Jdlrobson, what are we syncing on? I saw errors for https://en.wikipedia.org/wiki/User:West.andrew.g/2013_popular_pages which has some 11k images.

@pmiazga is that the page you were seeing timeout errors on?

Maybe there is a faster, less resource-heavy image transformation; that's the first thing to check. Second, maybe we can just parse the first X images and then skip the rest. I agree, a page with 11k images is huge, and most probably we should skip transformation for such pages.
Do we know what the average image count is?

I agree, a page with 11k images is huge, and most probably we should skip transformation for such pages.

Since we support skipping small images, why not skip pages with lots of images too? I don't think we should spend a lot of time optimizing our code for this one page unless there are numerous other pages.

yes, we're skipping small images, but somehow it doesn't work. We can add a simple check so that only the first X images are transformed; that should be pretty easy.
I checked the code and it's pretty simple, but most probably the performance is pretty bad (we're doing many DOM operations on a pretty big DOMDocument). I'll do some small research, try to do things differently (like cloning elements instead of creating a new one every time), and do some performance testing using https://en.wikipedia.org/wiki/User:West.andrew.g/2013_popular_pages

yes, we're skipping small images

I think we support skipping small images, but it's currently disabled.

Jdlrobson triaged this task as Medium priority. Sep 19 2018, 4:25 PM
Jdlrobson renamed this task from Some requests takes over 60 seconds to be processed by MobileFormatter, thus will not display to Some requests takes over 60 seconds to be processed by MobileFormatter when page has lots of images thus will not display. Sep 19 2018, 5:39 PM
Jdlrobson updated the task description.
Jdlrobson updated the task description.

The easiest thing to do here is to just turn off the transform on the talk namespace and then re-evaluate. I'm guessing this is not going to be a problem on real-world articles, as they'd be too slow to load and would be unusable.

Let me know if you have a problem with this approach.

I'd like to add one more acceptance criterion: parse only the first 1000 images, then skip the rest. Anyone against?
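
A sketch of what that cap could look like inside the transform (the helper name is invented; note the live DOMNodeList has to be snapshotted because the transform mutates the tree):

// Sketch: only transform the first $max images and leave the rest untouched.
$max = 1000; // the proposed cutoff
// getElementsByTagName() returns a live list, so snapshot it before mutating.
$images = iterator_to_array( $doc->getElementsByTagName( 'img' ) );
foreach ( array_slice( $images, 0, $max ) as $img ) {
    $this->makeImageLazy( $img ); // hypothetical stand-in for the real per-image work
}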

Jdlrobson renamed this task from Some requests takes over 60 seconds to be processed by MobileFormatter when page has lots of images thus will not display to Disable LazyImageTransform on user namespace: Some requests takes over 60 seconds to be processed by MobileFormatter when page has lots of images thus will not display. Sep 19 2018, 5:41 PM
Jdlrobson updated the task description.
Jdlrobson renamed this task from Disable LazyImageTransform on user namespace: Some requests takes over 60 seconds to be processed by MobileFormatter when page has lots of images thus will not display to Some requests takes over 60 seconds to be processed by MobileFormatter when page has lots of images thus will not display. Sep 19 2018, 6:13 PM
Jdlrobson lowered the priority of this task from Medium to Low. Edited Sep 19 2018, 6:15 PM
Jdlrobson updated the task description.

Since there is only one page with timeout issues across all of our projects, I'm marking this as low.
All solutions seem kind of risky right now.

Other possible solutions have been added to the description for future reference.

Jdlrobson renamed this task from Some requests takes over 60 seconds to be processed by MobileFormatter when page has lots of images thus will not display to Some requests takes over 60 seconds to be processed: what to do?. Sep 20 2018, 11:21 PM

It's not just this one page; we're seeing the issue in a number of places.
Turning large documents into DOM trees is expensive, so that doesn't really surprise me for MobileViewAPI and MobileFormatter.

[W6OxIgrAEEYAAAkGj@4AAAAN] /wiki/Alamitos_Beach,_Long_Beach,_California PHP Fatal Error from line 239 of /srv/mediawiki/php-1.32.0-wmf.22/extensions/MobileFrontend/includes/transforms/MoveLeadParagraphTransform.php: entire web request took longer than 60 seconds and timed out

[W6Ow7wrAEG4AADmleyMAAAAS] /w/api.php?action=mobileview&format=json&formatversion=2&prop=text%7Csections%7Clanguagecount%7Cthumb%7Cimage%7Cid%7Crevision%7Cdescription%7Clastmodified%7Cnormalizedtitle%7Cdisplaytitle%7Cprotection%7Ceditable&onlyrequestedsections=1&sections=0&sectionprop=toclevel%7Cline%7Canchor&noheadings=true&page=Liste_des_villes_d%27Inde&thumbsize=799 PHP Fatal Error from line 346 of /srv/mediawiki/php-1.32.0-wmf.22/extensions/MobileFrontend/includes/api/ApiMobileView.php: entire web request took longer than 60 seconds and timed out

Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.32.0-wmf.22/extensions/WikidataPageBanner/includes/WikidataPageBanner.functions.php on line 342
Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.32.0-wmf.22/extensions/MobileFrontend/includes/specials/MobileSpecialPageFeed.php on line 54

Not sure what to do here. The server is doing the right thing to stop the requests.

I believe this is being tracked more broadly by RelEng in T204871.

It seems that such a problem occurs on French Wikisource with a mathematical book written almost entirely in ''math'' tags. This chapter is unavailable; it includes more than 15k ''math'' calls. What to do?

Krinkle subscribed.

Detaching from T204871, which is about requests that are normally fast being slow in the first minutes after a deployment (regardless of whether they are killed by an execution timeout).

This task, though, is about certain MobileFrontend requests that have presumably always been slow in production, but which, as of last year, are being killed by the new execution timeout.

Krinkle renamed this task from Some requests takes over 60 seconds to be processed: what to do? to Some requests time out after 60 seconds in MobileFrontend transforms. What to do?. Feb 6 2019, 3:07 AM

@Krinkle Just to clarify: the problem we are affected by is present on the ''normal'' interface, not (only) with MobileFrontend requests.

I'd love to find out what takes most of the time during the web request. The error message does not mean that processing that single part of the code took more than 60 seconds. The timer starts when the server receives the request, and when it runs out of time it just stops execution and prints the current stack trace. MobileFrontend can definitely be an offender here (parsing huge DOM structures), but there are definitely other parts of the code that just take time to process. For example, something could have consumed over 50 seconds of the request time before the MobileFormatter came into play. If that happens most of the time, I wouldn't worry about the MobileFrontend transforms. But if the MobileFrontend transforms take most of the request time, then it's definitely something we should look at.

For example, this is a filtered log from fatalmonitor right now: our requests time out in many places, and honestly it doesn't say a lot. We don't know what takes most of the time, only where it ends.

   104 Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.33.0-wmf.14/includes/libs/rdbms/database/DatabaseMysqli.php on line 46
    63 Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.33.0-wmf.14/includes/parser/Preprocessor_Hash.php on line 184
    18 Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.33.0-wmf.14/vendor/wikimedia/html-formatter/src/HtmlFormatter.php on line 81
    16 Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.33.0-wmf.14/includes/diff/TextSlotDiffRenderer.php on line 225
     7 Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.33.0-wmf.14/includes/http/CurlHttpRequest.php on line 36
     7 Fatal error: entire web request took longer than 60 seconds and timed out
     6 Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.33.0-wmf.14/includes/parser/Parser.php on line 2543
     5 Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.33.0-wmf.14/languages/LanguageConverter.php on line 423
     3 Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.33.0-wmf.14/includes/parser/Preprocessor_Hash.php on line 1761
     3 Fatal error: entire web request took longer than 200 seconds and timed out in /srv/mediawiki/php-1.33.0-wmf.14/extensions/Echo/includes/DiscussionParser.php on line 740

Performance-Team - is there a possibility to retrieve some profiling data from production, other than querying the web access log to find requests that failed and then using the browser extension to get profiling data while hoping that the given request times out?
Also, it looks like XHGui doesn't work; when I try to reach any data it ends up with a MongoException.

Accessing view-source:https://fr.wikisource.org/wiki/Recherches_arithm%C3%A9tiques/Section_cinqui%C3%A8me?forceprofile=1 with the X-Wikimedia-Debug Chrome extension gives a reasonable 1.3s time. The XHGUI data for that is at https://performance.wikimedia.org/xhgui/run/view?id=5c79abfb3f3dfa404ee10401 .

100.00% 1343.538      1 - main()
 99.99% 1343.442      1 - {internal}
 99.98% 1343.246      1 - run_init::/srv/mediawiki/php-1.33.0-wmf.19/index.php
 94.21% 1265.749      1 - MediaWiki::run
 93.40% 1254.828      1 - MediaWiki::main
 55.19% 741.538      1 - MediaWiki::__invoke
 55.19% 741.519      1 - OutputPage::output
 36.22% 486.619      1 - MediaWiki::performRequest
 35.98% 483.454      1 - MediaWiki::performAction
 35.82% 481.304      1 - ViewAction::show
 35.16% 472.397      1 - Article::view
 29.78% 400.120    852 - Hooks::callHook
 29.07% 390.561      1 - SkinTemplate::outputPage
 26.81% 360.136    130 - BagOStuff::get
 26.17% 351.585      1 - ParserCache::get
 26.00% 349.283      2 - MultiWriteBagOStuff::doGet
 23.89% 320.910     12 - Hooks::runWithoutAbort
 23.46% 315.150    119 - WANObjectCache::getWithSetCallback
 23.39% 314.227    108 - WANObjectCache::doGetWithSetCallback
 23.16% 311.223    114 - WANObjectCache::get
 23.12% 310.686    114 - WANObjectCache::getMulti
 23.01% 309.200    117 - MemcachedPeclBagOStuff::getMulti
 19.26% 258.800      1 - MultiWriteBagOStuff::doWrite
 19.26% 258.774      1 - MemcachedPeclBagOStuff::set
 19.12% 256.836      1 - MemcachedBagOStuff::set
 17.61% 236.572     14 - WikiPage::getContentModel
 17.57% 236.088     13 - WikiPage::getContentHandler
 17.53% 235.569     12 - Action::factory
 17.51% 235.295     11 - Action::getActionName
 17.51% 235.282     12 - WikiPage::getActionOverrides
 16.79% 225.593      1 - SkinTemplate::prepareQuickTemplate
 16.64% 223.559      1 - Wikibase\ClientHooks::onBeforePageDisplay
 12.73% 171.097    347 - Message::fetchMessage
 12.42% 166.916    283 - MessageCache::get
 11.54% 155.061      1 - VectorTemplate::execute

If I add &useformat=mobile, I get a timeout error (with or without XHGui on). I tried to enable logging instead and checked logstash for reqId:XHmuSApAIHsAAFfoTHgAAABY but didn't see anything interesting. The page is slow to load in my browser too, though most of that is rendering and not backend. MF is probably adding a lot of execution time on the backend to cause these timeouts.

I made a version of the page that extracts fewer DjVu pages (but still many) and tried to profile again:

https://fr.wikisource.org/w/index.php?title=Utilisateur:Aaron_Schulz/test&oldid=9556994&forceprofile=1&useformat=mobile

<!--
100.00% 39224.697      1 - main()
100.00% 39224.595      1 - {internal}
100.00% 39224.376      1 - run_init::/srv/mediawiki/php-1.33.0-wmf.25/index.php
 99.70% 39105.979      1 - MediaWiki::run
 99.68% 39097.906      1 - MediaWiki::main
 94.39% 37022.307    776 - Hooks::callHook
 93.63% 36726.145      1 - MediaWiki::performRequest
 93.62% 36722.396      1 - MediaWiki::performAction
 93.57% 36701.133      1 - ViewAction::show
 93.55% 36694.968      1 - Article::view
 92.88% 36430.579     21 - Hooks::runWithoutAbort
 92.74% 36378.231      1 - OutputPage::addParserOutput
 92.72% 36371.007      1 - OutputPage::addParserOutputText
 92.67% 36349.077      1 - MobileFrontendHooks::onOutputPageBeforeHTML
 91.96% 36069.566      1 - ExtMobileFrontend::domParse
 91.04% 35710.453      1 - MobileFormatter::filterContent
 89.83% 35235.226      2 - HtmlFormatter\HtmlFormatter::getDoc
  5.96% 2337.031      1 - MediaWiki::__invoke
  5.96% 2337.021      1 - OutputPage::output
  4.94% 1937.355      1 - SkinTemplate::outputPage
  2.43% 952.279      1 - SkinMinerva::prepareQuickTemplate
  1.99% 780.982      1 - MinervaTemplate::execute
  1.96% 769.385      1 - MinervaTemplate::render
  1.66% 652.168      3 - TemplateParser::processTemplate
  1.55% 607.203    948 - Hooks::run
  1.46% 574.478      1 - SkinTemplate::prepareQuickTemplate
  1.28% 501.857      1 - MinervaTemplate::getPageActionsHtml
  1.10% 430.802      3 - TemplateParser::getTemplate
  1.06% 416.854      1 - TemplateParser::compileForEval
  1.05% 412.184      1 - TemplateParser::compile
  1.04% 409.209      1 - LightnCandy::compile
  0.83% 324.608      1 - Skin::bottomScripts
  0.77% 301.819      1 - OutputPage::getBottomScripts
  0.74% 291.496      3 - OutputPage::getRlClient
  0.69% 270.931      4 - ResourceLoader::filter
  0.67% 264.226      1 - MobileFrontendSkinHooks::interimTogglingSupport

So I don't really think loading the parsed HTML into a DOM object like this is acceptable performance-wise, and I don't see any trick to make it fast.

The use of preg_replace_callback is worth considering, given that <img> tags have no nesting or inner text and that metacharacters in the attributes would be escaped. The function call would find <img> tags and run a callback on them to get the replacement value. The callback could turn each HTML string match into a tiny DOM element, manipulate it, and turn it back into HTML.
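
A sketch of that hybrid approach (the data-src manipulation shown is illustrative, not MobileFrontend's actual markup):

// Sketch: rewrite each <img> tag via regex, parsing only the tag itself as DOM.
$html = preg_replace_callback(
    '/<img\b[^>]*>/i', // <img> is a void element, so a single-tag match suffices
    function ( array $matches ) {
        $doc = new DOMDocument();
        // The XML prologue trick forces UTF-8; LIBXML_NOERROR silences fragment warnings.
        $doc->loadHTML( '<?xml encoding="UTF-8">' . $matches[0], LIBXML_NOERROR );
        $img = $doc->getElementsByTagName( 'img' )->item( 0 );
        if ( !$img ) {
            return $matches[0]; // leave unparseable tags alone
        }
        // Illustrative manipulation: move src aside for lazy loading.
        $img->setAttribute( 'data-src', $img->getAttribute( 'src' ) );
        $img->removeAttribute( 'src' );
        return $doc->saveHTML( $img ); // serialize just this element back to HTML
    },
    $html
);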

aaron removed aaron as the assignee of this task. Apr 12 2019, 12:50 AM
aaron raised the priority of this task from Low to Medium.
aaron subscribed.

The DOM parsing is used for a variety of things, not just images, and much of it is a little too complicated for preg_replace_callback :/ One solution might be to detect how "complex" a page is and not run the formatter at all on those pages (I'm not sure what we'd use to indicate that, but the number of images might be a good indicator). We might even add some kind of warning box at the top of the page indicating that the page may be unsuitable for mobile viewing.

Is there a way we could make use of the job queue for these kinds of pages, e.g. running the transformations in the background and updating the Varnish-cached HTML when complete, serving the unoptimised page in the meantime?
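
The shape of that would presumably be a regular job; a minimal sketch, with the job name, params, and caching strategy all invented for illustration:

// Sketch of a deferred-formatting job. 'mobileFormat' is a hypothetical name,
// and the caching/purging steps are the actual hard part.
class MobileFormatJob extends Job {
    public function __construct( Title $title, array $params ) {
        parent::__construct( 'mobileFormat', $title, $params );
    }

    public function run() {
        // 1. Fetch the parser output for $this->title.
        // 2. Run MobileFormatter::filterContent() over it.
        // 3. Store the result (e.g. keyed by revision id) and purge the CDN
        //    so the next request serves the optimised HTML.
        return true;
    }
}

// Registered via $wgJobClasses['mobileFormat'] = MobileFormatJob::class;
// and queued with JobQueueGroup::singleton()->push( new MobileFormatJob( $title, [] ) );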

What other things are there other than <img> tags? Maybe you could start moving the things that can avoid the DOM off it, and document the rest. As I described above, I don't mind hybrid regex/DOMElement logic. You could also have a simple state machine that reads the HTML body string looking for certain tags; when the corresponding closing tag is found, it records the positions and uses those to get chunks of string to parse as DOMElements. That would require some assumptions about the HTML being balanced/correct (I assume our Parser already handles that with Tidy/Remex). I'm thinking of something simple, kind of like Language::truncateHtml().
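
A sketch of that kind of scanner (assuming the parser emits balanced HTML, per Tidy/Remex; the function and its use are illustrative):

// Sketch: find the byte span of a balanced <$tag>...</$tag> without parsing
// the whole document, so just that chunk can be turned into a DOMElement.
function findBalancedSpan( string $html, string $tag, int $offset = 0 ): ?array {
    $start = stripos( $html, "<$tag", $offset );
    if ( $start === false ) {
        return null;
    }
    $depth = 0;
    $pos = $start;
    // Walk over opening/closing occurrences of the tag, tracking nesting depth.
    while ( preg_match( "#<(/?)$tag\b#i", $html, $m, PREG_OFFSET_CAPTURE, $pos ) ) {
        $pos = $m[0][1] + strlen( $m[0][0] );
        $depth += ( $m[1][0] === '/' ) ? -1 : 1;
        if ( $depth === 0 ) {
            $end = strpos( $html, '>', $pos ) + 1; // include the closing '>'
            return [ $start, $end ];
        }
    }
    return null; // unbalanced or truncated
}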

What other things are there other than <img> tags?

We swap lead paragraph with infobox, we remove nodes, we parse references, we reconstruct the main page (trying to deprecate this one though) and we wrap sections (soon to be done in the parser). I don't think we can commit to rewriting these to not use DOM right now - that's a considerable amount of work.

Links:

Right, so this isn't really about LazyImageTransform; rather, that is just one thing that uses the parsed DOM object. I guess the summary threw me off.

Jdlrobson raised the priority of this task from Medium to High.
Jdlrobson added a project: Spike.

Given your work with jank, are there certain rules we could bring in to decide whether to run a page through the MobileFormatter?

Options:

  • disable mobile formatting when a page is above a certain number of bytes/links/images
  • allow a magic word which disables mobile formatting for a page (a sketch of the wiring follows)
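
For the magic word, the usual double-underscore pattern could work; a sketch with an invented switch name (double-underscore switches found by the parser are recorded as page properties, which is how e.g. Disambiguator's __DISAMBIG__ works):

// Sketch: an invented __NOMOBILEFORMAT__ behaviour switch.
// (Also needs a magic word definition in the extension's Magic/i18n file.)
public static function onGetDoubleUnderscoreIDs( array &$ids ) {
    $ids[] = 'nomobileformat';
    return true;
}

// Later, before running the formatter, check the page property:
// if ( $parserOutput->getProperty( 'nomobileformat' ) !== false ) { /* skip */ }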

Cc @pmiazga let's turn this spike into an actionable

Restricted Application changed the subtype of this task from "Task" to "Spike". Jul 8 2019, 8:48 PM
Jdlrobson renamed this task from Some requests time out after 60 seconds in MobileFrontend transforms. What to do? to [Spike, 4hrs] Some requests time out after 60 seconds in MobileFrontend transforms. What to do?. Jul 15 2019, 6:43 PM
Jdlrobson removed Jdlrobson as the assignee of this task.
Jdlrobson updated the task description.
Jdlrobson moved this task from Needs Prioritization to Upcoming on the Web-Team-Backlog board.
ovasileva renamed this task from [Spike, 4hrs] Some requests time out after 60 seconds in MobileFrontend transforms. What to do? to [Spike, 8hrs] Some requests time out after 60 seconds in MobileFrontend transforms. What to do?. Jul 16 2019, 4:16 PM
mmodell changed the subtype of this task from "Spike" to "Production Error". Aug 28 2019, 11:09 PM

Last 7 days:

  • LazyImageTransform Timeouts: 7
  • MobileFormatter Timeouts: 10

For comparison, https://en.wikipedia.org/wiki/Barack_Obama:

  • 86 images, 73 headings, 16,612 total elements

The following URL counts were taken on the Vector desktop site (assuming that MobileFormatter/LazyImageTransform operates on this HTML):

LazyImageTransform Timeouts:

MobileFormatter Timeouts:

MobileFrontendHooks Timeouts:

Wowser! 9,991 headings! 21,303 images!!!

These all seem pretty high! It feels like counting occurrences of '<img ' in a document and disabling the MobileFormatter for anything over 1000 would go a long way, even more so if we can include headings in that count. The MobileFormatter will loop over every single one of them.
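
The check itself would be cheap; a sketch (the threshold is the one proposed above, and the early return stands in for however the formatter is actually bypassed):

// Sketch: cheap complexity guard before any DOM work.
$imgCount = substr_count( $html, '<img ' );
$headingCount = preg_match_all( '/<h[1-6][ >]/i', $html ); // returns the match count
if ( $imgCount + $headingCount > 1000 ) {
    // Too complex: skip the MobileFormatter and serve the parser HTML as-is.
    return $html;
}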

Jdlrobson claimed this task.

Thanks for looking into this @nray! Your analysis and proposal look good!

@nray as part of this I think it might make some sense to do some frontend performance profiling too (and possibly adjust the frontend accordingly). What do you think?

When looking at https://en.wikipedia.org/wiki/User:Jane023/Paintings_in_the_Metropolitan_Museum_of_Art my mobile browser freezes with both pages (due to not having enough memory).
At a glance it does look like the new version is more performant (presumably because of the section collapsing and lazy-loading of images).

@Jdlrobson I'd be interested in profiling it and seeing what we find. Also, how are you comparing it to the old version?

What phone are you using?

On my Pixel 5, it takes seconds to paint the page after scrolling, which makes the page pretty unusable. That page has 172,055 elements, which is well over the recommended limit of 1,500 DOM elements, so it makes sense that style recalcs as you scroll would be very costly.

I was testing this locally on a Samsung Galaxy. You can compare origin/wmf/1.40.0-wmf.1 against 1.39.0-wmf.28