
[Spike, 8hrs] Some requests time out after 60 seconds in MobileFrontend transforms. What to do?
Open, High, Public, Spike

Description

Until recently, PHP web requests had no timeout limit. Tim Starling fixed this. The new limit exposes a few problems with certain pages rendering on mobile.

LazyImageTransform times out on complicated pages. Our error log says:

Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.32.0-wmf.20/extensions/MobileFrontend/includes/transforms/LazyImageTransform.php on line 177

or

PHP Fatal Error from line 112 of /srv/mediawiki/php-1.32.0-wmf.20/extensions/MobileFrontend/includes/MobileContext.php: entire web request took longer than 60 seconds and timed out

Example stacktrace:

#0 /srv/mediawiki/php-1.32.0-wmf.20/extensions/MobileFrontend/includes/MobileContext.php(112): NO_FUNCTION_GIVEN()
#1 /srv/mediawiki/php-1.32.0-wmf.20/extensions/MobileFrontend/includes/MobileFrontend.hooks.php(513): MobileContext::singleton()
#2 /srv/mediawiki/php-1.32.0-wmf.20/extensions/MobileFrontend/includes/MobileFrontend.hooks.php(541): MobileFrontendHooks::shouldMobileFormatSpecialPages(User)
#3 /srv/mediawiki/php-1.32.0-wmf.20/includes/Hooks.php(174): MobileFrontendHooks::onSpecialPageInitList(array)
#4 /srv/mediawiki/php-1.32.0-wmf.20/includes/Hooks.php(202): Hooks::callHook(string, array, array, NULL)
#5 /srv/mediawiki/php-1.32.0-wmf.20/includes/specialpage/SpecialPageFactory.php(281): Hooks::run(string, array)
#6 /srv/mediawiki/php-1.32.0-wmf.20/includes/specialpage/SpecialPageFactory.php(297): MediaWiki\Special\SpecialPageFactory->getPageList()
#7 /srv/mediawiki/php-1.32.0-wmf.20/includes/specialpage/SpecialPageFactory.php(662): MediaWiki\Special\SpecialPageFactory->getAliasList()
#8 /srv/mediawiki/php-1.32.0-wmf.20/includes/specialpage/SpecialPage.php(99): MediaWiki\Special\SpecialPageFactory->getLocalNameFor(string, boolean)
#9 /srv/mediawiki/php-1.32.0-wmf.20/includes/specialpage/SpecialPage.php(85): SpecialPage::getTitleValueFor(string, boolean, string)
#10 /srv/mediawiki/php-1.32.0-wmf.20/skins/MinervaNeue/includes/skins/SkinMinerva.php(620): SpecialPage::getTitleFor(string)
#11 /srv/mediawiki/php-1.32.0-wmf.20/skins/MinervaNeue/includes/skins/SkinMinerva.php(1248): SkinMinerva->getDiscoveryTools()
#12 /srv/mediawiki/php-1.32.0-wmf.20/skins/MinervaNeue/includes/skins/SkinMinerva.php(1269): SkinMinerva->getMenuData()
#13 /srv/mediawiki/php-1.32.0-wmf.20/skins/MinervaNeue/includes/skins/SkinMinerva.php(342): SkinMinerva->getSkinConfigVariables()
#14 /srv/mediawiki/php-1.32.0-wmf.20/includes/skins/SkinTemplate.php(224): SkinMinerva->initPage(OutputPage)
#15 /srv/mediawiki/php-1.32.0-wmf.20/includes/OutputPage.php(2402): SkinTemplate->outputPage()
#16 /srv/mediawiki/php-1.32.0-wmf.20/includes/MediaWiki.php(875): OutputPage->output(boolean)
#17 /srv/mediawiki/php-1.32.0-wmf.20/includes/MediaWiki.php(887): Closure$MediaWiki::main()
#18 /srv/mediawiki/php-1.32.0-wmf.20/includes/MediaWiki.php(525): MediaWiki->main()
#19 /srv/mediawiki/php-1.32.0-wmf.20/index.php(42): MediaWiki->run()
#20 /srv/mediawiki/w/index.php(3): include(string)
#21 {main}

Between September 10th and September 17th, there were 42 occurrences of this timed-out request.

Developer notes

Options

Drop support for certain namespaces

We could skip running the LazyImageTransform on pages with this many images.
Given that this only impacts user pages and the lazy image transform was written predominantly for articles in the main namespace, I think it's acceptable to limit this transform to NS_MAIN or to blacklist the NS_USER namespace (a rough sketch follows the acceptance criteria below).

Trade-offs: the transform is currently working fine in all of those other namespaces.

Acceptance criteria

  • The LazyImageTransform only runs on namespaces it has been configured to run on.
  • It does not run on NS_USER
  • It runs on NS_MAIN (main namespace)
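
As a rough illustration of this option (not the actual MobileFrontend code; the setting name MFLazyImageTransformNamespaces is made up for this sketch), the namespace gate could look something like:

/**
 * Sketch only: decide whether the lazy-image transform should run for a given
 * title, based on a configurable namespace whitelist. NS_MAIN and NS_USER are
 * core MediaWiki namespace constants; the setting is hypothetical, e.g.
 * $wgMFLazyImageTransformNamespaces = [ NS_MAIN ];
 */
function shouldApplyLazyImageTransform( Title $title, array $allowedNamespaces ) {
    return in_array( $title->getNamespace(), $allowedNamespaces, true );
}

// At the point where MobileFrontend assembles its transforms (illustrative only):
// if ( shouldApplyLazyImageTransform( $title, $config->get( 'MFLazyImageTransformNamespaces' ) ) ) {
//     $transforms[] = new LazyImageTransform( /* existing constructor args */ );
// }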

Limit number of images

A riskier change: we could count the images on a page and skip the transform above some threshold (sketched below). This would help this one case, but there may be other reasons the MobileFormatter (which is expensive) doesn't finish.
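
A minimal sketch of such a pre-check, assuming a regex count over the already-rendered HTML (the threshold and where the check would live in MobileFormatter are assumptions):

/**
 * Sketch only: cheaply count <img> tags before doing any expensive DOM work.
 * 1000 is an arbitrary example threshold.
 */
function hasTooManyImages( $html, $maxImages = 1000 ) {
    // preg_match_all() returns the number of matches; <img> tags never nest,
    // so a flat pattern is enough for counting.
    $count = preg_match_all( '/<img\b/i', $html );
    return $count !== false && $count > $maxImages;
}

// if ( hasTooManyImages( $html ) ) { /* skip the MobileFormatter / LazyImageTransform */ }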

Category/title blacklist

We could skip running the formatter on pages that carry a special hidden category marking them as unsuitable for mobile formatting (sketched below). This would give us a way to opt out pages with performance issues like this one.
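
One possible shape for that check (the category name is hypothetical; ParserOutput::getCategories() returns a map keyed by category DB key, at least in this MediaWiki version):

/**
 * Sketch only: skip mobile formatting when the rendered page is in a hidden
 * marker category, e.g. [[Category:Pages unsuitable for mobile formatting]].
 */
function isExcludedFromMobileFormatting( ParserOutput $parserOutput ) {
    $marker = 'Pages_unsuitable_for_mobile_formatting'; // DB-key form of the hypothetical category
    return array_key_exists( $marker, $parserOutput->getCategories() );
}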

Spike questions

Output: a comment on this task. As part of sign-off, a concrete actionable task should be created.

Using logstash

  • For the pages that exhibit the problem, is there anything common to them, e.g. namespace, number of images, or size of document?
  • How many pages are we talking about?

Techniques for disabling formatting

  • Explore techniques for identifying and disabling the MobileFormatter on large pages.
    • Can we count the number of images? How? With a regex? Are there any page properties we can use?
  • Is there any way we can defer the page formatting to the job queue?

Event Timeline

pmiazga created this task. Sep 17 2018, 8:55 PM
Restricted Application added a subscriber: Aklapper. Sep 17 2018, 8:55 PM

This relates to Tim Starling's email: https://lists.wikimedia.org/pipermail/wikitech-l/2018-September/090803.html

Given the number of images on the page, it is too complicated to be processed by MobileFormatter.
Does the same issue happen on desktop views?

What can be done about this (if anything)?
I could imagine us checking the complexity of the DOM and refusing to run the MobileFormatter on such pages, or maintaining a blacklist.

A timeout seems appropriate here, not a bug.

Jdlrobson updated the task description. Sep 17 2018, 10:11 PM
Jdlrobson updated the task description.
Jdlrobson added a subscriber: Niedzielski.

@Niedzielski and @pmiazga can you sync and check these errors relate to the same page?

Jdlrobson renamed this task from Request takes over 60 seconds to process the page to Some requests takes over 60 seconds to be processed by MobileFormatter, thus will not display. Sep 17 2018, 10:12 PM

@Jdlrobson, what are we syncing on? I saw errors for https://en.wikipedia.org/wiki/User:West.andrew.g/2013_popular_pages which has some 11k images.

@pmiazga is that the page you were seeing timeout errors on?

Maybe there is a faster/less resource-heavy image transformation; that's the first thing to check. Second, maybe we can just process the first X images and then skip the rest. I agree, a page with 11k images is huge, and most probably we should skip transformation for such pages.
Do we know what the average image count is?

I agree, a page with 11k images is huge, and most probably we should skip transformation for such pages.

Since we support skipping small images, why not skip pages with lots of images too? I don't think we should spend a lot of time optimizing our code for this one page unless there are numerous other pages.

yes, we're skipping small images but somehow it doesn't work. We can add a simple check so that only the first X images are transformed; that should be pretty easy.
I checked the code and it's pretty simple, but performance is most probably pretty bad (we're doing many DOM operations on a pretty big DOMDocument). I'll do some small research, try doing things differently (such as cloning elements instead of creating a new one every time), and do some performance testing using https://en.wikipedia.org/wiki/User:West.andrew.g/2013_popular_pages

yes, we're skipping small images

I think we support skipping small images, but it's currently disabled.

Jdlrobson triaged this task as Normal priority. Sep 19 2018, 4:25 PM
Jdlrobson renamed this task from Some requests takes over 60 seconds to be processed by MobileFormatter, thus will not display to Some requests takes over 60 seconds to be processed by MobileFormatter when page has lots of images thus will not display. Sep 19 2018, 5:39 PM
Jdlrobson updated the task description.
Jdlrobson updated the task description.
Jdlrobson moved this task from Needs Analysis to Triaged but Future on the Readers-Web-Backlog board.

The easiest thing to do here is just turn off the transform on the talk namespace and then re-evaluate. I'm guessing this is not going to be a problem on real world articles as they'd be too slow to load and would be unusable.

Let me know if you have a problem with this approach.

I'd like to add one more acceptance criterion: process only the first 1000 images, then skip the rest. Anyone against?

Jdlrobson renamed this task from Some requests takes over 60 seconds to be processed by MobileFormatter when page has lots of images thus will not display to Disable LazyImageTransform on user namespace: Some requests takes over 60 seconds to be processed by MobileFormatter when page has lots of images thus will not display. Sep 19 2018, 5:41 PM
Jdlrobson updated the task description.
Jdlrobson renamed this task from Disable LazyImageTransform on user namespace: Some requests takes over 60 seconds to be processed by MobileFormatter when page has lots of images thus will not display to Some requests takes over 60 seconds to be processed by MobileFormatter when page has lots of images thus will not display. Sep 19 2018, 6:13 PM
Jdlrobson lowered the priority of this task from Normal to Low (edited). Sep 19 2018, 6:15 PM
Jdlrobson updated the task description.

Since there is only one page with timeout issues across all of our projects, I'm marking this as low.
All solutions seem kinda risky right now.

Other possible solutions have been added to the description for future reference.

Jdlrobson renamed this task from Some requests takes over 60 seconds to be processed by MobileFormatter when page has lots of images thus will not display to Some requests takes over 60 seconds to be processed: what to do?. Sep 20 2018, 11:21 PM

It's not just this one page; I'm seeing the issue in a number of places.
Turning large documents into DOM trees is expensive, so that doesn't really surprise me for MobileViewAPI and MobileFormatter.

[W6OxIgrAEEYAAAkGj@4AAAAN] /wiki/Alamitos_Beach,_Long_Beach,_California PHP Fatal Error from line 239 of /srv/mediawiki/php-1.32.0-wmf.22/extensions/MobileFrontend/includes/transforms/MoveLeadParagraphTransform.php: entire web request took longer than 60 seconds and timed out

[W6Ow7wrAEG4AADmleyMAAAAS] /w/api.php?action=mobileview&format=json&formatversion=2&prop=text%7Csections%7Clanguagecount%7Cthumb%7Cimage%7Cid%7Crevision%7Cdescription%7Clastmodified%7Cnormalizedtitle%7Cdisplaytitle%7Cprotection%7Ceditable&onlyrequestedsections=1&sections=0&sectionprop=toclevel%7Cline%7Canchor&noheadings=true&page=Liste_des_villes_d%27Inde&thumbsize=799 PHP Fatal Error from line 346 of /srv/mediawiki/php-1.32.0-wmf.22/extensions/MobileFrontend/includes/api/ApiMobileView.php: entire web request took longer than 60 seconds and timed out

Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.32.0-wmf.22/extensions/WikidataPageBanner/includes/WikidataPageBanner.functions.php on line 342
Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.32.0-wmf.22/extensions/MobileFrontend/includes/specials/MobileSpecialPageFeed.php on line 54

Not sure what to do here. The server is doing the right thing by stopping these requests.

I believe this is being tracked more broadly by RelEng in T204871.

It seems that such a problem occurs on French Wikisource with some mathematical books that consist almost entirely of ''math'' tags. This chapter is unavailable; it includes more than 15k ''math'' calls. What to do?

Krinkle added a subscriber: Krinkle.

Detaching from T204871, which is about requests that are normally fast being slow in the first minutes after a deployment (regardless of whether they are killed by an execution timeout).

This task, though, is about certain MobileFrontend requests that have presumably always been slow in production, but which, as of last year, are being killed by the new execution timeout.

Krinkle renamed this task from Some requests takes over 60 seconds to be processed: what to do? to Some requests time out after 60 seconds in MobileFrontend transforms. What to do?. Feb 6 2019, 3:07 AM

@Krinkle Just to clarify: the problem we are affected by is present on the ''normal'' interface, not (only) with MobileFrontend requests.

pmiazga added a comment (edited). Feb 6 2019, 2:47 PM

I'd love to find out what takes most of the time during the web request. The error message does not mean that processing that single piece of code took more than 60 seconds. The timer starts when the server receives the request, and when it runs out of time it just stops execution and prints the current stack trace. MobileFrontend can definitely be an offender here (parsing huge DOM structures), but there are certainly other parts of the code that also take time to process. For example, something could have consumed over 50 seconds of the request time before the MobileFormatter came into play. If that happens most of the time, I wouldn't worry about the MobileFrontend transforms. But if the MobileFrontend transforms take most of the request time, then it's definitely something we should look at.

For example, here is a filtered log from fatalmonitor right now. Our requests time out in many places, and honestly it doesn't say a lot: we don't know what takes most of the time, only where it ends.

104 Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.33.0-wmf.14/includes/libs/rdbms/database/DatabaseMysqli.php on line 46
    63 Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.33.0-wmf.14/includes/parser/Preprocessor_Hash.php on line 184
    18 Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.33.0-wmf.14/vendor/wikimedia/html-formatter/src/HtmlFormatter.php on line 81
    16 Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.33.0-wmf.14/includes/diff/TextSlotDiffRenderer.php on line 225
     7 Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.33.0-wmf.14/includes/http/CurlHttpRequest.php on line 36
     7 Fatal error: entire web request took longer than 60 seconds and timed out
     6 Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.33.0-wmf.14/includes/parser/Parser.php on line 2543
     5 Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.33.0-wmf.14/languages/LanguageConverter.php on line 423
     3 Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.33.0-wmf.14/includes/parser/Preprocessor_Hash.php on line 1761
     3 Fatal error: entire web request took longer than 200 seconds and timed out in /srv/mediawiki/php-1.33.0-wmf.14/extensions/Echo/includes/DiscussionParser.php on line 740

Performance-Team - Is there a way to retrieve profiling data from production, other than me querying the web access log to find requests that failed, then using the browser extension to get profiling data and hoping that the given request times out?
Also, it looks like XHGui doesn't work; when I try to reach any data it ends up with a MongoException.

kchapman assigned this task to aaron. Feb 11 2019, 9:34 PM
kchapman moved this task from Inbox to Backlog: Small & Maintenance on the Performance-Team board.
aaron added a comment. Mar 1 2019, 10:36 PM

Accessing view-source:https://fr.wikisource.org/wiki/Recherches_arithm%C3%A9tiques/Section_cinqui%C3%A8me?forceprofile=1 with the X-Wikimedia-Debug Chrome extension gives a reasonable 1.3s time. The XHGUI data for that is at https://performance.wikimedia.org/xhgui/run/view?id=5c79abfb3f3dfa404ee10401 .

100.00% 1343.538      1 - main()
 99.99% 1343.442      1 - {internal}
 99.98% 1343.246      1 - run_init::/srv/mediawiki/php-1.33.0-wmf.19/index.php
 94.21% 1265.749      1 - MediaWiki::run
 93.40% 1254.828      1 - MediaWiki::main
 55.19% 741.538      1 - MediaWiki::__invoke
 55.19% 741.519      1 - OutputPage::output
 36.22% 486.619      1 - MediaWiki::performRequest
 35.98% 483.454      1 - MediaWiki::performAction
 35.82% 481.304      1 - ViewAction::show
 35.16% 472.397      1 - Article::view
 29.78% 400.120    852 - Hooks::callHook
 29.07% 390.561      1 - SkinTemplate::outputPage
 26.81% 360.136    130 - BagOStuff::get
 26.17% 351.585      1 - ParserCache::get
 26.00% 349.283      2 - MultiWriteBagOStuff::doGet
 23.89% 320.910     12 - Hooks::runWithoutAbort
 23.46% 315.150    119 - WANObjectCache::getWithSetCallback
 23.39% 314.227    108 - WANObjectCache::doGetWithSetCallback
 23.16% 311.223    114 - WANObjectCache::get
 23.12% 310.686    114 - WANObjectCache::getMulti
 23.01% 309.200    117 - MemcachedPeclBagOStuff::getMulti
 19.26% 258.800      1 - MultiWriteBagOStuff::doWrite
 19.26% 258.774      1 - MemcachedPeclBagOStuff::set
 19.12% 256.836      1 - MemcachedBagOStuff::set
 17.61% 236.572     14 - WikiPage::getContentModel
 17.57% 236.088     13 - WikiPage::getContentHandler
 17.53% 235.569     12 - Action::factory
 17.51% 235.295     11 - Action::getActionName
 17.51% 235.282     12 - WikiPage::getActionOverrides
 16.79% 225.593      1 - SkinTemplate::prepareQuickTemplate
 16.64% 223.559      1 - Wikibase\ClientHooks::onBeforePageDisplay
 12.73% 171.097    347 - Message::fetchMessage
 12.42% 166.916    283 - MessageCache::get
 11.54% 155.061      1 - VectorTemplate::execute

If I add &useformat=mobile I get a timeout error (with or without xhgui on). I tried enabling logging instead and checked logstash for reqId:XHmuSApAIHsAAFfoTHgAAABY but didn't see anything interesting. The page is slow to load in my browser, though most of that is rendering and not backend. MF is probably adding a lot of execution time on the backend to cause these timeouts.

aaron added a comment. Apr 11 2019, 6:13 PM

I made a version of the page that extracts fewer djvu pages (but still many) and tried to profile again:

https://fr.wikisource.org/w/index.php?title=Utilisateur:Aaron_Schulz/test&oldid=9556994&forceprofile=1&useformat=mobile

<!--
100.00% 39224.697      1 - main()
100.00% 39224.595      1 - {internal}
100.00% 39224.376      1 - run_init::/srv/mediawiki/php-1.33.0-wmf.25/index.php
 99.70% 39105.979      1 - MediaWiki::run
 99.68% 39097.906      1 - MediaWiki::main
 94.39% 37022.307    776 - Hooks::callHook
 93.63% 36726.145      1 - MediaWiki::performRequest
 93.62% 36722.396      1 - MediaWiki::performAction
 93.57% 36701.133      1 - ViewAction::show
 93.55% 36694.968      1 - Article::view
 92.88% 36430.579     21 - Hooks::runWithoutAbort
 92.74% 36378.231      1 - OutputPage::addParserOutput
 92.72% 36371.007      1 - OutputPage::addParserOutputText
 92.67% 36349.077      1 - MobileFrontendHooks::onOutputPageBeforeHTML
 91.96% 36069.566      1 - ExtMobileFrontend::domParse
 91.04% 35710.453      1 - MobileFormatter::filterContent
 89.83% 35235.226      2 - HtmlFormatter\HtmlFormatter::getDoc
  5.96% 2337.031      1 - MediaWiki::__invoke
  5.96% 2337.021      1 - OutputPage::output
  4.94% 1937.355      1 - SkinTemplate::outputPage
  2.43% 952.279      1 - SkinMinerva::prepareQuickTemplate
  1.99% 780.982      1 - MinervaTemplate::execute
  1.96% 769.385      1 - MinervaTemplate::render
  1.66% 652.168      3 - TemplateParser::processTemplate
  1.55% 607.203    948 - Hooks::run
  1.46% 574.478      1 - SkinTemplate::prepareQuickTemplate
  1.28% 501.857      1 - MinervaTemplate::getPageActionsHtml
  1.10% 430.802      3 - TemplateParser::getTemplate
  1.06% 416.854      1 - TemplateParser::compileForEval
  1.05% 412.184      1 - TemplateParser::compile
  1.04% 409.209      1 - LightnCandy::compile
  0.83% 324.608      1 - Skin::bottomScripts
  0.77% 301.819      1 - OutputPage::getBottomScripts
  0.74% 291.496      3 - OutputPage::getRlClient
  0.69% 270.931      4 - ResourceLoader::filter
  0.67% 264.226      1 - MobileFrontendSkinHooks::interimTogglingSupport

So, I don't think loading the parsed HTML into a DOM object like this is acceptable performance-wise, and I don't see any trick to make it fast.

The use of preg_replace_callback is worth considering, given that <img> tags have neither nesting nor inner text and that metacharacters in the attributes would be escaped. The call would find <img> tags and run a callback on them to get the replacement value. The callback could turn each HTML string match into a tiny DOM element, manipulate it, and turn it back into HTML.
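
A rough sketch of that hybrid approach (not the existing LazyImageTransform code; the data-src/placeholder-class rewrite is illustrative):

/**
 * Sketch only: find <img> tags with a regex, parse each match into a tiny
 * DOMDocument, rewrite it for lazy loading, and serialise it back.
 */
function lazyRewriteImages( $html ) {
    return preg_replace_callback(
        // <img> has neither nesting nor inner text, so a flat match suffices.
        '/<img\b[^>]*\/?>/i',
        function ( array $m ) {
            libxml_use_internal_errors( true );
            $doc = new DOMDocument();
            // The XML prolog trick makes loadHTML() treat the fragment as UTF-8.
            $doc->loadHTML( '<?xml encoding="utf-8"?>' . $m[0] );
            $img = $doc->getElementsByTagName( 'img' )->item( 0 );
            if ( !$img ) {
                return $m[0]; // leave anything unparseable untouched
            }
            // Defer loading: move src into a data attribute and add a placeholder class.
            $img->setAttribute( 'data-src', $img->getAttribute( 'src' ) );
            $img->removeAttribute( 'src' );
            $img->setAttribute( 'class', trim( $img->getAttribute( 'class' ) . ' lazy-image-placeholder' ) );
            return $doc->saveHTML( $img );
        },
        $html
    );
}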

aaron removed aaron as the assignee of this task. Apr 12 2019, 12:50 AM
aaron raised the priority of this task from Low to Normal.
aaron added a subscriber: aaron.

The DOM parsing is used for a variety of things, not just images, and much of it is a little too complicated for preg_replace_callback :/ One solution might be to detect how "complex" a page is and not run the formatter at all on those pages (I'm not sure what we'd use to indicate that, but the number of images might be a good indicator). We might even add some kind of warning box at the top of the page indicating the page may be unsuitable for mobile viewing.

Is there a way we could make use of the job queue for these kinds of pages, e.g. running the transformations in the background and updating the Varnish-cached HTML when complete, serving the unoptimised page in the meantime?

aaron added a comment. Apr 24 2019, 9:05 AM

What other things are there other than <img> tags? Maybe you could start moving off those things that can be changed to avoid the DOM and document the rest. As I described above, I don't mind hybrid regex/DOMElement logic. You could also have a simple state machine that reads the HTML body string, looking for certain tags; when the corresponding terminating tag is found, it records the positions and uses those to get chunks of string to parse as DOMElements. That would require some assumptions about the HTML being balanced/correct (I assume our Parser should handle that already with Tidy/Remex). I'm thinking of something simple, kind of like Language::truncateHtml().
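
A very small sketch of that chunking idea, assuming balanced parser output (the tag name and return shape are illustrative):

/**
 * Sketch only: scan the HTML string for a balanced <section>...</section> span
 * and return its offsets, so only that chunk needs DOM parsing.
 */
function findBalancedTagSpan( $html, $tag, $offset = 0 ) {
    $open = '<' . $tag;
    $close = '</' . $tag . '>';
    $start = stripos( $html, $open, $offset );
    if ( $start === false ) {
        return null;
    }
    $pos = $start;
    $depth = 0;
    while ( true ) {
        $nextOpen = stripos( $html, $open, $pos + 1 );
        $nextClose = stripos( $html, $close, $pos + 1 );
        if ( $nextClose === false ) {
            return null; // unbalanced markup, give up
        }
        if ( $nextOpen !== false && $nextOpen < $nextClose ) {
            $depth++;           // nested opening tag
            $pos = $nextOpen;
        } elseif ( $depth === 0 ) {
            return [ $start, $nextClose + strlen( $close ) ]; // [ start, end ) byte offsets
        } else {
            $depth--;           // closing tag for a nested element
            $pos = $nextClose;
        }
    }
}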

What other things are there other than <img> tags?

We swap the lead paragraph with the infobox, we remove nodes, we parse references, we reconstruct the main page (though we're trying to deprecate this one) and we wrap sections (soon to be done in the parser). I don't think we can commit to rewriting these to not use the DOM right now - that's a considerable amount of work.

Links:

Right, so this isn't really about LazyImageTransform; rather, that is just one of the things that uses the parsed DOM object. I guess the summary threw me off.

Jdlrobson assigned this task to nray. Mon, Jul 8, 8:48 PM
Jdlrobson raised the priority of this task from Normal to High.
Jdlrobson added a project: Spike.

Given your work with jank, are there certain rules we could bring in to decide whether to run a page through the MobileFormatter?

Options:

  • disable mobile formatting when a page is above a certain number of bytes/links/images
  • allow a magic word which disables mobile formatting for a page (a rough sketch of the magic-word approach follows below).
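
A hedged sketch of the magic-word option (the GetDoubleUnderscoreIDs hook and PageProps exist in MediaWiki core, but the switch name, its i18n registration and where the check would be called are assumptions):

class HypotheticalMobileFormatHooks {
    /**
     * Behaviour switches registered here set a page property of the same name
     * when the parser encounters __NOMOBILEFORMAT__ on a page; the switch would
     * also need an entry in the extension's magic-word i18n file.
     */
    public static function onGetDoubleUnderscoreIDs( array &$doubleUnderscoreIDs ) {
        $doubleUnderscoreIDs[] = 'nomobileformat';
    }

    /** Sketch only: check the stored page property before running the MobileFormatter. */
    public static function shouldSkipMobileFormatting( Title $title ) {
        $props = PageProps::getInstance()->getProperties( $title, 'nomobileformat' );
        return $props !== [];
    }
}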

Cc @pmiazga: let's turn this spike into an actionable task.

Restricted Application changed the subtype of this task from "Task" to "Spike". Mon, Jul 8, 8:48 PM
ovasileva reassigned this task from nray to Jdlrobson. Mon, Jul 15, 6:18 PM
ovasileva added a subscriber: nray.
Jdlrobson renamed this task from Some requests time out after 60 seconds in MobileFrontend transforms. What to do? to [Spike, 4hrs] Some requests time out after 60 seconds in MobileFrontend transforms. What to do?. Mon, Jul 15, 6:43 PM
Jdlrobson removed Jdlrobson as the assignee of this task.
Jdlrobson updated the task description.
Jdlrobson moved this task from Needs Analysis to Upcoming on the Readers-Web-Backlog board.
ovasileva renamed this task from [Spike, 4hrs] Some requests time out after 60 seconds in MobileFrontend transforms. What to do? to [Spike, 8hrs] Some requests time out after 60 seconds in MobileFrontend transforms. What to do?. Tue, Jul 16, 4:16 PM