
Wikimedia\RequestTimeout\RequestTimeoutException on de:Holomorphe_Funktion and several other math-heavy articles
Closed, ResolvedPublicPRODUCTION ERROR

Description

Accessing https://de.wikipedia.org/wiki/Holomorphe_Funktion throws

[86a08c26-0800-4cae-8d13-752e69516a5d] 2023-07-12 09:40:57: Fataler Ausnahmefehler des Typs „Wikimedia\RequestTimeout\RequestTimeoutException“

The maximum execution time of 60 seconds was exceeded

from /srv/mediawiki/php-1.41.0-wmf.16/vendor/wikimedia/request-timeout/src/Detail/ExcimerTimerWrapper.php(97)
#0 /srv/mediawiki/php-1.41.0-wmf.16/vendor/wikimedia/request-timeout/src/Detail/ExcimerTimerWrapper.php(72): Wikimedia\RequestTimeout\Detail\ExcimerTimerWrapper->onTimeout(integer)
#1 /srv/mediawiki/php-1.41.0-wmf.16/includes/libs/http/MultiHttpClient.php(281): Wikimedia\RequestTimeout\Detail\ExcimerTimerWrapper->Wikimedia\RequestTimeout\Detail\{closure}(integer)
#2 /srv/mediawiki/php-1.41.0-wmf.16/includes/libs/http/MultiHttpClient.php(217): MultiHttpClient->runMultiCurl(array, array)
#3 /srv/mediawiki/php-1.41.0-wmf.16/extensions/Math/src/MathRestbaseInterface.php(180): MultiHttpClient->runMulti(array)
#4 /srv/mediawiki/php-1.41.0-wmf.16/extensions/Math/src/MathMathML.php(108): MediaWiki\Extension\Math\MathRestbaseInterface::batchEvaluate(array)
#5 /srv/mediawiki/php-1.41.0-wmf.16/extensions/Math/src/HookHandlers/ParserHooksHandler.php(158): MediaWiki\Extension\Math\MathMathML::batchEvaluate(array)
#6 /srv/mediawiki/php-1.41.0-wmf.16/includes/HookContainer/HookContainer.php(153): MediaWiki\Extension\Math\HookHandlers\ParserHooksHandler->onParserAfterTidy(Parser, string)
#7 /srv/mediawiki/php-1.41.0-wmf.16/includes/HookContainer/HookRunner.php(2904): MediaWiki\HookContainer\HookContainer->run(string, array)
#8 /srv/mediawiki/php-1.41.0-wmf.16/includes/parser/Parser.php(1734): MediaWiki\HookContainer\HookRunner->onParserAfterTidy(Parser, string)
#9 /srv/mediawiki/php-1.41.0-wmf.16/includes/parser/Parser.php(703): Parser->internalParseHalfParsed(string, boolean, boolean)
#10 /srv/mediawiki/php-1.41.0-wmf.16/includes/language/MessageCache.php(1484): Parser->parse(string, MediaWiki\Title\Title, ParserOptions, boolean)
#11 /srv/mediawiki/php-1.41.0-wmf.16/includes/language/Message.php(1428): MessageCache->parse(string, MediaWiki\Title\Title, boolean, boolean, Language)
#12 /srv/mediawiki/php-1.41.0-wmf.16/includes/language/Message.php(1005): Message->parseText(string)
#13 /srv/mediawiki/php-1.41.0-wmf.16/includes/language/Message.php(1043): Message->format(string)
#14 /srv/mediawiki/php-1.41.0-wmf.16/extensions/Cite/src/ReferencesFormatter.php(185): Message->parse()
#15 /srv/mediawiki/php-1.41.0-wmf.16/extensions/Cite/src/ReferencesFormatter.php(120): Cite\ReferencesFormatter->formatListItem(Parser, integer, array, boolean)
#16 /srv/mediawiki/php-1.41.0-wmf.16/extensions/Cite/src/ReferencesFormatter.php(53): Cite\ReferencesFormatter->formatRefsList(Parser, array, boolean)
#17 /srv/mediawiki/php-1.41.0-wmf.16/extensions/Cite/src/Cite.php(479): Cite\ReferencesFormatter->formatReferences(Parser, array, boolean, boolean)
#18 /srv/mediawiki/php-1.41.0-wmf.16/extensions/Cite/src/Cite.php(451): Cite\Cite->formatReferences(Parser, string, NULL)
#19 /srv/mediawiki/php-1.41.0-wmf.16/extensions/Cite/src/Cite.php(398): Cite\Cite->guardedReferences(Parser, NULL, array)
#20 /srv/mediawiki/php-1.41.0-wmf.16/extensions/Cite/src/Hooks/CiteParserTagHooks.php(72): Cite\Cite->references(Parser, NULL, array)
#21 /srv/mediawiki/php-1.41.0-wmf.16/includes/parser/Parser.php(3986): Cite\Hooks\CiteParserTagHooks::references(NULL, array, Parser, PPFrame_Hash)
#22 /srv/mediawiki/php-1.41.0-wmf.16/includes/parser/PPFrame_Hash.php(352): Parser->extensionSubstitution(array, PPFrame_Hash, boolean)
#23 /srv/mediawiki/php-1.41.0-wmf.16/includes/parser/Parser.php(2932): PPFrame_Hash->expand(PPNode_Hash_Tree, integer)
#24 /srv/mediawiki/php-1.41.0-wmf.16/includes/parser/Parser.php(1588): Parser->replaceVariables(string)
#25 /srv/mediawiki/php-1.41.0-wmf.16/includes/parser/Parser.php(700): Parser->internalParse(string)
#26 /srv/mediawiki/php-1.41.0-wmf.16/includes/content/WikitextContentHandler.php(357): Parser->parse(string, MediaWiki\Title\Title, ParserOptions, boolean, boolean, integer)
#27 /srv/mediawiki/php-1.41.0-wmf.16/includes/content/ContentHandler.php(1764): WikitextContentHandler->fillParserOutput(WikitextContent, MediaWiki\Content\Renderer\ContentParseParams, ParserOutput)
#28 /srv/mediawiki/php-1.41.0-wmf.16/includes/content/Renderer/ContentRenderer.php(47): ContentHandler->getParserOutput(WikitextContent, MediaWiki\Content\Renderer\ContentParseParams)
#29 /srv/mediawiki/php-1.41.0-wmf.16/includes/Revision/RenderedRevision.php(260): MediaWiki\Content\Renderer\ContentRenderer->getParserOutput(WikitextContent, MediaWiki\Title\Title, integer, ParserOptions, boolean)
#30 /srv/mediawiki/php-1.41.0-wmf.16/includes/Revision/RenderedRevision.php(232): MediaWiki\Revision\RenderedRevision->getSlotParserOutputUncached(WikitextContent, boolean)
#31 /srv/mediawiki/php-1.41.0-wmf.16/includes/Revision/RevisionRenderer.php(227): MediaWiki\Revision\RenderedRevision->getSlotParserOutput(string, array)
#32 /srv/mediawiki/php-1.41.0-wmf.16/includes/Revision/RevisionRenderer.php(164): MediaWiki\Revision\RevisionRenderer->combineSlotOutput(MediaWiki\Revision\RenderedRevision, array)
#33 [internal function]: MediaWiki\Revision\RevisionRenderer->MediaWiki\Revision\{closure}(MediaWiki\Revision\RenderedRevision, array)
#34 /srv/mediawiki/php-1.41.0-wmf.16/includes/Revision/RenderedRevision.php(199): call_user_func(Closure, MediaWiki\Revision\RenderedRevision, array)
#35 /srv/mediawiki/php-1.41.0-wmf.16/includes/poolcounter/PoolWorkArticleView.php(87): MediaWiki\Revision\RenderedRevision->getRevisionParserOutput()
#36 /srv/mediawiki/php-1.41.0-wmf.16/includes/poolcounter/PoolWorkArticleViewCurrent.php(97): PoolWorkArticleView->renderRevision()
#37 /srv/mediawiki/php-1.41.0-wmf.16/includes/poolcounter/PoolCounterWork.php(166): PoolWorkArticleViewCurrent->doWork()
#38 /srv/mediawiki/php-1.41.0-wmf.16/includes/page/ParserOutputAccess.php(299): PoolCounterWork->execute()
#39 /srv/mediawiki/php-1.41.0-wmf.16/includes/page/Article.php(740): MediaWiki\Page\ParserOutputAccess->getParserOutput(WikiPage, ParserOptions, MediaWiki\Revision\RevisionStoreCacheRecord, integer)
#40 /srv/mediawiki/php-1.41.0-wmf.16/includes/page/Article.php(546): Article->generateContentOutput(User, ParserOptions, integer, OutputPage, array)
#41 /srv/mediawiki/php-1.41.0-wmf.16/includes/actions/ViewAction.php(78): Article->view()
#42 /srv/mediawiki/php-1.41.0-wmf.16/includes/MediaWiki.php(561): ViewAction->show()
#43 /srv/mediawiki/php-1.41.0-wmf.16/includes/MediaWiki.php(334): MediaWiki->performAction(Article, MediaWiki\Title\Title)
#44 /srv/mediawiki/php-1.41.0-wmf.16/includes/MediaWiki.php(927): MediaWiki->performRequest()
#45 /srv/mediawiki/php-1.41.0-wmf.16/includes/MediaWiki.php(581): MediaWiki->main()
#46 /srv/mediawiki/php-1.41.0-wmf.16/index.php(50): MediaWiki->run()
#47 /srv/mediawiki/php-1.41.0-wmf.16/index.php(46): wfIndexMain()
#48 /srv/mediawiki/w/index.php(3): require(string)
#49 {main}

Seems related to heavy use of the math tag. The page seems to have worked until Monday, if we can trust https://pageviews.wmcloud.org/?project=de.wikipedia.org&platform=all-access&agent=user&redirects=0&range=latest-20&pages=Holomorphe_Funktion

Event Timeline

If an article contains <math> at least 900 times, it usually has this error

Here are the logs related to this error:
https://logstash.wikimedia.org/goto/89f993a3c45d8849be7347260bb7c028

The restbase change was applied at this timestamp: https://sal.toolforge.org/log/03geRYkBxE1_1c7sKPgl

Which means that the errors don't necessarily correlate with the restbase/mathoid change

Errors started on Monday July 10, around 17:00 UTC. The error rate increased over the following days. The errors come in waves that correlate with CEST business hours. The recent spike may be caused by people investigating the issue.

All errors originate from dewiki. Pages on enwiki and other big wikis that use a lot of math formulas still work, even though they all run wmf.16 (since last Thursday, I assume). The admin log shows nothing relevant on July 10: https://wikitech.wikimedia.org/wiki/Server_Admin_Log#2023-07-10

This indicates that the problem is in the content of the respective pages on dewiki, rather than being caused by a change in code or configuration.

The error I see on the linked page is

Die maximale Anfragezeit von 60 {{PLURAL:60|Sekunde|Sekunden}} wurde überschritten.
[b3623713-4117-4cce-97c8-1312034e92e3] 2023-07-12 11:23:35: Fataler Ausnahmefehler des Typs „Wikimedia\RequestTimeout\RequestTimeoutException“

In addition, I see an unexpanded template in the menu.

{{Gender|Deine}} Meldungen

The fact that the timeout triggers in the math code doesn't necessarily mean it is caused by the math code. It may also be caused by something that runs before math handling and got slower.

Mathoid is almost sleeping. I remember we used to see 100+ requests/s.

Mathoid is almost sleeping. I remember we used to see 100+ requests/s.

I'm not familiar with the dashboard in question, but the request rate looks very stable at 5 reqs/sec over the last year. I think the >100req/s is before caches. A hit rate of 95% for math formulas sounds about right to me.
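The figures quoted in this exchange are internally consistent. A quick sketch using the thread's own numbers (not measured values):

```python
# Consistency check for the request rates mentioned above:
# ~100 req/s before the caches, with a ~95% cache hit rate for formulas,
# leaves ~5 req/s actually reaching Mathoid.
edge_rate = 100.0   # req/s hitting the cache layer (figure from this thread)
hit_rate = 0.95     # fraction of formula requests served from cache
backend_rate = edge_rate * (1 - hit_rate)
print(backend_rate)  # 5.0
```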

Yes, I just wanted to indicate that the mathoid server could handle more requests.

I just searched on phabricator for related errors... so far, without success. My suspicion is that a template that is often used in connection with formulae was changed.

Aklapper changed the subtype of this task from "Task" to "Production Error".Jul 12 2023, 1:31 PM

I just searched on phabricator for related errors... so far, without success. My suspicion is that a template that is often used in connection with formulae was changed.

Yea, but I don't see anything obvious... https://de.wikipedia.org/wiki/Spezial:%C3%84nderungen_an_verlinkten_Seiten?hidecategorization=1&target=Holomorphe_Funktion&namespace=10&limit=50&days=7&urlversion=2. The changes on July 10 are not even edits, they are language link changes coming from Wikidata...

Exactly. I also tried to create a copy of that page on beta, but I get

Request from 132.195.*.* via deployment-cache-text08.deployment-prep.eqiad1.wikimedia.cloud, ATS/9.1.4
Error: 504, Connection Timed Out at 2023-07-12 15:04:02 GMT

It has just been reported by a dewiki user that the LaTeX preference works fine but SVG crashes. How come?

It has just been reported by a dewiki user that the LaTeX preference works fine but SVG crashes. How come?

I can confirm that. However, in my browser it takes longer than a minute for the page https://de.wikipedia.org/wiki/Holomorphe_Funktion to load (in text mode). So a combination of different effects seems to add up to an overall page loading time that exceeds the timeout.

@daniel Do you think it would be a good idea to increase the number of concurrent requests for formulae (default 50)?

@xSavitar did an analysis and shared it on Slack. Here are the most important bits:

Using Excimer UI to profile that request, the issue seems to be coming from Cite and Parser - https://performance.wikimedia.org/excimer/profile/e30c5c689b8329a5
The link I shared above (Excimer), from Cite\ReferencesFormatter::formatListItem, 97% of the request was consumed. So Cite has something going on there.

@daniel Do you think it would be a good idea to increase the number of concurrent requests for formulae (default 50)?

I can only pass that question on to the SRE folks... @jijiki @Clement_Goubert thoughts?

To explain my thinking:

If an article contains <math> at least 900 times, it usually has this error

I checked in the config directory, and the default is set to 150. If we take $MathMathMLTimeout = 20 as an upper bound per batch, this would result in more than 4*20 s = 1 min 20 s. Increasing that to 300 would result in a higher load for RESTBase but reduce the upper bound to 40 s.

I think it is worth a try, but a brave dude would need to test this in a deployment window and roll back the change if restbase can't stand it.
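The chunking arithmetic above can be sketched generically. This is a hypothetical model of serial batching, not the actual MultiHttpClient scheduling; the batch count depends on how many <math> tags the page actually has (the 900-tag figure mentioned earlier gives 6 batches at a concurrency of 150):

```python
import math

def worst_case_seconds(n_formulae, concurrency, timeout_per_batch=20):
    """Upper bound if requests go out in serial batches of `concurrency`
    each, with every batch taking up to the per-batch timeout."""
    batches = math.ceil(n_formulae / concurrency)
    return batches * timeout_per_batch

# With the 900-tag figure mentioned earlier in this task:
print(worst_case_seconds(900, 150))  # 120 -> well over the 60 s request limit
print(worst_case_seconds(900, 300))  # 60
```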

To explain my thinking:

If an article contains <math> at least 900 times, it usually has this error

I checked in the config directory, and the default is set to 150. If we take $MathMathMLTimeout = 20 as an upper bound per batch, this would result in more than 4*20 s = 1 min 20 s. Increasing that to 300 would result in a higher load for RESTBase but reduce the upper bound to 40 s.

I think it is worth a try, but a brave dude would need to test this in a deployment window and roll back the change if restbase can't stand it.

After looking at the performance dump, it does not seem to make sense. I wonder why there are so many calls to MathML classes for formatting references.

When developing the runMulti approach to render formulae in parallel we did assume that the onParserAfterTidy hook would be called once. This seems not to be the case for that example page.

When developing the runMulti approach to render formulae in parallel we did assume that the onParserAfterTidy hook would be called once. This seems not to be the case for that example page.

Even if parsing completes? When parsing times out, I'd expect the hook not to be called afterwards... Nothing gets called after a timeout.

When developing the runMulti approach to render formulae in parallel we did assume that the onParserAfterTidy hook would be called once. This seems not to be the case for that example page.

Even if parsing completes? When parsing times out, I'd expect the hook not to be called afterwards... Nothing gets called after a timeout.

I checked the code. Each formula is rendered only once. After completion, the formula is removed from the array. So I am out of ideas for now.

When developing the runMulti approach to render formulae in parallel we did assume that the onParserAfterTidy hook would be called once. This seems not to be the case for that example page. […]

Theory 1: Slow due to small serial batches, due to nesting

This is (almost) correct. The onParserAfterTidy hook runs once "per Parser run". We generally think of pages being parsed in their entirety, and this is true for most intents and purposes. However, Cite's <ref> tags have the lovely feature of being recursive, and thus there is an onParserAfterTidy run after each <ref> invocation. This tends not to matter much for extension hooks, because their work is likely not called for, given that features like Math are generally not invoked in citations.

The stack trace in the task description, as well as the Excimer profile, strongly suggest that there is a <math> tag inside a <ref> tag. Looking in https://de.wikipedia.org/wiki/Holomorphe_Funktion?action=edit, I don't see matches for <ref>[^>]+<math so my theory is either wrong, or it's something more indirect, e.g. through a template, some other indirection, or perhaps due to a spelling mistake in the article's syntax causing something to become part of a reference that isn't intended to be.
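The single-line pattern above misses <math> tags separated from their <ref> by a line break. A multiline scan can be sketched like this; note it still only catches direct nesting, since a formula pulled in via a template would not appear in the raw wikitext at all:

```python
import re

# Toy wikitext with one <math> nested directly inside a <ref>:
wikitext = """
Some text.<ref>See
<math>e^{i\\pi}+1=0</math> for details.</ref>
Plain reference.<ref>No formula here.</ref>
"""

# DOTALL lets the match span line breaks; non-greedy .*? keeps adjacent
# refs from being merged into one match. Self-closing <ref .../> tags
# are excluded by the [^>/]* in the opening-tag pattern.
nested = [
    m.group(0)
    for m in re.finditer(r"<ref[^>/]*>.*?</ref>", wikitext, re.DOTALL)
    if "<math" in m.group(0)
]
print(len(nested))  # 1
```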

Theory 2: Slow due to small serial batches, due to global state

If we delve into unlikely theories, another theory I have is that perhaps there is something in how Math handles the onParserAfterTidy hook, causing it to process <math> tags that do not belong to the current Parser invocation. For example, if it is sharing state in some way, then perhaps it is taking care of a chunk of <math> tags up to that point, every time a <ref> tag is seen. That might be another explanation.

Theory 3: Slower because more math tags

I think @daniel ruled this out already, as indeed previous versions with approximately the same number of references and math tags did render quickly. In subsequent theories below, I'll operate on the assumption that the number of math tags on this page has not notably increased.

Theory 4: Slow due to lack of batching at curl level

If we assume for a moment that the stack trace has an alternate explanation, and that we are in fact submitting everything in one large batch to curl-multi, then one theory for the slow response (45 seconds spent) is that perhaps it is not actually dispatching all the requests at once. There may be some upper limit we're hitting to how many requests are actually sent to RESTBase at the same time. This could be due to logic in the MWHttpFactory wrapper, or a php-curl limit, or perhaps something in the outgoing Envoy proxy.

Theory 5: Slow due to rate limiting at RESTBase/Mathoid level

If we assume everything is sent at once in one large batch and received concurrently by RESTBase without throttling on the MW client side, then another issue may be that rate limiting is occurring, causing the requests to effectively be processed in a series of small chunks. The question then becomes: "Why now?", given that it used to be fine. The temporal cause there may have to do with Parsoid cache warming (T329366: Enable WarmParsoidParserCache on all wikis for T320534: Put Parsoid output into the ParserCache on every edit), which has significantly increased internal traffic from Parser/Parsoid in production. Thus some kind of (unintentional?) shared rate limit that is applied to internal IPs may be getting hit now. This would be similar to T339863: Thumbor's use of poolcounter is rate limiting Kubernetes IPs.

Conclusion

I think Theories 1-2 are most likely, because the stack trace and Excimer profile clearly indicate that Math is getting called from the Cite stack trace. Furthermore, in the "Time Order" view in Excimer UI you can see without a doubt that these are indeed separate function calls to Math's parser hook (because there is whitespace between them), and separate calls to curl.

This to me is hard evidence that we are making multiple non-overlapping requests to RESTBase.

Screenshot 2023-07-13 at 17.22.36.png (1×2 px, 272 KB)

There is, I suppose, a small chance that this does not mean there is no large batching. For example, maybe these are hundreds of accidental cheap requests, and there is still (also) a big batch at the end. The problem with that is, all the calls seem about the same size, and not a single one in the Excimer profile is from the "main" onParserAfterTidy hook invocation. All calls to MathMathML::batchEvaluate are from a Cite-triggered onParserAfterTidy hook. This may be hard to confirm visually by eye in the Time Order or Left Heavy views, but the "Sandwich" view makes this easy to see.

Screenshot 2023-07-13 at 17.25.06.png (1×2 px, 429 KB)

100% of parents of MathMathML::batchEvaluate are Cite. That probably means there are no <math> tags after the last Cite <ref>. Or, it means that <ref> is processed after <math>. I suppose one more theory could be that the same set of 1000 <math> tags is batch-evaluated repeatedly in full, on every <ref> tag, e.g. if the Parser processes all <math> tags before it handles the Cite <ref> tags. For this to be true, Math would need to have no logic that unsets or satisfies the internal buffer of Math entries after it gets the response from RESTBase.
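That last theory can be illustrated with a toy simulation (class and method names here are hypothetical, not the actual Math extension code): if the buffer of pending formulas is never drained, every ref-triggered hook run resubmits the full batch.

```python
class MathBuffer:
    """Toy model of a batch buffer flushed from a parser hook.
    drain=True mimics removing each formula once it is rendered;
    drain=False mimics the buggy behaviour hypothesised above."""
    def __init__(self, drain):
        self.pending = []
        self.drain = drain
        self.requests_sent = 0

    def add(self, formula):
        self.pending.append(formula)

    def on_parser_after_tidy(self):
        # One batch request per pending formula on every hook run.
        self.requests_sent += len(self.pending)
        if self.drain:
            self.pending.clear()

# 1000 formulas, hook fired 50 times (once per <ref> plus the main run):
results = {}
for drain, label in [(True, "drained"), (False, "not drained")]:
    buf = MathBuffer(drain)
    for i in range(1000):
        buf.add(f"f{i}")
    for _ in range(50):
        buf.on_parser_after_tidy()
    results[label] = buf.requests_sent
    print(label, buf.requests_sent)  # drained 1000 / not drained 50000
```

The drained variant sends each formula exactly once; the undrained one multiplies the work by the number of hook runs, which would match the profile's many same-sized calls.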

@Krinkle thank you for that insightful analysis.
To get more evidence, either way, I was trying to reproduce it on another server. https://de.beta.math.wmflabs.org/wiki/Holomorphe_Funktion . There I see

Fatal error: Maximum execution time of 30 seconds exceeded in /var/www/html/includes/libs/http/MultiHttpClient.php on line 272

I don't know if this helps.

@Physikerwelt Could you confirm or rule out Theory 1 or 2? I think this would be reproducible on localhost or the wmflabs test server indeed. A page with a small number of simple <math> and <ref> tags (alternating one after the other, no nesting, perhaps 10 in total), and then logging the function calls that enter batchEvaluate to see what happens after editing or purging. I.e. is it one call or multiple, and are they different batches or repeat payloads? If different, what appears (from observation) to be the reason for which math tag ends up in which batch?

@Physikerwelt Could you confirm or rule out Theory 1 or 2?

Not yet.

I think this would be reproducible on localhost or the wmflabs test server indeed.

I took a section of the page Holomorphe_Funktion

A page with a small number of simple <math> and <ref> tags (alternating one after the other, no nesting, perhaps 10 in total), and then logging the function calls that enter batchEvaluate to see what happens after editing or purging.

Logging and inserting breakpoints on a local system shows that the function is called multiple times, but the array of math formulae is filled only on the first call. This is to be expected.

I.e. is it one call or multiple, and are they different batches or repeat payloads? If different, what appears (from observation) to be the reason for which math tag ends up in which batch?

So there are multiple calls, but only one is not a no-op.

I suggest we test increasing $wgRequestTimeLimit = 60; to a higher value for dewiki for now. While this won't solve the underlying problem, at least people can read the pages (even if one still has to wait more than a minute to edit them).

An error report likely about this issue has now cropped up on English WP as well at https://en.wikipedia.org/wiki/WP:VPT#Severe_slowdown_for_math-heavy_pages .

For whatever it's worth, this bug definitely is not caused by article content changes of the past 2 weeks. It affects some pages that have not been modified in that timeframe.

It definitely has something to do with the combination of references and math tags. On multiple affected math-heavy articles which take 80+ seconds to render an edit preview, if I entirely remove all of the references in the page markup, preview rendering is very fast, only a few seconds. But it's not only references that matter; for instance, the article about Donald Trump with >800 footnotes takes <9 seconds to render a preview, whereas some of the afflicted pages only have dozens of footnotes.

As with the German experience, on English Wikipedia there seems to be a time-of-day effect. I haven't tried carefully tracking it, but it seems to be worst during evening hours in the USA, and not as bad in the wee hours of the morning.

Urbanecm_WMF renamed this task from Wikimedia\RequestTimeout\RequestTimeoutException on de:Holomorphe_Funktion to Wikimedia\RequestTimeout\RequestTimeoutException on de:Holomorphe_Funktion and several other math-heavy articles.Jul 29 2023, 11:35 AM
Urbanecm_WMF added a subscriber: Winstondegreef.
Physikerwelt claimed this task.