
Fatal exception of type "Wikimedia\RequestTimeout\EmergencyTimeoutException" errors
Open, Needs Triage · Public · PRODUCTION ERROR

Assigned To
None
Authored By
Samwalton9
Mar 23 2025, 11:37 AM
Tokens
"Barnstar" token, awarded by brennen.

Description

In the last week there have been a number of reports of the following error:

Fatal exception of type "Wikimedia\RequestTimeout\EmergencyTimeoutException"

See this WP:VPT thread: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Error_message

I've also seen a few posted on the Wikipedia Discord server, e.g.:

image.png (187×1 px, 26 KB)

Seems to be related to opening page histories or user contributions pages.


Error
normalized_message
[{reqId}] {exception_url}   Wikimedia\Rdbms\DBUnexpectedError: Cannot execute Wikimedia\Rdbms\Database::commit critical section while session state is out of sync.

A critical section from Wikimedia\Rdbms\Database::executeQuery has failed
#0 /srv/mediawik
Frame  Location  Call
from  /srv/mediawiki/php-1.44.0-wmf.21/includes/libs/rdbms/database/Database.php(3091)
#0  /srv/mediawiki/php-1.44.0-wmf.21/includes/libs/rdbms/database/Database.php(2362)  Wikimedia\Rdbms\Database->commenceCriticalSection(string)
#1  /srv/mediawiki/php-1.44.0-wmf.21/includes/libs/rdbms/database/Database.php(2533)  Wikimedia\Rdbms\Database->commit(string, string)
#2  /srv/mediawiki/php-1.44.0-wmf.21/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1676)  Wikimedia\Rdbms\Database->flushSnapshot(string)
#3  /srv/mediawiki/php-1.44.0-wmf.21/includes/libs/rdbms/lbfactory/LBFactory.php(346)  Wikimedia\Rdbms\LoadBalancer->flushReplicaSnapshots(string)
#4  /srv/mediawiki/php-1.44.0-wmf.21/includes/MediaWikiEntryPoint.php(661)  Wikimedia\Rdbms\LBFactory->commitPrimaryChanges(string)
#5  /srv/mediawiki/php-1.44.0-wmf.21/includes/MediaWikiEntryPoint.php(496)  MediaWiki\MediaWikiEntryPoint->restInPeace()
#6  /srv/mediawiki/php-1.44.0-wmf.21/includes/MediaWikiEntryPoint.php(454)  MediaWiki\MediaWikiEntryPoint->doPostOutputShutdown()
#7  /srv/mediawiki/php-1.44.0-wmf.21/includes/MediaWikiEntryPoint.php(211)  MediaWiki\MediaWikiEntryPoint->postOutputShutdown()
#8  /srv/mediawiki/php-1.44.0-wmf.21/index.php(58)  MediaWiki\MediaWikiEntryPoint->run()
#9  /srv/mediawiki/w/index.php(3)  require(string)
#10  {main}
Impact
Notes
Cannot execute Wikimedia\Rdbms\Database::commit critical section while session state is out of sync.

A critical section from Wikimedia\Rdbms\Database::executeQuery has failed
#0 /srv/mediawiki/php-1.44.0-wmf.21/vendor/wikimedia/request-timeout/src/CriticalSectionScope.php(44): Wikimedia\Rdbms\Database->Wikimedia\Rdbms\{closure}(541)
#1 /srv/mediawiki/php-1.44.0-wmf.21/includes/libs/rdbms/database/Database.php(638): Wikimedia\RequestTimeout\CriticalSectionScope->__destruct()
#2 /srv/mediawiki/php-1.44.0-wmf.21/includes/libs/rdbms/database/Database.php(1365): Wikimedia\Rdbms\Database->query(Object(Wikimedia\Rdbms\Query), 'MediaWiki\\Exten...')
#3 /srv/mediawiki/php-1.44.0-wmf.21/includes/libs/rdbms/database/DBConnRef.php(127): Wikimedia\Rdbms\Database->select(Array, Array, Array, 'MediaWiki\\Exten...', Array, Array)
#4 /srv/mediawiki/php-1.44.0-wmf.21/includes/libs/rdbms/database/DBConnRef.php(351): Wikimedia\Rdbms\DBConnRef->__call('select', Array)
#5 /srv/mediawiki/php-1.44.0-wmf.21/includes/libs/rdbms/querybuilder/SelectQueryBuilder.php(762): Wikimedia\Rdbms\DBConnRef->select(Array, Array, Array, 'MediaWiki\\Exten...', Array, Array)
#6 /srv/mediawiki/php-1.44.0-wmf.21/extensions/Disambiguator/includes/Lookup.php(85): Wikimedia\Rdbms\SelectQueryBuilder->fetchResultSet()
#7 /srv/mediawiki/php-1.44.0-wmf.21/extensions/Disambiguator/includes/Hooks.php(163): MediaWiki\Extension\Disambiguator\Lookup->filterDisambiguationPageIds(Array)
#8 /srv/mediawiki/php-1.44.0-wmf.21/includes/HookContainer/HookContainer.php(155): MediaWiki\Extension\Disambiguator\Hooks->onGetLinkColours(Array, Array, Object(MediaWiki\Title\Title))
#9 /srv/mediawiki/php-1.44.0-wmf.21/includes/HookContainer/HookRunner.php(1941): MediaWiki\HookContainer\HookContainer->run('GetLinkColours', Array)
#10 /srv/mediawiki/php-1.44.0-wmf.21/includes/parser/LinkHolderArray.php(232): MediaWiki\HookContainer\HookRunner->onGetLinkColours(Array, Array, Object(MediaWiki\Title\Title))
#11 /srv/mediawiki/php-1.44.0-wmf.21/includes/parser/LinkHolderArray.php(155): MediaWiki\Parser\LinkHolderArray->replaceInternal('<p><!--LINK'" 4...')
#12 /srv/mediawiki/php-1.44.0-wmf.21/includes/parser/Parser.php(5176): MediaWiki\Parser\LinkHolderArray->replace('<p><!--LINK'" 4...')
#13 /srv/mediawiki/php-1.44.0-wmf.21/includes/parser/Parser.php(1701): MediaWiki\Parser\Parser->replaceLinkHoldersPrivate('<p><!--LINK'" 4...')
#14 /srv/mediawiki/php-1.44.0-wmf.21/includes/parser/Parser.php(703): MediaWiki\Parser\Parser->internalParseHalfParsed('<p><!--LINK'" 4...', true, true)
#15 /srv/mediawiki/php-1.44.0-wmf.21/includes/language/MessageParser.php(191): MediaWiki\Parser\Parser->parse('<!--LINK'" 4:0-...', Object(MediaWiki\Title\Title), Object(MediaWiki\Parser\ParserOptions), true)
#16 /srv/mediawiki/php-1.44.0-wmf.21/includes/language/MessageParser.php(156): MediaWiki\Language\MessageParser->parseWithoutPostprocessing('[[WP:Featured a...', Object(MediaWiki\Title\Title), true, true, Object(LanguageEn))
#17 /srv/mediawiki/php-1.44.0-wmf.21/includes/language/Message/Message.php(1482): MediaWiki\Language\MessageParser->parse('[[WP:Featured a...', Object(MediaWiki\Title\Title), true, true, Object(LanguageEn))
#18 /srv/mediawiki/php-1.44.0-wmf.21/includes/language/Message/Message.php(1072): MediaWiki\Message\Message->parseText('[[WP:Featured a...')
#19 /srv/mediawiki/php-1.44.0-wmf.21/includes/language/Message/Message.php(1114): MediaWiki\Message\Message->format('parse')
#20 /srv/mediawiki/php-1.44.0-wmf.21/includes/changetags/ChangeTags.php(1415): MediaWiki\Message\Message->parse()
#21 /srv/mediawiki/php-1.44.0-wmf.21/includes/changetags/ChangeTags.php(774): ChangeTags::getChangeTagList(Object(MediaWiki\Context\RequestContext), Object(LanguageEn), true, true)
#22 /srv/mediawiki/php-1.44.0-wmf.21/includes/htmlform/fields/HTMLTagFilter.php(75): ChangeTags::buildTagFilterSelector('', true, Object(MediaWiki\Context\RequestContext), true, true)
#23 /srv/mediawiki/php-1.44.0-wmf.21/includes/htmlform/OOUIHTMLForm.php(177): MediaWiki\HTMLForm\Field\HTMLTagFilter->getOOUI('')
#24 /srv/mediawiki/php-1.44.0-wmf.21/includes/htmlform/HTMLForm.php(2040): MediaWiki\HTMLForm\OOUIHTMLForm->formatField(Object(MediaWiki\HTMLForm\Field\HTMLTagFilter), '')
#25 /srv/mediawiki/php-1.44.0-wmf.21/includes/htmlform/HTMLForm.php(2057): MediaWiki\HTMLForm\HTMLForm->displaySection(Array, 'mw-htmlform-con...', 'contribs-top-', true)
#26 /srv/mediawiki/php-1.44.0-wmf.21/includes/htmlform/HTMLForm.php(1614): MediaWiki\HTMLForm\HTMLForm->displaySection(Array, '')
#27 /srv/mediawiki/php-1.44.0-wmf.21/includes/htmlform/OOUIHTMLForm.php(285): MediaWiki\HTMLForm\HTMLForm->getBody()
#28 /srv/mediawiki/php-1.44.0-wmf.21/includes/htmlform/HTMLForm.php(1425): MediaWiki\HTMLForm\OOUIHTMLForm->getBody()
#29 /srv/mediawiki/php-1.44.0-wmf.21/includes/specialpage/ContributionsSpecialPage.php(990): MediaWiki\HTMLForm\HTMLForm->getHTML(true)
#30 /srv/mediawiki/php-1.44.0-wmf.21/includes/specialpage/ContributionsSpecialPage.php(319): MediaWiki\SpecialPage\ContributionsSpecialPage->getForm(Array)
#31 /srv/mediawiki/php-1.44.0-wmf.21/includes/specialpage/SpecialPage.php(729): MediaWiki\SpecialPage\ContributionsSpecialPage->execute('EncycloG')
#32 /srv/mediawiki/php-1.44.0-wmf.21/includes/specialpage/SpecialPageFactory.php(1737): MediaWiki\SpecialPage\SpecialPage->run('EncycloG')
#33 /srv/mediawiki/php-1.44.0-wmf.21/includes/actions/ActionEntryPoint.php(501): MediaWiki\SpecialPage\SpecialPageFactory->executePath('Contributions/E...', Object(MediaWiki\Context\RequestContext))
#34 /srv/mediawiki/php-1.44.0-wmf.21/includes/actions/ActionEntryPoint.php(145): MediaWiki\Actions\ActionEntryPoint->performRequest()
#35 /srv/mediawiki/php-1.44.0-wmf.21/includes/MediaWikiEntryPoint.php(202): MediaWiki\Actions\ActionEntryPoint->execute()
#36 /srv/mediawiki/php-1.44.0-wmf.21/index.php(58): MediaWiki\MediaWikiEntryPoint->run()
#37 /srv/mediawiki/w/index.php(3): require('/srv/mediawiki/...')
#38 {main}

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Production Error". · View Herald Transcript · Mar 23 2025, 11:37 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

T378138 is the only other task I could find that might be related to this?

Reedy subscribed.

The one screenshotted specifically is via Special:Contributions, but MediaWiki-extensions-Disambiguator is implicated in the stack.

Reedy updated the task description. (Show Details)

I got this error on a couple of pages just now, not sure if it's related to this.

image.png (245×1 px, 38 KB)

matmarex subscribed.

Logs show increased error rate for this exception starting about a week ago: https://logstash.wikimedia.org/goto/3498b9932b61d75ae1ae39a231279e40

image.png (267×1 px, 41 KB)

The errors occur on many different special pages and actions. This seems like a general WMF config issue, rather than a problem with any specific component.

A couple of data points:

The Wikimedia\Rdbms\DBUnexpectedError seems to be a consequence of the Wikimedia\RequestTimeout\EmergencyTimeoutException encountered earlier in processing the request - i.e., at [0] after $this->csmError is set on the Wikimedia\Rdbms\Database instance when the implicit-exit callback runs on the CriticalSectionScope (see L3106).

That seems to be consistent with the EmergencyTimeoutException being thrown after the 180s critical-section timeout has expired while we're in the mysqli::query call in DatabaseMySQL.php at L667 (i.e., upon return from C code), since in that case the scope's destructor will run (we never reach the code that explicitly exits the scope).

Here's one fairly representative example I'm looking at: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-deploy-1-7.0.0-1-2025.03.24?id=Q1v4x5UBfOjk-Vo1AEI0

If you search for reqId: e81a219b-3ca1-43e4-b688-86e46fed98cd in logstash, you'll see the associated DBUnexpectedErrors (https://logstash.wikimedia.org/goto/1a58a6bdb658114996eba1db87b735ff).

Now, what's a little odd here is what happens when we find that reqId in the mediawiki access logs: https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-default-1-1.11.0-7-2025.13?id=GVz4x5UBF1zH-YDaAwHV

We see a 500 response code, as one would expect; however, the event.duration is 705,963 microseconds, which is clearly far less than 180s.

Working backwards from other 500 responses returned by mw-web in the access logs, I see the same pattern - e.g. https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-default-1-1.11.0-7-2025.13?id=RDKkyJUBaJATmKuIiNvO (duration 1,371,330 micros) vs. https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-deploy-1-7.0.0-1-2025.03.24?id=6DqkyJUBF1zH-YDaiEyq (which is ostensibly a critical section timeout at 180s).

If we look at the rate of EmergencyTimeoutException due to the Wikimedia\Rdbms\Database 180s critical section timeout since January, we see: https://logstash.wikimedia.org/goto/85ba3cb039de25945182a7b0d676ca3c

This looks like it has been happening for quite a while now, but is potentially exacerbated by load - i.e., the uptick since 2025-03-18 correlates with entering the single-DC serving phase of the March datacenter switchover, where we're serving all traffic from eqiad.

This also doesn't seem to be a PHP 7.4 vs. 8.1 issue - we were consistently seeing this on both.

I'm staring at Database.php, the RequestTimeout library, and the excimer code, and I'm having a hard time squaring how one might potentially get a spurious timeout on the critical section timer (or a leaked timer?). Either that, or the duration reported in the access logs is bogus.

We probably need some folks with experience in the Rdbms library and / or excimer to look at this more deeply.

[0] https://gerrit.wikimedia.org/g/mediawiki/core/+/debaa0bc7080a66bfefcda00e712847598fb7b70/includes/libs/rdbms/database/Database.php#3091

Either that, or the duration reported in the access logs is bogus.

It's anecdata but I saw in one of the places this was reported that the page returned that error almost instantly, which they remarked was weird for a timeout error.

This is a GET for enwiki/Special:RecentChangesLinked/$sometitle. It fails with EmergencyTimeoutException: The critical section "Wikimedia\Rdbms\Database::executeQuery" timed out after 180 seconds, but the Apache access log tells us the request took 0.7s.

Special:RecentChangesLinked is a power-user feature, known to be slow for certain inputs. It has a 30s execution timeout on its main query and thus should stay well below both the 60s HTTP timeout (https://wikitech.wikimedia.org/wiki/HTTP_timeouts) and the 180s critical section timeout.

What's interesting to me is that the query that times out is not in fact the "main" query made during this response. The MediaWiki logs above indicate that the "emergency" critical section timeout came from Lookup::filterDisambiguationPageIds in the Disambiguator extension. This is a fairly trivial query by primary key.

reqId:"aa5a3019-54b8-421c-906f-b3bd7dc5ae87"

This is a GET for http://en.wikipedia.org/wiki/Special:Contributions/2607:FEA8:55E0:AC00:8508:3972:7701:329D. Apache tells us it took 1.3s. PHP throws the same EmergencyTimeoutException after seemingly 180s in a critical section.

Unlike the previous URL, this one is consistently quick. The IP has a single edit, and no advanced filters are active. There's no reason for this to take more than 2s.

Like the previous one, the "emergency" timeout isn't even on the main query. When MediaWiki is parsing the interface text above the list of 1 contribution(s), the Parser is doing a trivial database lookup for a template in the legend text. It's that trivial primary key query that supposedly is timing out.

The critical section "Wikimedia\Rdbms\Database::executeQuery" timed out after 180 seconds

#0 /srv/mediawiki/php-1.44.0-wmf.21/includes/libs/rdbms/database/DatabaseMySQL.php(667): Wikimedia\RequestTimeout\Detail\CriticalSection::Wikimedia\RequestTimeout\Detail\{closure}(int)
…
#9 /srv/mediawiki/php-1.44.0-wmf.21/includes/Revision/RevisionStore.php(1397): RevisionStore->loadSlotRecordsFromDb(int, int, MediaWiki\Title\Title)
…
#14 /srv/mediawiki/php-1.44.0-wmf.21/includes/parser/Parser.php(3800): RevisionRecord->hasSlot(string)
#15 /srv/mediawiki/php-1.44.0-wmf.21/includes/parser/Parser.php(3668): Parser::statelessFetchTemplate(Title, Parser)
…
#36 /srv/mediawiki/php-1.44.0-wmf.21/includes/language/Message/Message.php(1114): Message->format(string)
…
#38 /srv/mediawiki/php-1.44.0-wmf.21/includes/changetags/ChangeTags.php(774): ChangeTags::getChangeTagList(RequestContext, LanguageEn, bool, bool)
…
#46 /srv/mediawiki/php-1.44.0-wmf.21/includes/specialpage/ContributionsSpecialPage.php(990): MediaWiki\HTMLForm\HTMLForm->getHTML(bool)

Following up, I spent a bit more time in the code earlier today, and I remain puzzled as to how we could end up with spurious critical section timeouts like we seem to be observing here.

@Krinkle or @tstarling - If either of you might have ideas for what kind of information might be useful to collect in order to debug this further, I can help facilitate that.

For example, if there are some limited / low-risk changes that we could make in excimer in order to log some details when invoking timer callbacks (e.g., if it can be inferred that a given timer specifically "looks like" a critical section timer, possibly by way of the period = 0 initial = 180 it was started with), I could perhaps build and deploy a patched package.
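
To make that concrete, here is a hypothetical sketch of the kind of check such a logging patch might perform when dispatching a timer event. The struct and field names are illustrative only, not Excimer's actual internals; it simply logs when an event looks like a one-shot critical-section (180s) timer, or appears to belong to a different request.

#include <stdio.h>
#include <time.h>

/* Hypothetical timer record; fields are illustrative, not Excimer's real structs. */
struct timer_info {
    struct timespec period;   /* {0, 0} for one-shot timers */
    struct timespec initial;  /* first expiry; 180s for critical sections */
    long request_id;          /* request that started the timer */
};

/* Called (hypothetically) whenever a timer event is about to be dispatched. */
static void log_timer_event(const struct timer_info *t, long current_request_id)
{
    int one_shot = (t->period.tv_sec == 0 && t->period.tv_nsec == 0);

    if (one_shot) {
        fprintf(stderr, "excimer-debug: one-shot timer fired (initial=%lds)\n",
                (long)t->initial.tv_sec);
        if (t->initial.tv_sec == 180) {
            fprintf(stderr, "excimer-debug: looks like a critical-section timer\n");
        }
    }
    if (t->request_id != current_request_id) {
        fprintf(stderr, "excimer-debug: timer started in request %ld fired in request %ld\n",
                t->request_id, current_request_id);
    }
}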

The Village pump thread continues to have more reports, and certainly confirms beyond a doubt that these are shown "immediately", so the critical section timer callback is definitely being invoked too soon.

Logstash shows the URLs include article view, article history, special pages, api.php, rest.php. No particular action, wiki, db cluster, or interface language.

Logstash shows it affecting kube-mw-web, kube-mw-api-ext but also kube-mw-api-in and kube-mw-jobrunner.

Ideally we'd catch one on mw-debug with verbose logging to Logstash enabled, but I don't see any in the last 30 days, and I haven't encountered it or been able to reproduce it myself either.

Looking at the source code in the MediaWiki Rdbms lib and the wikimedia/request-timeout lib, it's actually not obvious to me why critical sections have their own dedicated timeout. I get why they should not last indefinitely, as that would mean an unclosed section would basically disable the overall request timeout (by indefinitely buffering and postponing a request timeout exception) and thus fall back to a hard kill at the CGI level rather than a clean shutdown. However, it seems like these could perhaps be automatically aligned with the request timeout (with e.g. a 1 second margin) rather than having their own separately assigned timeout duration. However, that should be orthogonal to this issue.

If it is or becomes too widespread, we could certainly also disable the critical section timeout, since it (afaik) exists primarily to detect mistakes in source code. Under normal operation, the request timer should suffice. The "emergency" timeout can be set to 0 or INF to disable it. This does not disable "critical section" protection, but rather removes the maximum duration such a section is allowed to last. So in the event that there is a (new) bug in the Rdbms code that neglects to close such a section, requests that both trip that code and take longer than the request timeout won't throw TimeoutException with good logging telemetry and a clean shutdown, but will instead get killed at the CGI level. That still seems like an improvement over randomly crashing real-user traffic.

Random guess that came to mind: Is it possible that something starts a critical section but never ends it, and then 180 seconds later the same worker (even if it is responding to a completely different request) throws?

A quick spot check this afternoon after we re-pooled codfw around 14:15 UTC today:

On the theory that the underlying issue is exacerbated by load, we expect that returning to multi-DC serving should reduce the rate of exceptions.

Arbitrarily picking the 22:00 UTC hour today (2025-03-26) and comparing with the same hour yesterday (2025-03-25), we see 47 vs. 616 EmergencyTimeoutExceptions being thrown due to critical section timeouts, respectively, which is consistent with that theory.

If it is or becomes too widespread, we could certainly also disable the critical section timeout, since it (afaik) exists primarily to detect mistakes in source code. Under normal operation, the request timer should suffice. The "emergency" timeout can be set to 0 or INF to disable it. [...]

Ah, indeed it can be - overriding CriticalSectionTimeLimit seems like a solid option to keep in mind if this significantly worsens again.

Random guess that came to mind: Is it possible that something starts a critical section but never ends it, and then 180 seconds later the same worker (even if it is responding to a completely different request) throws?

Definitely a reasonable guess!

So, if the critical section is never explicitly exited, I would expect one of a couple of layers of defense to clean up the timer:

  • The CriticalSectionScope destructor, which, a couple of layers down, should result in a call to stop in the underlying ExcimerTimer instance.
  • The ExcimerTimer free_obj handler, which, among other things, should run the same cleanup code as stop.
  • If all else fails, the post-deactivate hook in the excimer module, which should run the same cleanup code on all remaining timers created over the lifetime of the request.

That said, I don't know PHP internals anywhere near well enough to state whether there are scenarios where all of these could be skipped (and the worker serve further requests).

@Krinkle or @tstarling - If either of you might have ideas for what kind of information might be useful to collect in order to debug this further, I can help facilitate that.

Ideally I would like to have a reproducible test case. I'm not fussy, it can be in production, and it can be something that happens randomly as long as you can make it happen in less than an hour on a given server.

On the theory that the underlying issue is exacerbated by load, we expect that returning to multi-DC serving should reduce the rate of exceptions.

So hit a server hard with ab, choosing a URL that hits the stack traces you see in the exceptions that occur naturally. For purposes of isolating a bug like this, we want it to be as exacerbated as possible.

brennen subscribed.

It seems like the right folks are already aware of this, but noting that log weather feels unusually bad at the moment, and subjectively I've had a lot of conversations with deployers and train conductors over the last week prompted by the volume of these and possibly-related errors.

2025-03-26-13:51:46.png (240×1 px, 61 KB)

[…]
Logstash shows the URLs include article view, article history, special pages, api.php, rest.php. No particular action, wiki, db cluster, or interface language.

Logstash shows it affecting kube-mw-web, kube-mw-api-ext but also kube-mw-api-in and kube-mw-jobrunner.

So... it's pretty much evenly distributed at the level above the container (user request) and below (database, memcached). What about the pods themselves? Scott already looked at correlating with specific k8s pods and/or their uptime or stage of their lifecycle and found nothing. I wanted to have a second look and write some of it down in the task.

At a glance, pods seem affected equally. But are they generally affected close to startup (during deployment), does it start later, or perhaps during shutdown?

Looking at mw-web.eqiad.main-db878dd57-m4fkz for example (in the Logstash mediawiki-errors dashboard), if I disable the "Error channels" filter briefly, I can deduce this pod was booted around 23:05 and was up for ~9h until 08:25. It experienced exactly 2 requests with EmergencyTimeoutException, at 00:17 and 00:20.

mw-web.eqiad.main-6c9d5d64f6-t9ctg was up for only 30min, from 22:45 to 23:10, and experienced exactly 1, at 23:04.

There isn't any obvious correlation here, but to some extent that's a good thing. If it truly affects them all equally not too long after they start, then we ought to be able to reproduce it. For it to be hard to reproduce, there would have to be something unpredictable about it, and as annoying as this is for us and end-users, the fact that it happens to all pods not too long into their life should make it reproducible.

What I haven't considered yet, though, is the "unknowns". Among the pods whose names we see in Logstash as reporting this error, it seems to affect them all equally. But there could of course be hundreds more pods that are not affected. I haven't compared these to a k8s listing of all pods. But there are at least enough different pod names in Logstash that it seems likely reproducible, and the overall bar charts counting the instances plot a constant enough line as well. So while it might rely on a freak accident to happen for one pod/request, holistically it happens fairly predictably across the board.

Screenshot 2025-03-27 at 17.16.20.png (720×1 px, 93 KB)

[…] Ideally I would like to have a reproducible test case. I'm not fussy, it can be in production, and it can be something that happens randomly as long as you can make it happen in less than an hour on a given server. […]

So hit a server hard with ab, choosing a URL that hits the stack traces you see in the exceptions that occur naturally. For purposes of isolating a bug like this, we want it to be as exacerbated as possible.

I've picked a URL and am now hitting mwdebug1002 with this URL to see if we can cause it on purpose.

I settled on https://en.wikipedia.org/wiki/Special:Log/SeaSplashOctopod, which I found in Logstash sometime in the last 90 days. I like it because it ticks these boxes: similar to mwdebug1002 (in eqiad, on PHP 7.4, in the mw-web cluster), a URL that I know to be simple and fast in what it should do in MW, and a URL that I know does not make risky or problematic mysql queries (i.e. we don't bring down mysql while doing this).

Given that special pages like Special:Contributions and such do involve some rate limiting, I didn't want to start out with a ton of concurrency. Just a high request count, and let it roll the dice for a while. I gave it 10,000, which may as well have been infinity at concurrency 2. The idea is to kill it once you have a hit for EmergencyTimeoutException on the mwdebug Logstash dashboard.

krinkle@mwdebug1001:~$ ab -n 10000 -c 2 -H 'Host: en.wikipedia.org' 'https://mwdebug1001.eqiad.wmnet/wiki/Special:Log/SeaSplashOctopod'

After about 15 minutes, Logstash showed a handful of hits. I kept the dashboard filtering loose, merely channel:exception rather than just EmergencyTimeoutException, so that if anything else starts happening, I'll know.

On exit, ab reported it had done about 1400 requests by then, so I decided to repeat it, and got it again.

Screenshot 2025-03-27 at 17.26.19.png (1×2 px, 446 KB)

The following seems to work fairly well:

krinkle@mwdebug1001:~$ ab -n 1000 -c 3 -H 'Host: en.wikipedia.org' 'https://mwdebug1001.eqiad.wmnet/wiki/Special:Log/SeaSplashOctopod'

Starting at 00:25 UTC:

  • +00:05 (req 200): nothing.
  • +00:06 (req 300): 2 requests fail with EmergencyTimeoutException.
  • +00:07 (req 400): 1 more fails with EmergencyTimeoutException.
  • +00:12 (req 700): nothing.
  • +00:13 (req 800): 1 more fails with EmergencyTimeoutException.
  • +00:17 (req 900): 1 more fails with EmergencyTimeoutException.
  • +00:18 (req 1000) nothing, done.

Screenshot 2025-03-27 at 17.44.12.png (1×2 px, 149 KB)

Many thanks for demonstrating how to reproduce this, @Krinkle. I'm quite pleasantly surprised that you were able to do so within only a couple hundred requests.

Unless anyone has more sophisticated ideas in the meantime, I would like to temporarily deploy a lightly modified rebuild of excimer.so on mwdebug1001 and repeat the tests from T389734#10685360 - basically with some additional logging around CriticalSection-related timer events and carrying some additional debugging state (e.g., current-request vs. request-at-time-start) - to rule out some "should not happen" scenarios.

I plan to do this on Monday during a time window free of deployments.

Removing SRE and adding serviceops-radar.

Pppery renamed this task from Fatal exception of type "Wikimedia\RequestTimeout\EmergencyTimeoutException" errors to Fatal exception of type "Wikimedia\RequestTimeout\EmergencyTimeoutException" or "Wikimedia\Rdbms\DBUnexpectedError" errors. · Sun, Mar 30, 11:24 PM
Pppery added a subscriber: LD.

Hi, I got several of these errors this morning (CET) while accessing pages on Commons: https://commons.wikimedia.org/wiki/Commons:Undeletion_requests/Current_requests

MediaWiki internal error.

Original exception: [08c8f677-a901-467d-ae8c-d573372e72a3] 2025-03-31 06:52:35: Fatal exception of type "Wikimedia\Rdbms\DBUnexpectedError"

Exception caught inside exception handler.

Set $wgShowExceptionDetails = true; at the bottom of LocalSettings.php to show detailed debugging information.
Original exception: [08c8f677-a901-467d-ae8c-d573372e72a3] 2025-03-31 06:52:35: Fatal exception of type "Wikimedia\Rdbms\DBUnexpectedError"
Logstash record
[08c8f677-a901-467d-ae8c-d573372e72a3]
DBUnexpectedError: Database servers in extension1 are overloaded. In order to protect application servers, the circuit breaking to databases of this section have been activated. Please try again a few seconds.

This is T390510: Fatal DBUnexpectedError: "Database servers in extension1 are overloaded".

Mentioned in SAL (#wikimedia-operations) [2025-03-31T17:14:07Z] <swfrench-wmf> attempting to reproduce T389734 with enhanced logging on mwdebug1001

Alright, after deploying a patched excimer 1.1.1 .so file to mwdebug1001 that simply adds some rudimentary logging (P74507), I have some curious observations:

Although I can readily reproduce the EmergencyTimeoutException using the procedure from T389734#10685360, I can't seem to find any evidence that the VM interrupt handler (excimer_timer_interrupt) ever executes a callback [0] clearly associated with a critical-section (180s) timer event.

Specifically, the main thing that patch does is add some logging to the interrupt handler to note when:

  1. any one-shot timer (no period) event fires
  2. any timer event that specifically looks like a critical section timer (180s) fires
  3. any timer event fires that somehow appears to be from a different request than the current one (e.g., as if the timer was somehow leaked, despite all the defenses against that).

There's also a bit of logging around timer start, but that was only added very briefly at the end of the test to confirm that the logging was indeed working at all.

None of these three conditions were met, despite EmergencyTimeoutException clearly happening.

Indeed, if "no one-shot timer event fires" holds, then I'm rather curious as to what might be invoking CriticalSection's emergency callback.

[0] https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/php/excimer/+/refs/tags/1.1.1/excimer_timer.c#324

On mwdebug1001, I installed excimer symbols and attached to some FPM worker processes with gdb. I set a breakpoint on ExcimerTimer_event.

Apologies for the rambling, I am mostly using this space to organise my own thoughts.

I intercepted an emergency timeout with a breakpoint on ExcimerTimer_event.

timers_by_id has three old garbage timers and one current one. It's the current one that apparently fired:

[EDIT --- SNIP]

The garbage was apparently just due to attempting to single-step the program. The next two times I reproduced this, there was no garbage. Just four normal-looking timers: two profilers (CPU and wall clock) and two ExcimerTimer timers (request timeout and emergency timeout).

I'm pretty sure that the timers are not left over from a previous request. We're receiving an event with userdata containing the ID for the emergency timeout active in the current process. IDs are allocated in a monotonically increasing sequence during the lifetime of the FPM worker, so having a high ID identifies it as belonging to the current request.

When the event occurs, the object is valid and has a timeout of 180 seconds.

(gdb) print *(ExcimerTimer_obj*)($6.user_data)
$7 = {timer = {is_valid = 1, is_running = 1, vm_interrupt_ptr = 0x55b66aee18e2 <executor_globals+546> "", id = 744663, clock_id = 1, timer_id = 0x55b66bfc9850, 
    callback = 0x7f1c41fa33e0 <ExcimerTimer_event>, user_data = 0x7f1c215d5a80, event_counts_ptr = 0x7f1c41fad540 <excimer_timer_tls>, 
    thread_mutex_ptr = 0x7f1c41fad548 <excimer_timer_tls+8>}, period = {tv_sec = 0, tv_nsec = 0}, initial = {tv_sec = 180, tv_nsec = 0}, event_type = 0, z_callback = {...}, std = {...}}

I confirmed that the event occurred less than 180 seconds from the start of the request as recorded in sapi_globals.global_request_time.

The code in Excimer is really old. The package is from 2023 but the code was mostly written in 2018. I confirmed that the package predates mszabo's work. Here are the exception rates going back to January, by log filename, so they're off by a day compared to the actual event timestamps:

emergency-timeout.png (798×1 px, 40 KB)

I'd really like to know if there was a time when the rate of these exceptions was zero, because that's what I'd expect it to be. At the start of January it was around 50 per day.

Excimer is not aware of concurrency or load. It doesn't use shared memory.

Did the kernel version change? The implementation of POSIX timers seems to be in the kernel; glibc is just a wrapper.

Emergency timeouts, unlike other kinds of timers, are started very frequently and are usually stopped and destroyed within 1ms after creation. So they might exercise some unusual code paths in the kernel.

[pid 12072] 1743490964.602461 timer_create(CLOCK_MONOTONIC, {sigev_value={sival_int=1812204880, sival_ptr=0x55b66c040d50}, sigev_signo=SIGRTMIN, sigev_notify=SIGEV_THREAD_ID, sigev_notify_thread_id=12152}, [3168441]) = 0
[pid 12072] 1743490964.602563 timer_settime(3168441, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=180, tv_nsec=0}}, NULL) = 0
[pid 12072] 1743490964.602656 timer_settime(3168441, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=0}}, NULL) = 0
[pid 12072] 1743490964.602763 timer_delete(3168441) = 0

I wonder if timer_gettime() would indicate that a non-zero amount of time is remaining. It's signal-safe, so we should be able to call it from excimer_timer_handle().
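
As a rough standalone sketch of that check (not Excimer code; it assumes the timer's creator stores a pointer to its timer_t in sival_ptr so the notification callback can reach it, and it would be linked with -lrt like the test case further down):

#include <signal.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* When a notification fires, ask the timer itself how much time it thinks is
 * remaining. A genuine one-shot expiry should report zero; a misdirected
 * event for a freshly armed timer would not. */
static void on_timer_event(union sigval sv)
{
    timer_t id = *(timer_t *)sv.sival_ptr;  /* assumes the creator stored &timer here */
    struct itimerspec cur;

    if (timer_gettime(id, &cur) == 0 &&
        (cur.it_value.tv_sec != 0 || cur.it_value.tv_nsec != 0)) {
        /* fprintf is fine here because SIGEV_THREAD callbacks run in a thread;
         * a real signal handler would have to use write() instead. */
        fprintf(stderr, "suspicious timer event: %ld.%09lds still remaining\n",
                (long)cur.it_value.tv_sec, (long)cur.it_value.tv_nsec);
    }
}

int main(void)
{
    static timer_t id;
    struct sigevent ev = {
        .sigev_notify = SIGEV_THREAD,
        .sigev_notify_function = on_timer_event,
        .sigev_value.sival_ptr = &id,
    };
    struct itimerspec its = {.it_value = {.tv_nsec = 1000}};

    timer_create(CLOCK_MONOTONIC, &ev, &id);
    timer_settime(id, 0, &its, NULL);
    sleep(1);   /* a genuine expiry prints nothing */
    timer_delete(id);
    return 0;
}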

I think the next thing to try is a simple loop of these four syscalls. See if we can find anything weird going on without PHP or Excimer being involved.
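
Presumably something along the lines of the following (a guess at the shape of such a loop, not the exact code used): create a 180s one-shot timer, disarm it, delete it, and abort if its callback ever runs. In hindsight, note that this only ever has one timer alive at a time, so there is no in-flight event from a second timer to be misdelivered, which may be why it did not reproduce the problem.

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Abort if the 180s one-shot timer ever fires: with the timer disarmed and
 * deleted microseconds after creation, this callback should never run. */
static void on_timer(union sigval sv)
{
    (void)sv;
    abort();
}

int main(void)
{
    struct sigevent ev = {
        .sigev_notify = SIGEV_THREAD,
        .sigev_notify_function = on_timer,
    };
    struct itimerspec arm = {.it_value = {.tv_sec = 180}};
    struct itimerspec disarm = {.it_value = {.tv_sec = 0}};

    for (;;) {
        timer_t id;
        if (timer_create(CLOCK_MONOTONIC, &ev, &id)) {
            perror("timer_create");
            return 1;
        }
        timer_settime(id, 0, &arm, NULL);    /* arm for 180s */
        timer_settime(id, 0, &disarm, NULL); /* disarm almost immediately */
        timer_delete(id);
    }
}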

My little loop was not successful after a couple of hours, but I was able to capture the full event in strace using @Krinkle's reproduction procedure. It does seem to confirm that there is a kernel issue.

SIGEV_THREAD in the application is translated by libc to SIGEV_THREAD_ID. libc starts a helper thread, here 22113, which waits for signals from the kernel using rt_sigtimedwait(). When it receives a signal, it finds its internal structure using sival_ptr, clones a new thread and passes the original userdata to it.

In the following strace output, a timer is created, then set to 180s expiry, then 366µs later, the libc helper thread is woken, with rt_sigtimedwait() producing the same sival_ptr that was just created.

22054 1743564752.546195 timer_create(CLOCK_MONOTONIC, {sigev_value={sival_int=-1101803440, sival_ptr=0x5592be53d050}, sigev_signo=SIGRTMIN, sigev_notify=SIGEV_THREAD_ID, sigev_notify_thread_id=22113}, [2006759]) = 0
22054 1743564752.546224 timer_settime(2006759, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=180, tv_nsec=0}}, NULL) = 0
22113 1743564752.546590 <... rt_sigtimedwait resumed> {si_signo=SIGRTMIN, si_code=SI_TIMER, si_timerid=0x1e9eb9, si_overrun=0, si_value={int=-1101803440, ptr=0x5592be53d050}}, NULL, 8) = 32 (SIGRTMIN)
22113 1743564752.546751 rt_sigtimedwait([RTMIN],  <unfinished ...>
...
22054 1743564752.547640 timer_settime(2006759, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=0}}, NULL) = 0
22054 1743564752.547681 timer_delete(2006759) = 0

The corresponding exception log entry arrived in logstash about 2ms later, at 03:32:32.549.

The same kernel timer ID had previously been used for a periodic timer, which was deleted just 1ms before the event.

22054 1743564752.497749 timer_create(0xfffffffffffd4ece /* CLOCK_??? */, {sigev_value={sival_int=-1101803440, sival_ptr=0x5592be53d050}, sigev_signo=SIGRTMIN, sigev_notify=SIGEV_THREAD_ID, sigev_notify_thread_id=22113}, [2006713]) = 0
22054 1743564752.497786 timer_settime(2006713, 0, {it_interval={tv_sec=0, tv_nsec=20000000}, it_value={tv_sec=0, tv_nsec=20000000}}, NULL) = 0
...
22054 1743564752.545302 timer_gettime(2006713, {it_interval={tv_sec=0, tv_nsec=20000000}, it_value={tv_sec=0, tv_nsec=1}}) = 0
22054 1743564752.545343 timer_settime(2006713, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=0}}, NULL) = 0
22054 1743564752.545387 timer_delete(2006713) = 0
...
22113 1743564752.546590 <... rt_sigtimedwait resumed> {si_signo=SIGRTMIN, si_code=SI_TIMER, si_timerid=0x1e9eb9, si_overrun=0, si_value={int=-1101803440, ptr=0x5592be53d050}}, NULL, 8) = 32 (SIGRTMIN)

That last log line is our fatal event, with si_timerid=0x1e9eb9=2006713.

So that isolates it pretty well. timer_delete() failed to cancel an in-flight event, which was then misinterpreted by glibc as an event from a newly created timer reusing the same kernel ID.

Reduced test case, aborts in <1s.

gcc timer-stress.c -std=gnu17 -lrt -o timer-stress

#include <signal.h>
#include <time.h>
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <stdlib.h>

void on_good_timer(union sigval sv) {
}

void on_bad_timer(union sigval sv) {
    long t = time(NULL);
    long delta = t - (long)sv.sival_ptr;
    char message[100];
    snprintf(message, sizeof(message), "Timer expired after %ld seconds\n", delta);
    write(STDERR_FILENO, message, strlen(message));
    abort();
}

int main(int argc, char** argv) {
    timer_t id;
    struct itimerspec its_long = {.it_value = {.tv_sec = 180}};
    struct itimerspec its_short = {.it_value = {.tv_nsec = 1000}};
    struct itimerspec its_zero = {.it_interval = {.tv_sec = 0}};

    struct sigevent ev_short = {
        .sigev_notify = SIGEV_THREAD,
        .sigev_notify_function = on_good_timer,
    };

    struct sigevent ev_long = {
        .sigev_notify = SIGEV_THREAD,
        .sigev_notify_function = on_bad_timer,
    };

    for (int which = 0; ; which = !which) {
        struct sigevent * ev = which ? &ev_short : &ev_long;
        struct itimerspec * its = which ? &its_short : &its_long;

        ev->sigev_value.sival_ptr = (void*)time(NULL);
        if (timer_create(CLOCK_MONOTONIC, ev, &id)) {
            perror("timer_create");
            continue;
        }
        if (timer_settime(id, 0, its, NULL)) {
            perror("timer_settime(x)");
            goto cleanup;
        }
        if (timer_settime(id, 0, &its_zero, NULL)) {
            perror("timer_settime(0)");
        }
cleanup:
        if (timer_delete(id)) {
            perror("timer_delete");
        }
    }
}

I think this is a glibc bug; I don't think the kernel devs would accept it.

Back when this code was just in LuaSandbox (the profiler there was a prototype for Excimer), we were using signals instead of thread events, and we found out that events can be delivered after timer_delete(). Apparently that is documented behaviour. Brad implemented non-reused integer IDs and a hashtable as a workaround, and I copied that code into Excimer in Excimer's initial commit.
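
For illustration, that guard looks roughly like this: each timer gets a never-reused integer ID, and events whose ID is no longer registered are dropped. This is a sketch of the pattern, not LuaSandbox's or Excimer's actual code, and a fixed-size array stands in for the hashtable; as noted below, it is not sufficient under SIGEV_THREAD, because glibc re-maps the stale event to the new timer before the application-level check ever runs.

#include <pthread.h>
#include <signal.h>

#define MAX_TIMERS 64

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static long active_ids[MAX_TIMERS];  /* 0 = empty slot */
static long next_id = 1;             /* monotonically increasing, never reused */

/* Called from the timer creation path (not shown). */
static long register_timer(void)
{
    pthread_mutex_lock(&table_lock);
    long id = next_id++;
    for (int i = 0; i < MAX_TIMERS; i++) {
        if (active_ids[i] == 0) { active_ids[i] = id; break; }
    }
    pthread_mutex_unlock(&table_lock);
    return id;
}

/* Called from the timer deletion path (not shown). */
static void unregister_timer(long id)
{
    pthread_mutex_lock(&table_lock);
    for (int i = 0; i < MAX_TIMERS; i++) {
        if (active_ids[i] == id) { active_ids[i] = 0; break; }
    }
    pthread_mutex_unlock(&table_lock);
}

/* Event handler: the ID travels in sival_int; late deliveries for deleted
 * timers are silently ignored. */
static void on_timer_event(union sigval sv)
{
    int live = 0;
    pthread_mutex_lock(&table_lock);
    for (int i = 0; i < MAX_TIMERS; i++) {
        if (active_ids[i] == sv.sival_int) { live = 1; break; }
    }
    pthread_mutex_unlock(&table_lock);
    if (!live) {
        return;  /* stale event from an already-deleted timer */
    }
    /* ... handle a genuine expiry ... */
}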

But as we now see, there's no point guarding against event delivery after timer_delete() when you're using SIGEV_THREAD, because those events come via glibc and glibc's protection against this same issue is incomplete.

glibc doesn't just dereference a dangling pointer; it checks that the pointer is still in its list of active timers. But it immediately reuses the pointer values.

Based on code review, the issue is present in glibc's git master.

Change #1133575 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] Temporarily disable Lua profiler

https://gerrit.wikimedia.org/r/1133575

Change #1133575 merged by jenkins-bot:

[operations/mediawiki-config@master] Temporarily disable Lua profiler

https://gerrit.wikimedia.org/r/1133575

Mentioned in SAL (#wikimedia-operations) [2025-04-03T00:01:36Z] <tstarling@deploy1003> Started scap sync-world: Backport for [[gerrit:1133575|Temporarily disable Lua profiler (T389734)]]

Mentioned in SAL (#wikimedia-operations) [2025-04-03T00:08:28Z] <tstarling@deploy1003> tstarling: Backport for [[gerrit:1133575|Temporarily disable Lua profiler (T389734)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-04-03T00:16:41Z] <tstarling@deploy1003> Finished scap sync-world: Backport for [[gerrit:1133575|Temporarily disable Lua profiler (T389734)]] (duration: 15m 04s)

A likely cause for the increase in exception rate in February is this edit. It added a Lua module invocation via {{enum}} to enwiki change tag descriptions. You can see in the logstash records that a lot of backtraces have ChangeTags::getChangeTagList() in them.

There is normally one Lua profiler timer per parse operation, so with main content parsing that's one timer creation per request. But with message parsing, there is one timer creation per message, and there are 115 tag description messages, so 115 timer creations and deletions.

The fix for T344352 was just merged, and that would have helped a lot by avoiding unnecessary description parsing.

Lua is relevant because the bug relates to deletion and recreation of a timer when a kernel timer event is in flight. RequestTimeout can escalate a misdirected event to an exception, but it doesn't generate orphaned in-flight timer events. LuaSandbox, with its 20ms profiling events, was the main source of in-flight events.

With filter exception.class: Wikimedia\RequestTimeout\EmergencyTimeoutException

emergency-exception-rate.png (263×649 px, 20 KB)

This shows the effect of disabling the Lua profiler. There were two exceptions since I disabled it, but the logs seem to suggest that those requests actually ran for 3 minutes.

Still to do:

  • Write a patch for glibc. Their Bugzilla has a lot of tumbleweeds; I don't think bugs just fix themselves.
  • Rewrite the timer backend in LuaSandbox and Excimer.
matmarex renamed this task from Fatal exception of type "Wikimedia\RequestTimeout\EmergencyTimeoutException" or "Wikimedia\Rdbms\DBUnexpectedError" errors to Fatal exception of type "Wikimedia\RequestTimeout\EmergencyTimeoutException" errors. · Mon, Apr 7, 11:19 PM