Error: Class 'Cdb\Exception' not found
Closed, Resolved · Public · PRODUCTION ERROR

Description

Error
normalized_message
[{reqId}] {exception_url}   Error: Class 'Cdb\Exception' not found
exception.trace
from /srv/mediawiki/php-1.41.0-wmf.16/vendor/wikimedia/cdb/src/Reader/DBA.php(31)
#0 /srv/mediawiki/php-1.41.0-wmf.16/vendor/wikimedia/cdb/src/Reader.php(41): Cdb\Reader\DBA->__construct(string)
#1 /srv/mediawiki/php-1.41.0-wmf.16/includes/language/LCStoreCDB.php(61): Cdb\Reader::open(string)
#2 /srv/mediawiki/php-1.41.0-wmf.16/includes/language/LocalisationCache.php(430): LCStoreCDB->get(string, string)
#3 /srv/mediawiki/php-1.41.0-wmf.16/includes/language/LocalisationCache.php(327): LocalisationCache->loadItem(string, string)
#4 /srv/mediawiki/php-1.41.0-wmf.16/includes/language/LocalisationCache.php(552): LocalisationCache->getItem(string, string)
#5 /srv/mediawiki/php-1.41.0-wmf.16/includes/language/LocalisationCache.php(450): LocalisationCache->initLanguage(string)
#6 /srv/mediawiki/php-1.41.0-wmf.16/includes/language/LocalisationCache.php(349): LocalisationCache->loadSubitem(string, string, string)
#7 /srv/mediawiki/php-1.41.0-wmf.16/includes/language/MessageCache.php(728): LocalisationCache->getSubitem(string, string, string)
#8 /srv/mediawiki/php-1.41.0-wmf.16/includes/language/MessageCache.php(1298): MessageCache->isMainCacheable(string, string)
#9 /srv/mediawiki/php-1.41.0-wmf.16/includes/language/MessageCache.php(1189): MessageCache->getMsgFromNamespace(string, string)
#10 /srv/mediawiki/php-1.41.0-wmf.16/includes/language/MessageCache.php(1160): MessageCache->getMessageForLang(LanguageEn, string, boolean, array)
#11 /srv/mediawiki/php-1.41.0-wmf.16/includes/language/MessageCache.php(1058): MessageCache->getMessageFromFallbackChain(LanguageEn, string, boolean)
#12 /srv/mediawiki/php-1.41.0-wmf.16/includes/language/Message.php(1476): MessageCache->get(string, boolean, LanguageEn)
#13 /srv/mediawiki/php-1.41.0-wmf.16/includes/language/Message.php(971): Message->fetchMessage()
#14 /srv/mediawiki/php-1.41.0-wmf.16/includes/language/Message.php(1065): Message->format(string)
#15 /srv/mediawiki/php-1.41.0-wmf.16/includes/Status.php(213): Message->plain()
#16 /srv/mediawiki/php-1.41.0-wmf.16/extensions/PageViewInfo/includes/WikimediaPageViewService.php(322): Status->getWikiText(boolean, boolean, string)
#17 /srv/mediawiki/php-1.41.0-wmf.16/extensions/PageViewInfo/includes/WikimediaPageViewService.php(131): MediaWiki\Extension\PageViewInfo\WikimediaPageViewService->makeRequest(string)
#18 /srv/mediawiki/php-1.41.0-wmf.16/extensions/PageViewInfo/includes/CachedPageViewService.php(198): MediaWiki\Extension\PageViewInfo\WikimediaPageViewService->getPageData(array, integer, string)
#19 /srv/mediawiki/php-1.41.0-wmf.16/extensions/PageViewInfo/includes/CachedPageViewService.php(62): MediaWiki\Extension\PageViewInfo\CachedPageViewService->getTitlesWithCache(string, array)
#20 /srv/mediawiki/php-1.41.0-wmf.16/extensions/GrowthExperiments/includes/UserImpact/ComputedUserImpactLookup.php(438): MediaWiki\Extension\PageViewInfo\CachedPageViewService->getPageData(array, integer)
#21 /srv/mediawiki/php-1.41.0-wmf.16/extensions/GrowthExperiments/includes/UserImpact/ComputedUserImpactLookup.php(386): GrowthExperiments\UserImpact\ComputedUserImpactLookup->getPageViewDataInJobContext(array, User, integer)
#22 /srv/mediawiki/php-1.41.0-wmf.16/extensions/GrowthExperiments/includes/UserImpact/ComputedUserImpactLookup.php(184): GrowthExperiments\UserImpact\ComputedUserImpactLookup->getPageViewData(User, array, array, integer)
#23 /srv/mediawiki/php-1.41.0-wmf.16/extensions/GrowthExperiments/includes/UserImpact/RefreshUserImpactJob.php(156): GrowthExperiments\UserImpact\ComputedUserImpactLookup->getExpensiveUserImpact(User)
#24 /srv/mediawiki/php-1.41.0-wmf.16/extensions/GrowthExperiments/includes/UserImpact/RefreshUserImpactJob.php(110): GrowthExperiments\UserImpact\RefreshUserImpactJob->computeUserImpact(integer)
#25 /srv/mediawiki/php-1.41.0-wmf.16/extensions/EventBus/includes/JobExecutor.php(78): GrowthExperiments\UserImpact\RefreshUserImpactJob->run()
#26 /srv/mediawiki/rpc/RunSingleJob.php(77): MediaWiki\Extension\EventBus\JobExecutor->execute(array)
#27 {main}
Impact

RefreshUserImpactJob fails and doesn't update the user impact, potentially resulting in outdated entries displaying in Special:Homepage / Special:Impact.

Notes

Details

Request URL
https://jobrunner.discovery.wmnet/rpc/RunSingleJob.php

Event Timeline

Restricted Application added a subscriber: Aklapper.

The exception that was supposed to be thrown is "Unable to open CDB file". There is no obvious reason why the exception class couldn't be loaded: the CDB code hasn't changed in a long time, the class is in the right place, and in my local setup the Composer class map does include it. There is also no obvious reason why the file couldn't be opened. So I'd assume some sort of deployment hiccup where some of the MediaWiki files were missing from some job runner.

Krinkle subscribed.

It happens on more servers, and more wikis, and over a larger time period than seems explained by a mere deployment race condition.

Screenshot 2023-07-13 at 16.18.19.png

Screenshot 2023-07-13 at 16.15.10.png

Assigning to myself for initial investigation.

Initial investigation shows that this happens relatively rarely, but exclusively as part of RefreshUserImpactJob in GrowthExperiments. This might mean the issue is in the way GrowthExperiments calls PageViewInfo, but judging by CodeSearch, the more likely explanation is that no other extension calls PageViewInfo in a job context.

Checking the logs on mwlog1002, I can see that this exception is always followed by a PHP Fatal Error: Failed opening required '/srv/mediawiki/php-1.41.0-wmf.18/includes/libs/HttpStatus.php'. Looking at the file, I can see no obvious reason why it should fail this way.

Around the exception, I can also see a bunch of errors in the PageViewInfo channel, along the lines of "Error fetching URL: Could not resolve host: localhost" and "There was a problem during the HTTP request: 503 Service Unavailable" (LogStash view). I do not understand why localhost would be unresolvable. Those errors are much more frequent and correlate with the Cdb exception reported in this task.

I see no obvious patterns in terms of which servers it impacts; I think the exception is related to cases when the service proxy is not working properly, which appears to happen fairly regularly for some reason.

Urbanecm_WMF moved this task from Inbox to Triaged on the Growth-Team board.

Most of these errors (failing to load the Exception and HttpStatus classes, failing to open the CDB file) are about file access. The remaining ones are about internet connections, which are sometimes handled in a way similar to files (e.g. sockets as virtual filesystem entries). Given that the issue seems to be limited to a job which does a ton of HTTP requests to AQS, it's probably overrunning some sort of quota that applies to open files + connections?
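To make the "files + connections" point concrete, here is a minimal Python sketch (an analogue only, not MediaWiki code; it assumes a Unix-like host, and `describe_descriptors` is a hypothetical helper name). It opens one regular file and one socket and shows that both receive numbers from the same per-process descriptor table, which is capped by RLIMIT_NOFILE.

```python
import resource
import socket
import tempfile


def describe_descriptors():
    """Open one file and one socket; report their descriptor numbers and the limit."""
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    with tempfile.TemporaryFile() as fh, socket.socket() as sock:
        # Both numbers come from the same per-process descriptor table,
        # so files and sockets count against the same soft limit.
        return {
            "file_fd": fh.fileno(),
            "socket_fd": sock.fileno(),
            "soft_limit": soft_limit,
        }


info = describe_descriptors()
print(info)
```

If the job leaks either kind of handle, the other kind would start failing too once the shared limit is reached.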

Could not resolve host: localhost is only happening on mw1437 (the jobrunner canary) so maybe that's a different issue.

Most of these errors […] are about file access. […] it's probably overrunning some sort of quota that applies to open files + connections?

Nice catch, and yes, I believe this is exactly what's happening. See also T230245, where a maintenance script that generates captcha images gets a fatal error:

Fatal error: require(/srv/mediawiki/php-1.34.0-wmf.8/includes/json/FormatJson.php): File not found in /srv/mediawiki/php-1.34.0-wmf.8/includes/AutoLoader.php on line 109

Notably, this fatal error happens as part of error reporting, after it fails to upload a file to Swift over HTTP. Both are happening for the same reason: EMFILE (Too many open files), the error a process gets once it reaches its limit on open file descriptors, which is enforced by the operating system.
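The failure mode can be reproduced in isolation. The sketch below is a Python analogue (not the PHP code path from the trace) and assumes a Unix-like host; `exhaust_descriptors` is a hypothetical helper. It temporarily lowers the soft descriptor limit, leaks handles until `open` fails, and then shows that an unrelated follow-up operation, standing in for the error handler that needs to load another file or open a connection, fails with the same errno.

```python
import errno
import os
import resource
import socket


def exhaust_descriptors(limit=64):
    """Leak descriptors under a lowered soft limit; return the errnos observed."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = limit
    if soft != resource.RLIM_INFINITY:
        new_soft = min(limit, soft)  # never raise the limit, only lower it
    leaked = []
    errors = []
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    try:
        try:
            while True:
                leaked.append(os.open(os.devnull, os.O_RDONLY))
        except OSError as exc:
            errors.append(exc.errno)  # the "original" failure
        try:
            # Stand-in for error handling that needs one more descriptor.
            socket.socket().close()
        except OSError as exc:
            errors.append(exc.errno)  # the follow-up failure
    finally:
        for fd in leaked:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
    return errors


observed = exhaust_descriptors()
print([errno.errorcode[code] for code in observed])
```

Both entries are expected to be EMFILE, which matches the pattern in this task: the first failure (opening the CDB file) is followed by a second one (loading the exception class) with the same root cause.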

In case of T230245 the problem was that there was no concurrency limit set. It was correctly uploading things in batches, but it was preparing the temp files all in parallel and keeping the file handles open.
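As an illustration of the concurrency limit described for T230245, here is a hedged Python sketch (not the actual maintenance script; `upload_in_batches` and the `upload` callback are hypothetical). Temp files are prepared one batch at a time and closed before the next batch starts, so the number of open handles never exceeds the batch size.

```python
import tempfile


def upload_in_batches(payloads, batch_size, upload):
    """Prepare and upload temp files batch by batch, closing handles promptly.

    Returns the peak number of temp files that were open at the same time.
    """
    peak = 0
    for start in range(0, len(payloads), batch_size):
        handles = []
        try:
            for data in payloads[start:start + batch_size]:
                fh = tempfile.TemporaryFile()
                handles.append(fh)
                fh.write(data)
                fh.seek(0)
            peak = max(peak, len(handles))
            for fh in handles:
                upload(fh.read())
        finally:
            # Close this batch's handles before preparing the next batch.
            for fh in handles:
                fh.close()
    return peak


sent = []
peak_open = upload_in_batches([b"a", b"b", b"c", b"d", b"e"], 2, sent.append)
print(peak_open, sent)
```

Preparing all temp files up front would instead keep one handle open per payload, which is the pattern that ran into the descriptor limit.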

I think the requests are not parallel (but Guzzle certainly has the ability to do them in parallel and it comes down to request configuration, which can get injected in a number of different ways, so I might have misread the code). It's also possible some HTTP-related code is missing a close() call.

It's also possible some HTTP-related code is missing a close() call.

Guzzle seems to handle that internally (I think these requests end up with CurlHandler and then CurlFactory::release()). It reuses curl handles and only closes open handles once it's above its max handle limit but that's very low (3 if I read the code correctly), so that's probably not it.
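The reuse behavior described above can be modeled with a toy pool. This is a Python sketch of the idea only, based on the description in this comment; it is not Guzzle's code, and `HandlePool`, `acquire`, and `release` are hypothetical names. Idle handles are kept for reuse up to a small maximum, and anything released beyond that is closed.

```python
class HandlePool:
    """Toy pool: idle handles are reused up to max_idle, extras are closed."""

    def __init__(self, max_idle=3):
        self.max_idle = max_idle
        self.idle = []
        self.created = 0
        self.closed = 0

    def acquire(self):
        if self.idle:
            return self.idle.pop()  # reuse an idle handle
        self.created += 1
        return object()  # stands in for a new connection handle

    def release(self, handle):
        if len(self.idle) >= self.max_idle:
            self.closed += 1  # stands in for closing the handle
        else:
            self.idle.append(handle)


# Sequential requests: one handle is created and reused every time.
sequential = HandlePool()
for _ in range(10):
    sequential.release(sequential.acquire())

# Five requests in flight at once: five handles exist, two are closed on release.
parallel = HandlePool()
in_flight = [parallel.acquire() for _ in range(5)]
for handle in in_flight:
    parallel.release(handle)

print(sequential.created, parallel.created, len(parallel.idle), parallel.closed)
```

Under this model, sequential requests hold at most one handle, so a pool like this would only contribute many open descriptors if many requests were in flight simultaneously.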

As part of running the train I found some errors coming from the PageViewInfo job in which the autoloader is somehow unable to open includes/libs/rdbms/exception/DBConnectionError.php or includes/libs/HttpStatus.php. I have filed it as T348614.

The trace for the missing includes/libs/HttpStatus.php also has the error Class 'Cdb\Exception' not found, which points me back to this task.

Urbanecm_WMF claimed this task.

The error stopped happening. Please see the parent task for more information.