Page MenuHomePhabricator

OOM from Babel loading messages from too many languages (getSubitemList)
Closed, ResolvedPublic

Description

The following errors started to happen the 4th March around 12:00 and 15:00 the big raise:
https://logstash.wikimedia.org/goto/8d81d3f60f306106d8b954797e6b4af7

level	   	ERR
logsource	   	mw1290
message	   	PHP Fatal error:  Allowed memory size of 692060160 bytes exhausted (tried to allocate 1130496 bytes) in /srv/mediawiki/php-1.35.0-wmf.22/vendor/wikimedia/cdb/src/Reader/DBA.php on line 48
normalized_message	   	PHP Fatal error:  Allowed memory size of 692060160 bytes exhausted (tried to allocate 1130496 bytes) in /srv/mediawiki/php-1.35.0-wmf.22/vendor/wikimedia/cdb/src/Reader/DBA.php on line 48

Event Timeline

The stack trace on those jobs aren't the most useful.... You can't obviously tell which CDB it's trying to read :(

Avoid the syslog fallback entries, prefer the same entries from type:mediawik instead, which do have stacktraces.

https://logstash.wikimedia.org/goto/4e7d6782137ddb950f51247ebf18f01f

Screenshot 2020-03-11 at 17.13.02.png (982×2 px, 170 KB)

It looks most are from api.php requests on meta.wikimedia.org.

The below trace involves the wikitext Parser calling out to Babel and Translate extensions which then invoke the LocalisationCache where CDB runs out of memory.

#0 [internal function]: unknown()
#1 /srv/mediawiki/php-1.35.0-wmf.22/vendor/wikimedia/cdb/src/Reader/DBA.php(48): dba_fetch()
#2 /srv/mediawiki/php-1.35.0-wmf.22/includes/cache/localisation/LCStoreCDB.php(71): Cdb\Reader\DBA->get()
#3 /srv/mediawiki/php-1.35.0-wmf.22/includes/cache/localisation/LocalisationCache.php(386): LCStoreCDB->get()
#4 /srv/mediawiki/php-1.35.0-wmf.22/includes/cache/localisation/LocalisationCache.php(400): LocalisationCache->loadItem()
#5 /srv/mediawiki/php-1.35.0-wmf.22/includes/cache/localisation/LocalisationCache.php(325): LocalisationCache->loadSubitem()
#6 /srv/mediawiki/php-1.35.0-wmf.22/includes/cache/localisation/LocalisationCache.php(345): LocalisationCache->getSubitem()
#7 /srv/mediawiki/php-1.35.0-wmf.22/includes/cache/MessageCache.php(283): LocalisationCache->getSubitemList()
#8 /srv/mediawiki/php-1.35.0-wmf.22/includes/cache/MessageCache.php(1083): MessageCache->load()
#9 /srv/mediawiki/php-1.35.0-wmf.22/includes/cache/MessageCache.php(1010): MessageCache->getMsgFromNamespace()
#10 /srv/mediawiki/php-1.35.0-wmf.22/includes/cache/MessageCache.php(980): MessageCache->getMessageForLang()
#11 /srv/mediawiki/php-1.35.0-wmf.22/includes/cache/MessageCache.php(922): MessageCache->getMessageFromFallbackChain()
#12 /srv/mediawiki/php-1.35.0-wmf.22/includes/language/Message.php(1299): MessageCache->get()
#13 /srv/mediawiki/php-1.35.0-wmf.22/includes/language/Message.php(854): Message->fetchMessage()
#14 /srv/mediawiki/php-1.35.0-wmf.22/includes/language/Message.php(958): Message->toString()
#15 /srv/mediawiki/php-1.35.0-wmf.22/extensions/Translate/TranslateHooks.php(517): Message->plain()
#16 /srv/mediawiki/php-1.35.0-wmf.22/includes/Hooks.php(174): TranslateHooks::translateMessageDocumentationLanguage()
#17 /srv/mediawiki/php-1.35.0-wmf.22/includes/Hooks.php(202): Hooks::callHook()
#18 /srv/mediawiki/php-1.35.0-wmf.22/includes/language/LanguageNameUtils.php(206): Hooks::run()
#19 /srv/mediawiki/php-1.35.0-wmf.22/includes/language/LanguageNameUtils.php(184): MediaWiki\Languages\LanguageNameUtils->getLanguageNamesUncached()
#20 /srv/mediawiki/php-1.35.0-wmf.22/languages/Language.php(891): MediaWiki\Languages\LanguageNameUtils->getLanguageNames()
#21 /srv/mediawiki/php-1.35.0-wmf.22/extensions/Babel/includes/BabelLanguageCodes.php(106): Language::fetchLanguageNames()
#22 /srv/mediawiki/php-1.35.0-wmf.22/extensions/Babel/includes/BabelBox/LanguageBabelBox.php(85): BabelLanguageCodes::getName()
#23 /srv/mediawiki/php-1.35.0-wmf.22/extensions/Babel/includes/Babel.php(188): MediaWiki\Babel\BabelBox\LanguageBabelBox->render()
#24 /srv/mediawiki/php-1.35.0-wmf.22/extensions/Babel/includes/Babel.php(107): Babel::mGenerateContent()
#25 /srv/mediawiki/php-1.35.0-wmf.22/extensions/Babel/includes/Babel.php(45): Babel::mGenerateContentTower()
#26 /srv/mediawiki/php-1.35.0-wmf.22/includes/parser/Parser.php(3604): Babel::Render()
#27 /srv/mediawiki/php-1.35.0-wmf.22/includes/parser/Parser.php(3308): Parser->callParserFunction()
#28 /srv/mediawiki/php-1.35.0-wmf.22/includes/parser/PPFrame_Hash.php(253): Parser->braceSubstitution()
#29 /srv/mediawiki/php-1.35.0-wmf.22/includes/parser/PPTemplateFrame_Hash.php(89): PPFrame_Hash->expand()
#30 /srv/mediawiki/php-1.35.0-wmf.22/includes/parser/Parser.php(3483): PPTemplateFrame_Hash->cachedExpand()
#31 /srv/mediawiki/php-1.35.0-wmf.22/includes/parser/PPFrame_Hash.php(253): Parser->braceSubstitution()
#32 /srv/mediawiki/php-1.35.0-wmf.22/includes/parser/Parser.php(3149): PPFrame_Hash->expand()
#33 /srv/mediawiki/php-1.35.0-wmf.22/includes/parser/Parser.php(1483): Parser->replaceVariables()
#34 /srv/mediawiki/php-1.35.0-wmf.22/includes/parser/Parser.php(599): Parser->internalParse()
#35 /srv/mediawiki/php-1.35.0-wmf.22/includes/content/WikitextContent.php(368): Parser->parse()
#36 /srv/mediawiki/php-1.35.0-wmf.22/includes/content/AbstractContent.php(565): WikitextContent->fillParserOutput()
#37 /srv/mediawiki/php-1.35.0-wmf.22/includes/api/ApiParse.php(271): AbstractContent->getParserOutput()
#38 /srv/mediawiki/php-1.35.0-wmf.22/includes/api/ApiMain.php(1590): ApiParse->execute()
#39 /srv/mediawiki/php-1.35.0-wmf.22/includes/api/ApiMain.php(522): ApiMain->executeAction()
#40 /srv/mediawiki/php-1.35.0-wmf.22/includes/api/ApiMain.php(493): ApiMain->executeActionWithErrorHandling()
#41 /srv/mediawiki/php-1.35.0-wmf.22/api.php(84): ApiMain->execute()
#42 /srv/mediawiki/w/api.php(3): require()

Untagging CDB as this is (most likely) not a bug with the CDB reader software itself. Tagging Babel, Translate and core LocalisationCache instead which are more likely suspects.

Reading out a localisation file with CDB takes up a certain amount of memory, but not so much that it would OOM by itself (if that were the case, all of prod would be OOM'ing which it isn't by a long shot). As such, it is also entirely possible that something else in these Meta-Wiki API requests between Babel and Translate is consuming a lot more memory than before (arguably too much), and then the actual OOM might just happen here in the CDB call where it reaches the final threshold.

Not sure who to tag. Falling back to CPT for intitial triage, possibly investigation as well?

Krinkle renamed this task from PHP Fatal error: Allowed memory size bytes exhausted in /srv/mediawiki/php-1.35.0-wmf.22/vendor/wikimedia/cdb/src/Reader/DBA.php on line 48 to Meta-Wiki API reqests fatal OOM: Allowed memory size exhausted in cdb/src/Reader/DBA.php.Mar 11 2020, 5:20 PM

Untagging CDB as this is (most likely) not a bug with the CDB reader software itself. Tagging Babel, Translate and core LocalisationCache instead which are more likely suspects.

Reading out a localisation file with CDB takes up a certain amount of memory, but not so much that it would OOM by itself (if that were the case, all of prod would be OOM'ing which it isn't by a long shot). As such, it is also entirely possible that something else in these Meta-Wiki API requests between Babel and Translate is consuming a lot more memory than before (arguably too much), and then the actual OOM might just happen here in the CDB call where it reaches the final threshold.

As T207055: Replace Babel CDB files with static array files is now closed, Babel doesn't use CDB anymore... And the patch is in .22 and .23 already, so I'm guessing it's probably not Babel (directly) either...

Aklapper renamed this task from Meta-Wiki API reqests fatal OOM: Allowed memory size exhausted in cdb/src/Reader/DBA.php to Meta-Wiki API requests fatal OOM: Allowed memory size exhausted in cdb/src/Reader/DBA.php.Mar 11 2020, 7:19 PM

Translate doesn't use CDB either in Wikimedia production, but that isn't relevant. The main things I would suggest to investigate are:

  • Is the page having babel templates for many many languages?
  • Is MediaWiki\Languages\LanguageNameUtils->getLanguageNamesUncached() for some reason looping over all languages, causing Translate to trigger loading of localisation cache for all those languages to give the translated language name for qqq in those languages?
  • Is the page having babel templates for many many languages?

Looks like yes, both https://meta.wikimedia.org/wiki/User:%D7%91%D7%B2%D6%B7_%D7%9E%D7%99%D7%A8_%D7%91%D7%99%D7%A1%D7%98%D7%95_%D7%A9%D7%99%D7%99%D7%9F?action=edit and https://meta.wikimedia.org/wiki/User:Gangleri?action=edit at least contain something along the lines of

{{#babel:aa-0| ab-0| ace-0| af-0| ak-0| als-0| am-0| an-0| ang-0| ar-0| arc-0| arz-0| as-0| ast-0| av-0| ay-0| az-0| azb-0| ba-0| bar-0| bat-smg-0| bcl-0| be-0| be-x-old-0| bg-0| bh-0| bi-0| bjn-0| bm-0| bn-0| bo-0| bpy-0| br-0| bs-0| bug-0| bxr-0| ca-0| cbk-zam-0| cdo-0| ce-0| ceb-0| ch-0| cho-0| chr-0| chy-0| ckb-0| co-0| cr-0| crh-0| cs-0| csb-0| cu-0| cv-0| cy-0| da-0| de-0| diq-0| dsb-0| dv-0| dz-0| ee-0| el-0| eml-0| en-0| eo-0| es-0| et-0| eu-0| ext-0| fa-0| ff-0| fi-0| fiu-vro-0| fj-0| fo-0| fr-0| frp-0| frr-0| fur-0| fy-0| ga-0| gag-0| gan-0| gd-0| gl-0| glk-0| gn-0| gom-0| got-0| gu-0| gv-0| ha-0| hak-0| haw-0| he-0| hi-0| hif-0| ho-0| hr-0| hsb-0| ht-0| hu-0| hy-0| hz-0| ia-0| id-0| ie-0| ig-0| ii-0| ik-0| ilo-0| io-0| is-0| it-0| iu-0| ja-0| jbo-0| jv-0| ka-0| kaa-0| kab-0| kbd-0| kg-0| ki-0| kj-0| kk-0| kl-0| km-0| kn-0| ko-0| koi-0| kr-0| krc-0| ks-0| ksh-0| ku-0| kv-0| kw-0| ky-0| la-0| lad-0| lb-0| lbe-0| lez-0| lg-0| li-0| lij-0| lmo-0| ln-0| lo-0| lrc-0| lt-0| ltg-0| lv-0| mai-0| map-bms-0| mdf-0| mg-0| mh-0| mhr-0| mi-0| min-0| mk-0| ml-0| mn-0| mo-0| mr-0| mrj-0| ms-0| mt-0| mus-0| mwl-0| my-0| myv-0| mzn-0| na-0| nah-0| nap-0| nds-0| nds-nl-0| ne-0| new-0| ng-0| nl-0| nn-0| no-0| nov-0| nrm-0| nso-0| nv-0| ny-0| oc-0| om-0| or-0| os-0| pa-0| pag-0| pam-0| pap-0| pcd-0| pdc-0| pfl-0| pi-0| pih-0| pl-0| pms-0| pnb-0| pnt-0| ps-0| pt-0| qu-0| rm-0| rmy-0| rn-0| ro-0| roa-rup-0| roa-tara-0| ru-0| rue-0| rw-0| sa-0| sah-0| sc-0| scn-0| sco-0| sd-0| se-0| sg-0| sh-0| si-0| simple-0| sk-0| sl-0| sm-0| sn-0| so-0| sq-0| sr-0| srn-0| ss-0| st-0| stq-0| su-0| sv-0| sw-0| szl-0| ta-0| te-0| tet-0| tg-0| th-0| ti-0| tk-0| tl-0| tn-0| to-0| tpi-0| tr-0| ts-0| tt-0| tum-0| tw-0| ty-0| tyv-0| udm-0| ug-0| uk-0| ur-0| uz-0| ve-0| vec-0| vep-0| vi-0| vls-0| vo-0| wa-0| war-0| wo-0| wuu-0| xal-0| xh-0| xmf-0| yi-0| yo-0| za-0| zea-0| zh-0| zh-classical-0| zh-min-nan-0| zh-yue-0| zu-0}}

Previewing with that present gives an OOM, while previewing with it removed renders normally.

Further, copying just that to a sandbox page gives an OOM when previewed.

So far I've tracked it down to the fact that Babel winds up doing wfMessage( ... )->inLanguage( $code )->text() for each of those codes, which winds up creating a Language object for each code and loading something that's apparently in-process cached without a limit.

This seems to be confirmed by using shell.php to do something similar:

>>> $m0 = memory_get_usage(); foreach ( [ 'aa','ab','ace','af','ak','als','am','an','ang','ar','arc','arz','as','ast','av','ay','az','azb','ba','bar','bat-smg','bcl','be','be-x-old','bg','bh','bi','bjn','bm','bn','bo','bpy','br','bs','bug','bxr','ca','cbk-zam','cdo','ce','ceb','ch','cho','chr','chy','ckb','co','cr','crh','cs','csb','cu','cv','cy','da','de','diq','dsb','dv','dz','ee','el','eml','en','eo','es','et','eu','ext','fa','ff','fi','fiu-vro','fj','fo','fr','frp','frr','fur','fy','ga','gag','gan','gd','gl','glk','gn','gom','got','gu','gv','ha','hak','haw','he','hi','hif','ho','hr','hsb','ht','hu','hy','hz','ia','id','ie','ig','ii','ik','ilo','io','is','it','iu','ja','jbo','jv','ka','kaa','kab','kbd','kg','ki','kj','kk','kl','km','kn','ko','koi','kr','krc','ks','ksh','ku','kv','kw','ky','la','lad','lb','lbe','lez','lg','li','lij','lmo','ln','lo','lrc','lt','ltg','lv','mai','map-bms','mdf','mg','mh','mhr','mi','min','mk','ml','mn','mo','mr','mrj','ms','mt','mus','mwl','my','myv','mzn','na','nah','nap','nds','nds-nl','ne','new','ng','nl','nn','no','nov','nrm','nso','nv','ny','oc','om','or','os','pa','pag','pam','pap','pcd','pdc','pfl','pi','pih','pl','pms','pnb','pnt','ps','pt','qu','rm','rmy','rn','ro','roa-rup','roa-tara','ru','rue','rw','sa','sah','sc','scn','sco','sd','se','sg','sh','si','simple','sk','sl','sm','sn','so','sq','sr','srn','ss','st','stq','su','sv','sw','szl','ta','te','tet','tg','th','ti','tk','tl','tn','to','tpi','tr','ts','tt','tum','tw','ty','tyv','udm','ug','uk','ur','uz','ve','vec','vep','vi','vls','vo','wa','war','wo','wuu','xal','xh','xmf','yi','yo','za','zea','zh','zh-classical','zh-min-nan','zh-yue','zu' ] as $code ) { Language::fetchLanguageNames( $code, 'all' ); } echo memory_get_usage() - $m0;
22716840⏎

>>> $m0 = memory_get_usage(); foreach ( [ 'aa','ab','ace','af','ak','als','am','an','ang','ar','arc','arz','as','ast','av','ay','az','azb','ba','bar','bat-smg','bcl','be','be-x-old','bg','bh','bi','bjn','bm','bn','bo','bpy','br','bs','bug','bxr','ca','cbk-zam','cdo','ce','ceb','ch','cho','chr','chy','ckb','co','cr','crh','cs','csb','cu','cv','cy','da','de','diq','dsb','dv','dz','ee','el','eml','en','eo','es','et','eu','ext','fa','ff','fi','fiu-vro','fj','fo','fr','frp','frr','fur','fy','ga','gag','gan','gd','gl','glk','gn','gom','got','gu','gv','ha','hak','haw','he','hi','hif','ho','hr','hsb','ht','hu','hy','hz','ia','id','ie','ig','ii','ik','ilo','io','is','it','iu','ja','jbo','jv','ka','kaa','kab','kbd','kg','ki','kj','kk','kl','km','kn','ko','koi','kr','krc','ks','ksh','ku','kv','kw','ky','la','lad','lb','lbe','lez','lg','li','lij','lmo','ln','lo','lrc','lt','ltg','lv','mai','map-bms','mdf','mg','mh','mhr','mi','min','mk','ml','mn','mo','mr','mrj','ms','mt','mus','mwl','my','myv','mzn','na','nah','nap','nds','nds-nl','ne','new','ng','nl','nn','no','nov','nrm','nso','nv','ny','oc','om','or','os','pa','pag','pam','pap','pcd','pdc','pfl','pi','pih','pl','pms','pnb','pnt','ps','pt','qu','rm','rmy','rn','ro','roa-rup','roa-tara','ru','rue','rw','sa','sah','sc','scn','sco','sd','se','sg','sh','si','simple','sk','sl','sm','sn','so','sq','sr','srn','ss','st','stq','su','sv','sw','szl','ta','te','tet','tg','th','ti','tk','tl','tn','to','tpi','tr','ts','tt','tum','tw','ty','tyv','udm','ug','uk','ur','uz','ve','vec','vep','vi','vls','vo','wa','war','wo','wuu','xal','xh','xmf','yi','yo','za','zea','zh','zh-classical','zh-min-nan','zh-yue','zu' ] as $code ) { @wfMessage( "mainpage" )->inLanguage( $code )->text(); } echo memory_get_usage() - $m0;
743481080⏎

Fetching the language names for each language eats 22M, while parsing one simple message eats another 743M. Locally without Translate installed (and without a bunch of other things installed too, and different configuration for l10n cache), I get 440K for fetching the names and another 495M for the message parsing.

It's not the Language objects themselves, locally when I added some debugging to Language::__destruct() it showed expected destructions occurring.

So far all the older versions of core and of Babel that I've tried have still been subject to OOM when trying to parse the wikitext in T247223#5972938, which makes me suspect this is due to someone starting to repeatedly try to parse these pages rather than to a recent code change.

I need to move on to looking at other things now, but hopefully this helps others in further tracking down the underlying issue here.

T43103: initialization of the Language object is very heavy supports the notion that this problem has existed for a long time. Like said here and there, it's not the Language object itself, but maybe something related to caching of messages (either the l10n cache, which recently moved from CDB to PHP, or the message cache).

I don't that closing this as duplicate makes sense. This bug has more information and has the Wikimedia-production-error tag which the other one doesn't.

Krinkle reopened this task as Open.EditedApr 1 2021, 9:47 PM

Indeed, not a duplicate. So long as holding language objects is expensive, it is the responsibility of callers not to create many of them in the same request. That appears to have happened in the case of Babel, which this task is about. If making them cheaper is considered the more feasible or worthwhile fix for Babel, then I suppose we could make T43103 a subtask.

Krinkle renamed this task from Meta-Wiki API requests fatal OOM: Allowed memory size exhausted in cdb/src/Reader/DBA.php to Some Meta-Wiki user pages fatal due to OOM from Babel loading too many localisation caches (from cdb/src/Reader/DBA.php).Apr 1 2021, 9:48 PM
daniel subscribed.

Another quick look at this issue indicates that it's probably due to loading all names of all languages in all languages from CLDR. Needs some more investigation, so I'm scheduling this for roadmapping.

No I don't think it has anything to do with CLDR, which doesn't even use CDB files.

Like Anomie wrote before, it's most likely a call of wfMessage( .. )->inLanguage( X ) being repeated for many different values of X. There are such calls at least in Translate (see T247223#5960855). This triggers loading of the localisation cache (and message cache) for many languages and that causes memory limit to be exceeded. This seems weird because we are just asking one message for each language, and CDB files are supposed to allow efficient access to single keys.

I reproduced this locally. It's necessary to set $wgLocalisationCacheConf['manualRecache'] = true;, otherwise memory usage may be dominated by recache(), which is not an issue that affects production.

One issue, seen in the backtrace above, is that BabelLanguageCodes::getName() is called with $language = null, meaning an autonym, but it then calls Language::fetchLanguageNames() with the specific desired code instead of LanguageNameUtils::AUTONYMS, figuring that the end result will still be the same since it throws away all the non-autonym results. That's what triggers Translate to load messages in many different languages.

But with that fixed, I still see an OOM due to the wfMessage(...)->inLanguage($code)->text() calls in LanguageBabelBox::getText().

I noticed that, from eval.php, a loop of wfMessage('babel-0')->inLanguage($code)->text() uses a large amount of memory, whereas a loop of LocalisationCache::getSubitem($code, 'messages', 'babel-0') does not. The culprit seems to be getSubitemList(), which is also indicated by the backtrace above. A loop of getSubitemList() does use a lot of memory. The array has about 10,000 elements per language.

I'll update the task description per this new theory.

tstarling renamed this task from Some Meta-Wiki user pages fatal due to OOM from Babel loading too many localisation caches (from cdb/src/Reader/DBA.php) to OOM from Babel loading messages from too many languages (getSubitemList).May 11 2021, 5:52 AM

Change 688915 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] Get rid of the cache split in MessageCache::isMainCacheable()

https://gerrit.wikimedia.org/r/688915

Change 688915 merged by jenkins-bot:

[mediawiki/core@master] Optimise MessageCache::isMainCacheable() for the single-message case

https://gerrit.wikimedia.org/r/688915

tstarling claimed this task.

Change 698538 had a related patch set uploaded (by Reedy; author: Tim Starling):

[mediawiki/core@REL1_36] Optimise MessageCache::isMainCacheable() for the single-message case

https://gerrit.wikimedia.org/r/698538

Change 698539 had a related patch set uploaded (by Reedy; author: Tim Starling):

[mediawiki/core@REL1_35] Optimise MessageCache::isMainCacheable() for the single-message case

https://gerrit.wikimedia.org/r/698539

Change 698539 merged by jenkins-bot:

[mediawiki/core@REL1_35] Optimise MessageCache::isMainCacheable() for the single-message case

https://gerrit.wikimedia.org/r/698539

Change 698538 merged by jenkins-bot:

[mediawiki/core@REL1_36] Optimise MessageCache::isMainCacheable() for the single-message case

https://gerrit.wikimedia.org/r/698538