Out of memory errors in translatewiki.net from MessageCache deferred update
Closed, Resolved (Public)

Description

[11-Apr-2021 15:28:10 UTC] PHP Fatal error:  Allowed memory size of 314572800 bytes exhausted (tried to allocate 20480 bytes) in /srv/mediawiki/tags/2021-04-07_14:43:59/includes/libs/objectcache/utils/MemcachedClient.php on line 295
[11-Apr-2021 15:28:10 UTC] PHP Fatal error:  Allowed memory size of 314572800 bytes exhausted (tried to allocate 20480 bytes) in /srv/mediawiki/tags/2021-04-07_14:43:59/includes/exception/MWExceptionHandler.php on line 302
[11-Apr-2021 16:11:31 UTC] PHP Fatal error:  Allowed memory size of 314572800 bytes exhausted (tried to allocate 20480 bytes) in /srv/mediawiki/tags/2021-04-07_14:43:59/includes/libs/objectcache/utils/MemcachedClient.php on line 295
[11-Apr-2021 16:11:31 UTC] PHP Fatal error:  Allowed memory size of 314572800 bytes exhausted (tried to allocate 20480 bytes) in /srv/mediawiki/tags/2021-04-07_14:43:59/includes/exception/MWExceptionHandler.php on line 302
[11-Apr-2021 16:11:31 UTC] PHP Fatal error:  Allowed memory size of 314572800 bytes exhausted (tried to allocate 20480 bytes) in /srv/mediawiki/tags/2021-04-07_14:43:59/vendor/wikimedia/php-session-serializer/src/Wikimedia/PhpSessionSerializer.php on line 380
[12-Apr-2021 02:21:19 UTC] PHP Fatal error:  Allowed memory size of 314572800 bytes exhausted (tried to allocate 20480 bytes) in /srv/mediawiki/tags/2021-04-07_14:43:59/includes/libs/objectcache/utils/MemcachedClient.php on line 295
[2021-04-12T02:21:19.638049+00:00] exception.ERROR: [5cec9597d1c95fe3544437e8] /wiki/Portal:Aln   PHP Fatal Error from line 295 of /srv/mediawiki/tags/2021-04-07_14:43:59/includes/libs/objectcache/utils/MemcachedClient.php: Allowed memory size of 314572800 bytes exhausted (tried to allocate 20480 bytes) {"exception":"[object] (ErrorException(code: 0): PHP Fatal Error: Allowed memory size of 314572800 bytes exhausted (tried to allocate 20480 bytes) at /srv/mediawiki/tags/2021-04-07_14:43:59/includes/libs/objectcache/utils/MemcachedClient.php:295)

Since this often seems to happen in memcached, my guess is that the message cache has become too big. Some additional logging of keys and their sizes may help.
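
As a concrete illustration, here is a minimal sketch of that kind of logging, assuming a local patch or small helper wrapped around the point where the raw memcached value is deserialized (the function name is hypothetical; the 'Translate' log channel and the placeholders $key / $rawValue are just assumptions):

use MediaWiki\Logger\LoggerFactory;

/**
 * Hypothetical helper: log the cache key and the serialized size just before
 * unserialize(), so oversized cache entries show up in the logs.
 */
function unserializeWithSizeLogging( string $key, string $rawValue ) {
	LoggerFactory::getInstance( 'Translate' )->info(
		'Trying to unserialize {key} with value of size {size}',
		[ 'key' => $key, 'size' => strlen( $rawValue ) ]
	);
	return unserialize( $rawValue );
}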

php-wmerrors simple trace:

MemcachedClient.php:295
MemcachedClient.php:1023
MemcachedClient.php:503
MemcachedPhpBagOStuff.php:65
MediumSpecificBagOStuff.php:287
MediumSpecificBagOStuff.php:267
WANObjectCache.php:901
WANObjectCache.php:1746
WANObjectCache.php:2725
MWCallableUpdate.php:38
DeferredUpdates.php:513
DeferredUpdates.php:390
DeferredUpdates.php:221
DeferredUpdatesScope.php:267
DeferredUpdatesScope.php:196
DeferredUpdates.php:242
MediaWiki.php:1062
MediaWiki.php:821
MediaWiki.php:833
MediaWiki.php:571
index.php:53
index.php:46

Event Timeline

Do you have a stack trace for this? That would help narrow down whether it's MessageCache or session storage, and if MessageCache, whether it's the localCluster use of Memc or the wanCache use of Memc.

We don't have a trace. If I am not mistaken, it's not possible to get stack traces for exceeding the memory limit without xdebug, is that right? Is there a way to configure xdebug in production that doesn't significantly degrade performance?

MediaWiki reports traces from all fatal errors afaik, including OOMs. They should end up in exception.log the same as anything else. There is a special amount of memory reserved in MWExceptionHandler just for that. If for some reason php-fpm crashes while or before PHP is able to invoke the shutdown function (which afaik normally still happens after an OOM), then you may need to install php-wmerrors to capture them.
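
For illustration, a minimal standalone sketch of that reserved-memory pattern (not MediaWiki's actual MWExceptionHandler code): allocate a buffer up front and release it in the shutdown handler, so there is headroom left to format and log the fatal error.

// Reserve a buffer up front; the thread below mentions a 16k reserve.
$GLOBALS['memoryReserve'] = str_repeat( ' ', 16384 );

register_shutdown_function( static function () {
	// Free the reserve first, so formatting the message cannot itself
	// run out of memory.
	$GLOBALS['memoryReserve'] = null;

	$error = error_get_last();
	if ( $error !== null && ( $error['type'] & ( E_ERROR | E_PARSE ) ) ) {
		error_log( sprintf(
			'Fatal: %s in %s:%d',
			$error['message'], $error['file'], $error['line']
		) );
	}
} );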

Most of the time there is no trace. If there is, it looks broken like this:

[2021-08-23T21:00:48.704485+00:00] exception.ERROR: [253141a625c71eb8863bfe59] /wiki/Portal:Nn   PHP Fatal Error from line 295 of /srv/mediawiki/tags/2021-08-18_12:50:37/includes/libs/objectcache/utils/MemcachedClient.php: Allowed memory size of 314572800 bytes exhausted (tried to allocate 20480 bytes) {"exception":"[object] (ErrorException(code: 0): PHP Fatal Error: Allowed memory size of 314572800 bytes exhausted (tried to allocate 20480 bytes) at /srv/mediawiki/tags/2021-08-18_12:50:37/includes/libs/objectcache/utils/MemcachedClient.php:295)
[stacktrace]
#0 [internal function]: MWExceptionHandler::handleFatalError()
#1 {main}
","exception_url":"/wiki/Portal:Nn","reqId":"253141a625c71eb8863bfe59","caught_by":"mwe_handler"} []

Does that mean that the exception handler itself is running out of memory?

I tried to increase the amount of reserved memory from 16k to 4*16k but it didn't change anything:

[31-Aug-2021 07:05:34 UTC] PHP Fatal error:  Allowed memory size of 314572800 bytes exhausted (tried to allocate 20480 bytes) in /srv/mediawiki/tags/2021-08-26_05:47:08/includes/libs/objectcache/utils/MemcachedClient.php on line 295
[31-Aug-2021 07:05:34 UTC] PHP Fatal error:  Allowed memory size of 314572800 bytes exhausted (tried to allocate 20480 bytes) in /srv/mediawiki/tags/2021-08-26_05:47:08/includes/exception/MWExceptionHandler.php on line 302
[31-Aug-2021 11:35:27 UTC] PHP Fatal error:  Allowed memory size of 314572800 bytes exhausted (tried to allocate 20480 bytes) in /srv/mediawiki/tags/2021-08-26_05:47:08/includes/libs/objectcache/utils/MemcachedClient.php on line 295
[31-Aug-2021 11:35:27 UTC] PHP Fatal error:  Allowed memory size of 314572800 bytes exhausted (tried to allocate 20480 bytes) in /srv/mediawiki/tags/2021-08-26_05:47:08/includes/debug/logger/MonologSpi.php on line 198

@Krinkle It feels like the memory reservation mechanism is not working?

@Nikerabbit When anything comes through to exception.log, that means the reserved memory is working, and the URL and reqId are still more than php-fpm provides on its own. But I forgot that there are no (meaningful) stack traces for unrecoverable fatal errors. The only way to report these at run-time is from a shutdown function (handleFatalError), which the engine calls after the call stack has unwound and started anew.

When we get OOMs in production (Logstash), the ones logged from MW are indeed logged post-shutdown there as well, with the same short post-shutdown trace. But each entry in prod has a sibling entry logged by php-wmerrors, which does have a full stack trace from where the OOM happened. That'd be my recommendation for tracing the call sites that go OOM.

It seems the easiest way to get php-wmerrors is to update to Debian bullseye: https://packages.debian.org/bullseye/php-wmerrors

Tentatively making this task dependent on that.

Change 723427 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[translatewiki@master] Add php-wmerrors

https://gerrit.wikimedia.org/r/723427

Change 723427 merged by jenkins-bot:

[translatewiki@master] Add php-wmerrors

https://gerrit.wikimedia.org/r/723427

WANObjectCache.php:2725 is inside the method scheduleAsyncRefresh. I am not sure what creates this deferred update, as the only other thing mentioning this method is fetchOrRegenerate in the same file, and I do not see that in the call stack.

Not sure if T288702: Reduce complexity and time spent in WANObjectCache will help; we are planning to upgrade back to the master branch next week (currently on 1.36).

In any case, it is failing on unserialize().

When regular code calls WANCache->getWithSet, and there is a cache hit, and there is less than 1 minute left before the key expires, then there is a random chance that WANCache will schedule a deferred update for you that regenerates the key.

Any error relating to this is most likely unrelated to WANCache; the same would happen when the given callback is called during a normal cache miss. It just happens to be invoked via a deferred update.
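
To make that concrete, a hedged sketch of a typical caller (not the exact translatewiki.net code; the key shape, $revisionHash, $page and regenerateExpensiveValue() are hypothetical). On a near-expiry cache hit, WANObjectCache may re-run the callback from a deferred update via scheduleAsyncRefresh rather than inline, which is why the callback's memory use can surface in a post-send deferred update as in the trace above.

use MediaWiki\MediaWikiServices;

$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();

$value = $cache->getWithSetCallback(
	// Hypothetical key, loosely modelled on the preprocess-hash keys seen later.
	$cache->makeKey( 'preprocess-hash', $revisionHash ),
	$cache::TTL_DAY,
	static function () use ( $page ) {
		// Expensive regeneration, e.g. preprocessing a very large template.
		// If the result is huge, serializing it here and unserializing it
		// on later hits both cost significant memory.
		return regenerateExpensiveValue( $page ); // placeholder
	}
);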

My debugging of large values passed to unserialize() has not been helpful so far. I only know that the value is less than 1 MB serialized.

[2021-10-24T11:41:02.684463+00:00] Translate.INFO: Trying to unserialize WANCache:translatewiki_net-bw_:preprocess-hash:523f38e50882fa9d9fec25c2ae46b622375837c0:1|#|v with value of size 723645 {"private":false} []
[2021-10-24T11:41:02.718549+00:00] Translate.INFO: Trying to unserialize WANCache:translatewiki_net-bw_:preprocess-hash:881b2da6f8b12cca58a8090bbd5db6c40bb9328d:1|#|v with value of size 3833691 {"private":false} []
[2021-10-24T11:41:03.093884+00:00] Translate.INFO: Trying to unserialize WANCache:translatewiki_net-bw_:translate-groups:v4|#|v with value of size 1082825 {"private":false} []
[2021-10-24T11:41:03.335421+00:00] Translate.INFO: Trying to unserialize WANCache:translatewiki_net-bw_:preprocess-hash:881b2da6f8b12cca58a8090bbd5db6c40bb9328d:1|#|v with value of size 3833691 {"private":false} []
[2021-10-24T11:41:03.700277+00:00] Translate.INFO: Trying to unserialize WANCache:translatewiki_net-bw_:preprocess-hash:881b2da6f8b12cca58a8090bbd5db6c40bb9328d:1|#|v with value of size 3833691 {"private":false} []
[2021-10-24T11:41:03.744788+00:00] "Error"

preprocess-hash:881b2da6f8b12cca58a8090bbd5db6c40bb9328d refers to Template:Languagename, which was a 200 KB page full of switch statements before I slightly simplified it. I am not sure if this is the cause, but unserializing that huge blob twice surely takes some memory.

After my changes the size almost halved:
Trying to unserialize WANCache:translatewiki_net-bw_:preprocess-hash:51d294a3672b571c31ebde8896ab0ab0e80c2a1c:1|#|v with value of size 2025900 {"private":false} []

Krinkle renamed this task from "Out of memory errors in translatewiki.net" to "Out of memory errors in translatewiki.net from MessageCache deferred update". Jan 21 2022, 9:48 PM
Krinkle moved this task from Backlog to MessageCache on the MediaWiki-Internationalization board.

$wgMemoryLimit can be increased. I don’t have a precise method to set it to a sensible value, except try-observe-adjust.

See also mw:Manual:Common errors and symptoms#Fatal error: Allowed memory size of X bytes exhausted (tried to allocate Y bytes).
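
For reference, a minimal LocalSettings.php sketch of that setting; the value is only illustrative, chosen to sit above the 300M limit seen in the logs above.

// LocalSettings.php: MediaWiki will raise PHP's memory_limit to at least this
// value for web requests. Illustrative value, not a recommendation.
$wgMemoryLimit = '400M';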

We have split Template:Languagename.

Nikerabbit claimed this task.

Fine by me. May reopen if this issue reoccurs.