"Lua error: not enough memory" on certain en.wiktionary pages
Closed, Resolved · Public

Description

ISSUE:
Several users and I see "Lua error: not enough memory" in "water" when all its content is present, as in https://en.wiktionary.org/w/index.php?title=water&oldid=42726961 , and recently also in "man" https://en.wiktionary.org/w/index.php?title=man&oldid=42871191 . Errors persist despite efforts to simplify Lua-using templates in ways that do not reduce functionality.

The temporary workaround has been to split "water" across two pages, and to disable script support and transliteration in the translations of "man"; this is undesirable because it reduces functionality. Many entries will one day be as complete as "water", so the issue will affect more pages if the causes are not addressed. (Indeed, in the last week another page developed an error.)

POSSIBLE SOLUTIONS:
We suspect Lua's garbage collection may have an issue: perhaps individually unintensive things, such as single templates, do not free all of their memory when they finish. If Lua tasks are not releasing memory when they are done, it would be ideal to fix that.

We also know some of our code is resource-intensive, so perhaps some errors could be solved if you could help us identify which pieces of code used on the above-linked revisions use the most memory, and how they could be streamlined without losing functionality (like automatic transliteration, etc).

Perhaps it would be possible to add "a function to create a 'cache' of language and script objects that can be used by multiple functions"; see explanation and rationale at https://en.wiktionary.org/wiki/Wiktionary:Grease_pit/2017/May#lsobjectcache .

Alternatively, perhaps it would be preferable to just raise the amount of memory that pages can use. (How would that affect mobile users?)

EXTRA DETAILS:
There is some randomness to the errors: I loaded https://en.wiktionary.org/w/index.php?title=water&oldid=42726961 and memory ran out in one place; I hard-refreshed and it ran out in another place. At the same time, there has consistently been no error in https://en.wiktionary.org/wiki/a or https://en.wiktionary.org/wiki/woman , even though "a" is larger than "man"; "a" and "woman" use more templates than "man"; and "woman" has ~2x as many translation templates (which are known to be resource-heavy and to contribute to errors) as "man".

Event Timeline

Urbanecm subscribed.

@Framawiki Are you sure this is a Wikimedia site request? I think this is related to Scribunto more than being a site request.

Anomie subscribed.

We suspect Lua's garbage collection may have an issue

"We suspect" isn't helpful. Provide actual test cases that illustrate that it's actually a problem in Scribunto rather than your module just using too much memory. A 194k page with a multitude of nested template invocations that happens to have errors isn't a test case.

I note that if I edit that revision of your 'water' page and change all the uses of {{t}}, {{t+}}, {{t-check}}, {{t-image}}, {{t-needed}}, and {{t-simple}} to use the same language for the first parameter (in my test, I used 'de') without changing anything else at all, it no longer produces errors. That also causes the page to load 90 fewer modules, mostly "Module:foo-translit". Your "garbage collection issue" might simply be that every one of those modules is cached so they don't have to be reparsed from scratch each time they're used.

Perhaps it would be possible to add "a function to create a 'cache' of language and script objects that can be used by multiple functions"; see explanation and rationale at https://en.wiktionary.org/wiki/Wiktionary:Grease_pit/2017/May#lsobjectcache .

It's not possible to create an mw.loadData with functions, since that would be an instance of T67258: Information can be passed between #invoke's (tracking).

The purpose of mw.loadData isn't to save memory, it's to save the time that would otherwise be required to reparse the data module every time.

Alternatively, perhaps it would be preferable to just raise the amount of memory that pages can use. (How would that affect mobile users?)

While raising the memory limit is possible, at some point in the future you'll probably wind up with an even-huger page that would exceed the new limit.

Another case from it.wikisource: https://it.wikisource.org/wiki/Ricordi_di_Parigi/Uno_sguardo_all%E2%80%99Esposizione

It shows up on proofread ns0 pages with multiple calls to an it.wikisource module, Modulo:Wl.

Using a test page with lots of identical calls to Modulo:Wl, the "NewPP limit report" in the page's HTML source reads "Lua memory usage: 50 MB/50 MB", i.e. Lua really used the entire amount of available memory.

The article wiktionary:si shows multiple "Lua error: not enough memory" errors from the Slovak section onward.

It is affecting this page, from the Vietnamese section onward: https://en.wiktionary.org/wiki/o

As noted by Anomie above, those "it affects this page" reports aren't very helpful. Those issues are very likely about a module written in an inefficient way, thus hitting the protective memory limits. If that's right, it's something that needs to be solved by the on-wiki developers who maintain the code.

It is possible (albeit unlikely) that this is about an actual problem in Lua or the way we integrate it into our systems. However, in order to test that theory, a small test case page that produces the error would be required.

Unless I have misunderstood, Caoimhin has provided a link to a suitable test page that reproduces this fault, as have others - consistently and across platforms in my experience. This should allow someone to narrow down where the fault is. But it is not possible to provide the small test case page that you request, because it does not happen on small pages. It happens towards the end of exceptionally long pages. That is why you will find it for any really common word (such as a or o) in languages that are near the end alphabetically, such as Vietnamese and Welsh.

We do not need more "happens also on page..." test cases.
We do welcome test cases that illustrate that it's actually a problem in Scribunto (!), and not that local modules use too much memory.

This should allow someone to narrow down where the fault is.

https://en.wiktionary.org/wiki/o loads 288 templates and 163 modules. That takes memory. :)

The problem seems to be that the version of Lua used by Scribunto does not run the garbage collector when a memory allocation fails, so memory is not reclaimed when it is needed most. I would consider that a bug in the implementation. With that behaviour it's hard to tell if we're really out of memory, or if we're just out of luck because the GC didn't run in time.

Can you tell us how you reached that conclusion?

Lua 5.2 includes an EmergencyGarbageCollector, which means that the GC always gets a chance to run when a memory allocation fails.

There is no such thing on Lua 5.1, which Scribunto is based on, so it won't try to free memory when needed. This also leads to unpredictable behaviour (pages fail, then work after a reload).
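For reference, the 5.2 behaviour amounts to a retry inside the allocator wrapper. The following is a simplified paraphrase of luaM_realloc_ from Lua 5.2's lmem.c (the real function carries extra assertions and bookkeeping); the marked block is exactly what 5.1 lacks:

    /* Simplified paraphrase of Lua 5.2's luaM_realloc_ (lmem.c). */
    void *luaM_realloc_ (lua_State *L, void *block, size_t osize, size_t nsize) {
      global_State *g = G(L);
      void *newblock = (*g->frealloc)(g->ud, block, osize, nsize);
      if (newblock == NULL && nsize > 0) {
        if (g->gcrunning) {
          luaC_fullgc(L, 1);  /* emergency collection: try to free some memory */
          newblock = (*g->frealloc)(g->ud, block, osize, nsize);  /* try again */
        }
        if (newblock == NULL)
          luaD_throw(L, LUA_ERRMEM);  /* only now raise "not enough memory" */
      }
      g->GCdebt = (g->GCdebt + nsize) - osize;  /* GC accounting (simplified) */
      return newblock;
    }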

There are plenty of signs that the GC is unpredictable and that this is making this issue practically impossible to solve on en.wikt's side without dropping Lua modules altogether.

Just today on en.wiktionary we performed an overhaul of the language data modules (which are loaded with mw.loadData), moving some fields (otherNames, varieties and aliases, all of which are tables of strings, except that varieties can also contain nested tables) to a separate set of "extra data" modules. This way, these fields, which aren't really needed most of the time, don't need to be loaded at all.

This should, under all conceivable circumstances, reduce memory usage, but on some pages it didn't. An example is https://en.wiktionary.org/wiki/fan , which is currently filled with "not enough memory" errors. Before these changes, the NewPP report had

Lua time usage: 2.516/10.000 seconds
Lua memory usage: 47996832/52428800 bytes
Lua Profile:
    Scribunto_LuaSandboxCallback::gsub                               460 ms       17.3%
    recursiveClone <mwInit.lua:41>                                   440 ms       16.5%
    ?                                                                400 ms       15.0%
    pcall                                                            240 ms        9.0%
    Scribunto_LuaSandboxCallback::getAllExpandedArguments            180 ms        6.8%
    Scribunto_LuaSandboxCallback::callParserFunction                 120 ms        4.5%
    Scribunto_LuaSandboxCallback::expandTemplate                     120 ms        4.5%
    Scribunto_LuaSandboxCallback::match                              100 ms        3.8%
    type                                                              80 ms        3.0%
    [main chunk] <package.lua>                                        60 ms        2.3%
    [others]                                                         460 ms       17.3%

but after these changes, it is now

Lua time usage: 2.480/10.000 seconds
Lua memory usage: 52428799/52428800 bytes
Lua Profile:
    recursiveClone <mwInit.lua:41>                                   420 ms       16.0%
    ?                                                                320 ms       12.2%
    Scribunto_LuaSandboxCallback::getAllExpandedArguments            240 ms        9.2%
    pcall                                                            200 ms        7.6%
    Scribunto_LuaSandboxCallback::gsub                               180 ms        6.9%
    Scribunto_LuaSandboxCallback::find                               160 ms        6.1%
    dataWrapper <mw.lua:668>                                         100 ms        3.8%
    Scribunto_LuaSandboxCallback::callParserFunction                 100 ms        3.8%
    (for generator)                                                  100 ms        3.8%
    validateData <mw.lua:724>                                         80 ms        3.1%
    [others]                                                         720 ms       27.5%

The page does not load the extra data modules (easily verified by previewing with a version of Module:languages that raises an error in loadInExtraData if anything tries to load them).

Module:languages before the overhaul: https://en.wiktionary.org/w/index.php?title=Module:languages&oldid=63434634
Module:languages after the overhaul: https://en.wiktionary.org/w/index.php?title=Module:languages&oldid=63603383
(Note that previewing with just these isn't enough; you'd also need separate versions of Module:languages/data* that use the old format. I can set up an example if desired.)

If Lua on MediaWiki can't be upgraded to 5.2 or later (T178146 is stalled, with "re-evaluation in 2024"), maybe just the GC changes could be backported to 5.1, to have at least some predictable GC behaviour?

They don't need to. It's actually a bit simpler than that. LuaSandbox implements its own wrapper for the memory allocator that is passed to Lua explicitly for keeping track of the memory limit (you've allocated 64 bytes after already allocating 50 MB - 32? no allocation for you). Right now, it's clear that LuaSandbox holds to an invariant that there is only one lua_State per php_luasandbox_alloc. Because of that, you could just solve it like this:

  1. Add two int flags, enable_gc and in_gc, to struct php_luasandbox_alloc. Initialize the first to 1 when creating a new struct php_luasandbox_alloc, and the latter to 0.
  2. Add a lua_State * field (let's call it state) to struct php_luasandbox_alloc. Initialize this in luasandbox_alloc_new_state and set it to NULL in luasandbox_alloc_delete_state.
    1. If it isn't already NULL in luasandbox_alloc_new_state, that means you have multiple Lua states per alloc. In this case, set enable_gc to 0.
  3. In luasandbox_update_memory_accounting, if we exceed the memory limit, check enable_gc and in_gc. If the first one is 1, the second one is 0, and state != NULL:
    1. Set alloc->in_gc to 1.
    2. Call lua_gc(alloc->state, LUA_GCCOLLECT, 0).
    3. If the allocation now succeeds, set alloc->in_gc to 0; if it fails, do that anyway. The check whether we hit the memory limit must happen BEFORE in_gc is set to 0, or else we will call the GC in an infinite loop.

Calling lua_gc from a lua_Alloc is completely safe given you don't end up calling it again while it's already running (hence the in_gc). GC shouldn't allocate any memory but only free it, so the GC should never fail (but even if it does, all that happens is that the GC fails and then we can return NULL from lua_Alloc safely).
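To make the steps concrete, here is a minimal C sketch of them, assuming a memory-accounting helper shaped roughly like the one described above (the real php_luasandbox_alloc struct and function signatures in LuaSandbox differ in detail):

    #include <lua.h>
    #include <stddef.h>

    struct php_luasandbox_alloc {
        size_t memory_limit;
        size_t memory_usage;
        int enable_gc;    /* step 1: 1 unless several lua_States share this alloc */
        int in_gc;        /* step 1: guards against re-entering the collector */
        lua_State *state; /* step 2: the single state owned by this alloc */
    };

    /* Step 3: returns 1 if the allocation may proceed under the limit,
     * running one emergency full collection if it would not otherwise fit. */
    static int luasandbox_update_memory_accounting(struct php_luasandbox_alloc *alloc,
                                                   size_t osize, size_t nsize)
    {
        if (nsize > osize &&
            alloc->memory_usage + (nsize - osize) > alloc->memory_limit) {
            /* Over the limit: give the collector one chance to reclaim memory,
             * unless it is disabled (multiple states) or already running. */
            if (!alloc->enable_gc || alloc->in_gc || alloc->state == NULL)
                return 0;
            alloc->in_gc = 1;
            lua_gc(alloc->state, LUA_GCCOLLECT, 0);  /* frees drive usage down */
            /* Re-check the limit BEFORE clearing in_gc, so a still-failing
             * allocation cannot recurse into another collection (step 3.3). */
            if (alloc->memory_usage + (nsize - osize) > alloc->memory_limit) {
                alloc->in_gc = 0;
                return 0;  /* still over: lua_Alloc should now return NULL */
            }
            alloc->in_gc = 0;
        }
        alloc->memory_usage = alloc->memory_usage - osize + nsize;
        return 1;
    }

During the lua_gc call, Lua releases blocks through this same allocator, which re-enters the function with nsize == 0 and drives memory_usage back down; the in_gc flag only has to stop any allocation made by the collector itself from triggering a second collection.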

Change 968013 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] Increase Lua memory limit to 100MB on Wiktionary only

https://gerrit.wikimedia.org/r/968013

I am proposing to do the following:

  • Increase the memory limit to 100MB on *.wiktionary.org
  • Run the GC more aggressively, so that the threshold memory usage for a random OOM due to a deferred GC run will be increased from 50% to about 90% of the limit (see the sketch below).

Together, these changes should give Wiktionary 3.6 times more memory, and should make out-of-memory errors more predictable.
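The GC bullet maps onto Lua's standard collector knobs. As a hypothetical illustration only (the actual change lives in LuaSandbox's configuration and may use different values): the default GC pause of 200 lets the heap double between collections, so a page can hit a 50 MB limit with only ~25 MB of live data (50%), while a pause near 110 starts the next cycle once the heap is ~10% above the live set (about 90% of the limit):

    #include <lua.h>

    /* Hypothetical sketch: make the collector start sooner and work harder. */
    static void make_gc_more_aggressive(lua_State *L)
    {
        lua_gc(L, LUA_GCSETPAUSE, 110);    /* default 200: wait for heap to double */
        lua_gc(L, LUA_GCSETSTEPMUL, 300);  /* default 200: do more work per step */
    }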

In T165935#3281460, Anomie wrote:

A 194k page with a multitude of nested template invocations that happens to have errors isn't a test case.

I disagree. It is a usable test case.

While raising the memory limit is possible, at some point in the future you'll probably wind up with an even-huger page that would exceed the new limit.

Memory usage of Wiktionary pages seems to be increasing more slowly than Moore's law, and more slowly than our PHP memory_limit, which now stands at 1400MB. So it should be fine to continue bumping it up every once in a while. It's easier to bump the memory limit than to rearchitect the site.

Change 968013 merged by jenkins-bot:

[operations/mediawiki-config@master] Increase Lua memory limit to 100MB on Wiktionary only

https://gerrit.wikimedia.org/r/968013

Mentioned in SAL (#wikimedia-operations) [2023-10-24T13:03:00Z] <samtar@deploy2002> Started scap: Backport for [[gerrit:968013|Increase Lua memory limit to 100MB on Wiktionary only (T165935)]]

Mentioned in SAL (#wikimedia-operations) [2023-10-24T13:04:28Z] <samtar@deploy2002> samtar and tstarling: Backport for [[gerrit:968013|Increase Lua memory limit to 100MB on Wiktionary only (T165935)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2023-10-24T13:10:51Z] <samtar@deploy2002> Finished scap: Backport for [[gerrit:968013|Increase Lua memory limit to 100MB on Wiktionary only (T165935)]] (duration: 07m 51s)

Just FYI, although https://en.wiktionary.org/wiki/a now loads without errors, the following (expected) log entry is generated: Parsing 'a' was slow, took 30.63 seconds

This is almost entirely down to the lite templates, which have been designed to be Lua-free. The trade-off is that they contain a lot of complex wikicode, so they take much longer to parse (and also have a very large post-expand size). Now that it's possible to use normal templates, those two issues should improve.

Can this be closed as resolved now?

It does seem to be happening again, e.g. https://en.wiktionary.org/wiki/opistho- . It's not a particularly problematic page, yet the whole page is one big "not enough memory" error.

Also see https://en.wiktionary.org/wiki/Wiktionary:Grease_pit/2024/April#Lua_error

This was an unrelated issue caused by a mutual dependency loop between two core modules (i.e. each required the other, so they got into an infinite loop trying to load each other). Although the issue was resolved very quickly, there is often a period of several days while the error and the fix propagate out across all affected pages, so the error might crop up on pages seemingly at random.