If a string exceeds about 8000 codepoints, function mw.ustring.gcodepoint causes a stack overflow
https://de.wiktionary.org/wiki/Modul:User:Formatierer/String (function p.longString)
|mediawiki/extensions/Scribunto : master||Ustring: Let gcodepoint work with moderately long strings|
A somewhat more useful test case:
local s = "" for i = 0, 799 do s = s .. "0123456789" end mw.ustring.codepoint(s,1,8000)
The "stack overflow" itself isn't all that much of a concern, though. Lua's string module can do the same thing:
local s = "" for i = 0, 799 do s = s .. "0123456789" end string.byte(s,1,8000)
gives an error "stack overflow (string slice too long)" as well. I can't think of any great way to improve the error message for mw.ustring.codepoint, as it would basically require unpacking the array twice, once inside a pcall to detect the error condition and the second not to actually return the value.
gcodepoint, however, shouldn't suffer from this issue and should be fixed.
In MediaWiki 1.29.0-wmf.7 this now works for strings longer than 8000 Codepoints, but the implementation still seems to be a bit inefficient, because now a string longer than about 30000 - 35000 codepoints causes an error due to cpu time limitations.
At some point it's just going to take more CPU time to process than Scribunto allows. But if you want to try to make Scribunto's code more efficient you're welcome to contribute.
On the Lua side it starts here. The gcodepoint_init function it calls is in PHP, although that calls the PHP method just above to do most of the work. Maybe it'd be more efficient to call into PHP to fetch the one codepoint for each iteration (or to fetch batches of a few tens or hundreds) instead of generating the whole array at once.
You could also look into the luasandbox PHP extension to see if you can improve efficiency there, particularly the code that turns a PHP array into a Lua table used when returning the array of codepoints to Lua (there's also the code that passes the string from Lua to PHP if you want, but that's very straightforward).
Ok, challenge accepted. I think the inefficency comes from the table.remove function. The longer the array is, more elements have to be shifted to the front of the array after removing the first element. As you said returning a single codepoint on every call without constructing the table would do the job. But you can also use a counter for returning the single elements from the array and free the memory of the array in a last step. Something like this.
local function php_gcodepoint( s, i, j ) checkType( 'gcodepoint', 1, s, 'string' ) checkType( 'gcodepoint', 2, i, 'number', true ) checkType( 'gcodepoint', 3, j, 'number', true ) local cp = gcodepoint_init( s, i, j or -1 ) local pos = 0 local cp1 return function () cp1 = cp[pos] pos = pos + 1 -- free memory if (pos >= #cp) then cp = nil end return cp1 end end
I think you're on to something there. You should put a patch for that into Gerrit so I can review it, see https://www.mediawiki.org/wiki/Gerrit/Getting_started for help getting started. I could do it, but then I would need to find someone else to review it and that can take a while sometimes.
A few notes to get started:
- I'd make cp1 a local inside the returned function instead of an upvalue, and give it a better name (even tmp would be better IMO).
- cp is 1-based like other Lua sequence-tables, so cp isn't a valid index and you're ending one too early.
- You'll need to add a check for cp == nil and return nil in that case.
- Code style: Indent with 1 tab, and omit the optional parentheses in the if condition.
I can wait. Following code will do it and let the garbage collector free the memory automatically.
local function php_gcodepoint( s, i, j ) checkType( 'gcodepoint', 1, s, 'string' ) checkType( 'gcodepoint', 2, i, 'number', true ) checkType( 'gcodepoint', 3, j, 'number', true ) local cp = gcodepoint_init( s, i, j or -1 ) local pos = 0 return function () if cp == nil then return nil end pos = pos + 1 return cp[pos] end end