Page MenuHomePhabricator

Could we optimize Lua environment setup to improve {{#invoke:}} performance?
Open, Needs TriagePublic

Description

Lua profiling shows that the most CPU time consuming are the calls to the recursiveClone() local function.

This comes from the recursive cloning of the _G global table, when creating a new execution environment, see the mw.clone( _G ) at the beginning of mw.executeModule().

That seems to be the major part of the "Lua call overhead", i.e. the incompressible time when {{#invoke:}} is used. In the vast majority of cases, the actual Lua code from the userland module takes much less CPU time than this overhead.

Thus, any idea if/how this could be improved?

Event Timeline

From my tests, the speed improves by about 10% if the check for non-tables and already-seen values is in the for-loop, rather than at the start of the recursive call. I've also minimised the number of table accesses.

function mw.clone( val )
    if type( val ) ~= "table" then
        return val
    end
	
    local tableRefs = {}
    local function recursiveClone( val )
        local retVal = {}
        tableRefs[val] = retVal

        local mt = getmetatable( val )
        if mt then
            setmetatable( retVal, recursiveClone( mt ) )
        end

        for key, elt in next, val do
            if type(elt) ~= "table" then
                retVal[key] = elt
            else
              local seen = tableRefs[elt]
              retVal[key] = seen ~= nil and seen or recursiveClone(elt)
            end
        end
        return retVal
    end
	
    return recursiveClone( val )
end
}

Note that this makes an initial type check necessary, in case a non-table value is passed into the main function.

So there is some room for improvement, thank you for investigating this.

I see, rather than calling recursiveClone() on each value, which in turn calls type() == table, you do the type() == table filtering directly inside the loop, thus saving the overhead of calling recursiveClone() function for every non‑table value.

Other optimizations in your code I noticed:

  • Replace pairs() with for a for ... in next ... construct. This construct quite surprised me, here is a topic on Stack Overflow that may help understand it better.
  • Don't call getmetatable() twice, by using a local variable.
  • Don't access tableRefs[elt] twice, by using a local variable.

I should have mentioned why I chose for key, elt in next, val do: it's because it avoids any __pairs metamethod, which would prevent a true clone of the table being created. This wouldn't work with data loaded via mw.loadData, but (a) mw.clone already throws an error if you pass data loaded via mw.loadData into it, and (b) even if it didn't, it defeats the purpose of mw.loadData to make a local clone of the data, so you may as well just load it via require in the first place.

Here's a version that's about 15% faster than the original:

  • It avoids generating a new closure every time the main function is called.
  • tableRefs isn't an upvalue, so access is faster.
  • Avoids any global variables.
do
    local getmetatable = getmetatable
    local next = next
    local setmetatable = setmetatable
    local type = type
	
    local function recursiveClone( val, tableRefs )
        local retVal = {}
        tableRefs[val] = retVal

        local mt = getmetatable( val )
        if mt then
            setmetatable( retVal, recursiveClone( mt, tableRefs ) )
        end

        for key, elt in next, val do
            if type( elt ) ~= "table" then
                retVal[key] = elt
            else
                local seen = tableRefs[elt]
                retVal[key] = seen ~= nil and seen or recursiveClone( elt, tableRefs )
            end
        end
        return retVal
    end

    function mw.clone( val )
        if type( val ) ~= "table" then
            return val
        end
        return recursiveClone( val, {} )
    end
end

I've also noticed recursiveClone having a big impact on certain pages (e.g. the profiler says it takes 1160ms on the largest English Wiktionary page, a), so it would be good to make it as fast as possible.

  • The global lookups to getmetatable, etc. functions are cached so the (numerous) recursiveClone() calls avoid them. Considering the gigantic number of recursiveClone() calls, that's certainly the biggest optimization of your new variant.
  • And by using the do...end block, mw.clone() calls (and they add up, at least one per {{#invoke}}) benefit from this lookup cache as well (instead of creating a new set of caching variables every time).
  • Nice one too with the scope of tableRefs (refs The Implementation of Lua 5.0 [PDF] - Sections 5: Function and Closures)

When I created this ticket, I thought the current function could not be optimized further, and actually I was in hope that someone would figure out a different paradigm to avoid the extensive cloning.

But as the function can be optimized for a 15 % gain, that's already a huge improvement for Lua performance on the wikis.

While doing some random test, I noticed the 4 assignments local getmetatable = getmetatable, etc. actually make the code slower.
What makes the code slower is not the assignment, but the access to the local variable instead of the global function. It's as if the Lua engine had some internal optimization for these global accesses.

Is there a reason not to submit a patch, summarising the improvements above, immediately?

Is there a reason not to submit a patch, summarising the improvements above, immediately?

Looking back at this and giving it a rethink, the use of next, val would actually be changing the spec of the function (which could potentially break existing implementations which rely on cloning with metamethods). Unless there is consensus for that change in the spec, then the submitted version should use pairs( val ) instead. My personal opinion would be to support that change, but it's a separate issue to the one under discussion in this thread, so should be saved for another time.

Other than that, I'd be happy for it to be submitted as a patch. (Also note that pairs replaces next as one of the local variables at the top of the block.)

do
    local getmetatable = getmetatable
    local pairs = pairs
    local setmetatable = setmetatable
    local type = type
	
    local function recursiveClone( val, tableRefs )
        local retVal = {}
        tableRefs[val] = retVal

        local mt = getmetatable( val )
        if mt then
            setmetatable( retVal, recursiveClone( mt, tableRefs ) )
        end

        for key, elt in pairs( val ) do
            if type( elt ) ~= "table" then
                retVal[key] = elt
            else
                local seen = tableRefs[elt]
                retVal[key] = seen ~= nil and seen or recursiveClone( elt, tableRefs )
            end
        end
        return retVal
    end

    function mw.clone( val )
        if type( val ) ~= "table" then
            return val
        end
        return recursiveClone( val, {} )
    end
end

Another idea I had for this is to make most of the data in mw.site available via a metatable, in the same way mw.loadData is. Removing mw.site and package.loaded["mw.site"] from _G speeds up mw.clone(_G) by about 4.5 times, which is a major reduction, and there's no reason for anyone to be writing to those tables anyway.

There are a handful of functions, so it wouldn't be as simple as loading mw.site itself that way, but it shouldn't be too difficult.

So the mw.site.* calls would be more expensive, but the improved bootstrap time would largely compensate for it. That's another interesting idea.

What would be cool is, for the various optimizations posted on this thread, to separately measure how much they improve. I had posted above that one of the optimizations might be counterproductive, and it may be worth checking the other ones. But more generally, I'd be interested to see if some optimizations are responsible for most of the gains.

While doing some random test, I noticed the 4 assignments local getmetatable = getmetatable, etc. actually make the code slower.
What makes the code slower is not the assignment, but the access to the local variable instead of the global function. It's as if the Lua engine had some internal optimization for these global accesses.

I couldn't replicate this in ZeroBrane - the difference was essentially negligible, but slightly favouring the version which contained local getmetatable = getmetatable.

Change #1030604 had a related patch set uploaded (by Theknightwho; author: Theknightwho):

[mediawiki/extensions/Scribunto@master] Optimisations for mw.clone

https://gerrit.wikimedia.org/r/1030604

I couldn't replicate this in ZeroBrane - the difference was essentially negligible, but slightly favouring the version which contained local getmetatable = getmetatable.

I was also thinking about this, if some of the optimizations don't bring a significant improvement, maybe they should be discarded rather than complexifying the code.

I had measured the performances on Lua 5.1, to match the version used on Wikipedia. I hope Wikipedia will update Lua some time, but for now we're stuck on 5.1. Did you run your benchmarks on this version?

Change #1030604 merged by jenkins-bot:

[mediawiki/extensions/Scribunto@master] Optimisations for mw.clone

https://gerrit.wikimedia.org/r/1030604