Lua profiling shows that most of the CPU time is spent in calls to the local function recursiveClone().
These calls come from the recursive cloning of the global table _G each time a new execution environment is created; see mw.clone( _G ) at the beginning of mw.executeModule().
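To illustrate why this is costly, here is a minimal sketch of a memoizing deep copy in the style of recursiveClone(). It is not the actual Scribunto source: cloneWithCount and the small env table are hypothetical names used only for this example, and the call counter is added purely for illustration. It shows that the helper runs once per value reachable from the root table, so the cost grows with the size of the table being cloned.

```lua
-- Simplified sketch of a deep copy with cycle handling, instrumented
-- with a call counter. cloneWithCount is a hypothetical name.
local function cloneWithCount( root )
	local seen = {}   -- maps each original table to its copy (handles cycles)
	local calls = 0   -- number of recursiveClone() invocations

	local function recursiveClone( val )
		calls = calls + 1
		if type( val ) ~= 'table' then
			-- Non-table values (functions, strings, numbers) are returned as-is
			return val
		end
		if seen[val] ~= nil then
			-- Already copied: reuse the copy so circular references survive
			return seen[val]
		end
		local copy = {}
		seen[val] = copy
		for key, elt in pairs( val ) do
			copy[key] = recursiveClone( elt )
		end
		local mt = getmetatable( val )
		if mt then
			setmetatable( copy, recursiveClone( mt ) )
		end
		return copy
	end

	return recursiveClone( root ), calls
end

-- Hypothetical, tiny stand-in for a global table; the real _G is much larger
local env = {
	string = { len = string.len, sub = string.sub },
	math = { floor = math.floor },
	tostring = tostring,
}
env._G = env   -- self-reference, as in a real global table

local copy, calls = cloneWithCount( env )
print( calls )   -- prints 8
```

Even this four-entry table needs 8 calls; a full global table with all its library subtables needs correspondingly more, and that work is repeated for every new execution environment.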
This seems to account for the major part of the "Lua call overhead", i.e. the fixed cost paid whenever {{#invoke:}} is used, regardless of what the module does. In the vast majority of cases, the actual Lua code of the userland module takes much less CPU time than this overhead.
So, does anyone have an idea whether, and how, this could be improved?