
Lookup of cached entities by mw.wikibase.getEntity in Wikipedia as a client of Wikidata is very slow
Open, Needs Triage, Public

Description

The user Dipsacus_fullonum noticed this and made a quick fix for mw.wikibase.getEntity() in w:no:Modul:External links.

Setting aside the merits of the module itself, this function is awfully slow even when the entity is cached. Some numbers from the thread at Moduldiskusjon:External links#Funktionen mw.wikibase.getEntity er meget langsom ("the function mw.wikibase.getEntity is very slow"). The numbers come from checks run in the console:

timing=require 'Module:Timing'
=timing.count(1)
=timing.sets(1)
=timing(mw.wikibase.getEntity, 'Q28614')
=timing(mw.wikibase.getEntity, 'Q28614')

The first timing check gives:

Each call was running for about 8.3e-02 seconds.
	Mean runtime for each set was 8.3e-02 seconds,
	with standard deviation of 0.0e+00 seconds,
	minimum 0.0e+00, maximum 0.0e+00.
	Total time spent was about 8.3e-02 seconds.
Relative load is estimated to inf.
The second timing check gives:
Each call was running for about 1.0e-02 seconds.
	Mean runtime for each set was 1.0e-02 seconds,
	with standard deviation of 0.0e+00 seconds,
	minimum 0.0e+00, maximum 0.0e+00.
	Total time spent was about 1.0e-02 seconds.
Relative load is estimated to inf.

A new run after a reload, with 10 sets of 10 calls (100 calls in total):

timing=require 'Module:Timing'
=timing.count(10)
=timing.sets(10)
=timing(mw.wikibase.getEntity, 'Q28614')

Gives

Each call was running for about 1.1e-02 seconds.
	Mean runtime for each set was 1.1e-01 seconds,
	with standard deviation of 2.5e-02 seconds,
	minimum 0.0e+00, maximum 2.0e-05.
	Total time spent was about 1.1e+00 seconds.
Relative load is estimated to 57,461.0.

That is bad: it implies the cached version runs in about 10 ms while the uncached version runs in about 80 ms. Something in the code is doing this wrong, as a cached lookup should take closer to 1 ms, or preferably well below that. Ideally it would just intercept the call and return a reference to the previously loaded instance.

(Without checking this, I somehow remember [note that I could be wrong!] that the code creates a deep copy. It should probably create a non-mutable object instead. Such objects are slightly harder to work with, but they would make it possible to return a simple reference instead of a copy. A second argument could be used for cases where a real copy is needed.)
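For illustration, here is a simplified recursive deep clone in the spirit of mw.clone. This is a sketch, not the actual Scribunto implementation (which also copies metatables); it only shows why every cached getEntity() call pays for a full walk of the entity table:

```lua
-- Simplified sketch of a recursive deep clone, similar in spirit to
-- Scribunto's mw.clone. Every "cached" lookup that clones pays for
-- visiting every key and value in the entity table.
local function deepClone( val, seen )
	if type( val ) ~= 'table' then
		return val
	end
	seen = seen or {}
	if seen[val] then
		return seen[val] -- preserve shared and cyclic references
	end
	local copy = {}
	seen[val] = copy
	for k, v in pairs( val ) do
		copy[deepClone( k, seen )] = deepClone( v, seen )
	end
	return copy
end

-- Hypothetical, tiny entity table just for demonstration
local entity = { id = 'Q28614', claims = { P31 = { { mainsnak = 'Q5' } } } }
local c = deepClone( entity )
c.claims.P31[1].mainsnak = 'changed'
print( entity.claims.P31[1].mainsnak ) --> Q5 (the cache is safe, but we copied everything)
```

The copy does protect the cache from modification, but the cost grows with the size of the entity, which explains why even the "cached" call takes about 10 ms.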

Event Timeline

Restricted Application added subscribers: Danmichaelo, Aklapper.
jeblad renamed this task from Lookup of cached entities by mw.wikibase.getEntity in a client of Wikidata is very slow to Lookup of cached entities by mw.wikibase.getEntity in Wikipedia as a client of Wikidata is very slow. · Sep 2 2018, 4:25 PM
jeblad updated the task description. (Show Details)

Looking at the source code of mw.wikibase.lua, we see:

-- Use a deep clone here, so that people can't modify the cached entity
return entityModule.create( mw.clone( entity ) )

Using git log -S'clone here' I was able to find a commit by @hoo that added the cloning and linked to T76946, which tells us the motivation behind the change: making the cached tables immutable, so that one #invoke cannot have side effects that influence the next #invoke.
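To illustrate the problem the clone solves, here is a minimal sketch (a hypothetical cache and lookup, not the actual Scribunto code) of what happens if the cached table is returned by reference:

```lua
-- Hypothetical cache that hands out plain references. If two #invoke
-- calls get the same table, a mutation in the first leaks into the second.
local cache = { Q28614 = { P31 = 'Q5' } }

local function getEntity( id )
	return cache[id] -- no clone, no protection
end

local e1 = getEntity( 'Q28614' )
e1.P31 = 'muhaha' -- "side effect" inside the first #invoke
local e2 = getEntity( 'Q28614' )
print( e2.P31 ) --> muhaha: the change leaked into the next lookup
```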

I guess hoo decided not to make the cached tables read-only using metatables because rawset() can bypass that protection. I remembered that mw.loadData uses read-only tables, so I wondered how it deals with rawset. I found rELUAd9c7d5f971f1, a patch that makes it possible to modify the table returned by mw.loadData without using rawset and without introducing cross-invoke leakage.
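For illustration, here is a naive metatable-based read-only wrapper and the rawset() loophole (a sketch, not the actual mw.loadData code):

```lua
-- A naive read-only wrapper: reads go through __index, writes raise an error.
local function readOnly( data )
	return setmetatable( {}, {
		__index = data,
		__newindex = function () error( 'table is read-only' ) end,
	} )
end

local cached = { P31 = 'Q5' }
local view = readOnly( cached )

local ok = pcall( function () view.P31 = 'muhaha' end )
print( ok ) --> false: a normal assignment is blocked by __newindex

-- rawset() writes directly into the proxy table, skipping __newindex.
-- If the same proxy were shared across #invoke calls, this change would
-- be visible to the next one.
rawset( view, 'P31', 'muhaha' )
print( view.P31 ) --> muhaha: the protection is gone
```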

Below is a proof-of-concept based on the patch. Please don't ask me how dataWrapper works.

-- This creates a copy-on-write dummy table for accessing the real data.
-- Reads pass through to the real data; writes are stored locally in the
-- wrapper and never touch the underlying table.
--
-- @param data table Data to access
-- @param seen table|nil Table of already-seen tables, mapping them to their wrappers.
-- @return table
local function dataWrapper( data, seen )
	local t, dummyValue, changes = {}, {}, nil
	seen = seen or { [data] = t }
	local function pairsfunc( s, k )
		k = next( changes or data, k )
		if k ~= nil then
			return k, t[k]
		end
		return nil
	end
	local function ipairsfunc( s, i )
		i = i + 1
		if t[i] ~= nil then
			return i, t[i]
		end
		-- no nil to match default ipairs()
	end
	local mt = {
		__index = function ( tt, k )
			assert( t == tt )
			if changes and changes[k] ~= dummyValue then
				return changes[k]
			end
			local v = data[k]
			if type( v ) == 'table' then
				seen[v] = seen[v] or dataWrapper( v, seen )
				return seen[v]
			end
			return v
		end,
		__newindex = function ( tt, k, v )
			assert( t == tt )
			if not changes then
				changes = {}
				for k in pairs( data ) do
					changes[k] = dummyValue
				end
			end
			changes[k] = v
		end,
		__pairs = function ( tt )
			assert( t == tt )
			return pairsfunc, t, nil
		end,
		__ipairs = function ( tt )
			assert( t == tt )
			return ipairsfunc, t, 0
		end,
	}

	-- This is just to make setmetatable() fail
	mt.__metatable = mt

	return setmetatable( t, mt )
end

local cached = { P31 = 'Q5' }

local function getEntity()
	return dataWrapper( cached )
end

-- This demonstrates that cross-invoke communication is impossible
local e1 = getEntity()
local e2 = getEntity()
e1.P31 = 'muhaha'
local e3 = getEntity()
print( e1.P31, e2.P31, e3.P31 ) --> muhaha  Q5  Q5

This comes with the same limitations mw.loadData has: the length operator #, next(), and the functions in the table library do not work. But we can index the table and iterate over it, which is all that really matters. IMHO, being able to modify the entity table is unnecessary, so using mw.loadData's implementation instead of the above patch would also suffice.
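The reason for those limitations can be seen with a minimal proxy (a sketch, not the actual mw.loadData wrapper): the proxy table itself is empty, and #, next(), and the table library functions operate on the raw table, ignoring metamethods.

```lua
-- The proxy holds no data of its own; all reads are served by __index.
local data = { 'a', 'b', 'c' }
local proxy = setmetatable( {}, { __index = data } )

print( proxy[2] )      --> b   (indexing works, via __index)
print( #proxy )        --> 0   (# looks at the raw, empty proxy table)
print( next( proxy ) ) --> nil (next() also ignores metamethods)
```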

On Commons category pages that use the Wikidata Infobox, the Lua profiler often reports that recursiveClone takes up most of the time (try it yourself here). So getting rid of mw.clone would be a huge win for performance, especially since getBestStatements and getAllStatements are also affected by this.