
mw.wikibase.getBestStatements is slow when used on objects with many properties
Open, Needs Triage, Public

Description

https://commons.wikimedia.org/wiki/Module:Sandbox/Commonscat queries the single statement P373 (Commons category). When running this on a small object (e.g. with {{#invoke:Sandbox/Commonscat|getCommonscat|Q1958152}} on Commons), the profiling data reports 6 or 7 milliseconds of Lua time. However, when running it on a large object like Q30, it takes significantly longer, around 100 milliseconds.

My understanding of the documentation for mw.wikibase.getBestStatements is that it only loads the one property that's queried, not the whole object, so there would be no reason why querying a large object would differ in performance from querying a small object.
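For reference, a minimal sketch of what such a module might look like (the actual content of Module:Sandbox/Commonscat is not reproduced in this task; the function name getCommonscat is taken from the invocation above, everything else is illustrative):

```lua
-- Illustrative sketch of a Commonscat lookup module (Scribunto/Lua).
-- Assumes the standard mw.wikibase client library; not the actual module source.
local p = {}

function p.getCommonscat( frame )
	-- The entity ID is passed as the first positional argument, e.g. 'Q1958152' or 'Q30'.
	local entityId = frame.args[1]

	-- Only the best-ranked statements for P373 (Commons category) are requested here,
	-- which is the call whose timing differs so much between small and large entities.
	local statements = mw.wikibase.getBestStatements( entityId, 'P373' )

	local statement = statements[1]
	if statement and statement.mainsnak.snaktype == 'value' then
		return statement.mainsnak.datavalue.value
	end
	return ''
end

return p
```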

Event Timeline

> My understanding of the documentation for mw.wikibase.getBestStatements is that it only loads the one property that's queried, not the whole object, so there would be no reason why querying a large object would differ in performance from querying a small object.

No, it has to load and deserialize the whole entity (though it only returns a small part of it), since it’s all stored as one JSON blob. (I believe the result is then cached on several levels, so subsequent requests for other properties from the same entity have a chance of being more efficient.) We might be able to improve on this eventually, but I don’t think there are any surprising hidden costs at play here which we could easily fix.
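A sketch of the cost model this implies (hedged; the exact interaction of the cache layers is not spelled out in this task):

```lua
-- Runs in the Scribunto debug console; assumes the mw.wikibase client library.

-- First access to Q30: the whole entity JSON blob is loaded and deserialized,
-- so this call pays the full cost even though only P373 is returned.
local commonsCat = mw.wikibase.getBestStatements( 'Q30', 'P373' )

-- A later request for a different property of the same entity may be served from
-- the already-deserialized (cached) entity, so it has a chance of being cheaper.
local capital = mw.wikibase.getBestStatements( 'Q30', 'P36' )

-- A request against another large entity pays the full load-and-deserialize cost again.
local germanyCat = mw.wikibase.getBestStatements( 'Q183', 'P373' )
```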

> No, it has to load and deserialize the whole entity (though it only returns a small part of it), since it’s all stored as one JSON blob.

Really? This would be a very unfortunate design decision, since it would mean that the entities with larger amounts of data (which tend to be the most interesting, and therefore the most frequently queried, ones) are the most inefficient to access in terms of resource usage.

Actually, I wonder how query.wikidata.org can be so much faster: https://w.wiki/Dho, which queries P373 for all 199 countries, completes in a little more than 100 milliseconds in total. So obviously there must be some way to access that property without loading and parsing the whole JSON blob.

The query service has a completely separate data store, and that one supports very granular data access, but isn’t directly accessible to MediaWiki/Wikibase. (It’s constantly synchronized by an updater that reads Wikidata’s stream of recent changes and issues appropriate SPARQL UPDATE queries.)

Ouch. I hope this can be changed soon; I think that many practical usages of Wikidata in the Wikimedia universe are currently not possible because of this issue, and I also think that even the current Wikidata usages consume orders of magnitude more resources (CPU time, disk access, electrical power, network bandwidth...) than would be necessary.

Addshore moved this task from incoming to needs discussion or investigation on the Wikidata board.
Addshore subscribed.

Not super high on the priority list at this moment, but something we will keep on the radar, and will probably link in with more things as time progresses.

In this regard, I made the proposal T179638 about two years ago.