Add a Lua call that returns a list of defined properties without requiring a call to getEntity
Open, Needs Triage · Public

Description

It would be useful to be able to easily and efficiently fetch a list of the properties that are defined in a given Wikidata item via Lua - for example, so that infoboxes that fetch specific properties can be made more efficient by only fetching properties that have values.

As things stand, to get a list of the defined properties in a Wikidata item, it seems that you have to call mw.wikibase.getEntity first, and then call entity:getProperties(). This returns more information from the Wikidata item than is necessary, which increases server load.
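For illustration, the current approach could be sketched roughly as below. `mw.wikibase.getEntity` and `entity:getProperties()` are the calls named above; the stub `mw` table, the helper name `definedProperties`, and the item `Q42` with its two property IDs are assumptions added only so the snippet also runs outside Scribunto.

```lua
-- Hypothetical stand-in for the Scribunto environment so the sketch runs in
-- plain Lua. Inside Scribunto the real `mw` global is used instead.
local mw = mw or {
  wikibase = {
    getEntity = function(id)
      -- Illustrative entity: the whole claims table is loaded just to list its keys.
      local entity = { id = id, claims = { P31 = {}, P569 = {} } }
      function entity:getProperties()
        local ids = {}
        for propertyId in pairs(self.claims) do
          ids[#ids + 1] = propertyId
        end
        table.sort(ids) -- sorted only to make the stub deterministic
        return ids
      end
      return entity
    end,
  },
}

-- Current approach: load the full entity, then ask which properties it defines.
local function definedProperties(itemId)
  local entity = mw.wikibase.getEntity(itemId)
  if not entity then
    return {}
  end
  return entity:getProperties()
end

local props = definedProperties('Q42')
print(table.concat(props, ', '))
```

The point of the request is that the `getEntity` call here pulls in labels, sitelinks and every statement, even though only the property IDs are used.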

Event Timeline

Restricted Application added a subscriber: Aklapper. · May 24 2020, 7:27 PM
Verdy_p added a subscriber: Verdy_p. · Edited · May 25 2020, 9:06 AM

And please add getEntityByLang(lang), which would implement language fallbacks, so that we can load (and cache) entities using only the relevant languages. Please also remove all sitelinks from this call: sitelinks can be loaded selectively for specific entities. In general only the entity related to the current page needs them; its sitelinks are already loaded for the languages sidebar, and could have their own separate cache holding the sitelinks for that single entity.

The cache of entities with selective languages could store more entities at no extra memory cost (for now it's limited to 15 entities, which is not enough for infoboxes and the many Lua modules that perform repeated calls to the Wikidata server for the same entity).

And as I said on Commons, the current JSON model is way too verbose to be exposed "as is" in Lua:

  • compress snaks and datavalues into a single table:
    • remove the "snaktype" and "datatype" internals, just keep "type"
    • remove the "datavalue" internal, just keep "value"
  • remove other internal legacy fields ("hash", "id", "ns", "title") from properties
  • remove superfluous language codes and site codes in individual labels, descriptions, aliases, or sitelinks, along with the subtable that groups them with the actual value (this saves one subtable per label, description, alias or sitelink).

This can become a new, more compact JSON model (the old model is just a legacy that may be kept for compatibility but is no longer recommended) and should be the model used for the cache of entities to save memory.
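A minimal sketch of what such a compaction might look like, under one possible reading of the list above: `type` keeps the datatype for value snaks and the snaktype otherwise, and each label collapses to a plain string keyed by language code. The function names `compactSnak` and `compactTerms`, the hash string, and the sample data are hypothetical and not part of any existing API.

```lua
-- Collapse one legacy snak (snak + datavalue subtable) into a single flat table.
local function compactSnak(snak)
  local compact = { property = snak.property }
  if snak.snaktype == 'value' and snak.datavalue then
    compact.type = snak.datatype
    compact.value = snak.datavalue.value
  else
    -- "novalue" and "somevalue" snaks carry no datavalue.
    compact.type = snak.snaktype
  end
  return compact
end

-- Collapse { en = { language = 'en', value = 'x' } } into { en = 'x' }.
-- Covers labels and descriptions; aliases hold a list per language and
-- would need one more loop.
local function compactTerms(terms)
  local compact = {}
  for code, term in pairs(terms) do
    compact[code] = term.value
  end
  return compact
end

-- Illustrative legacy-shaped input.
local legacySnak = {
  snaktype = 'value',
  property = 'P18',
  hash = 'illustrative-hash',
  datatype = 'commonsMedia',
  datavalue = { type = 'string', value = 'Example.jpg' },
}
local legacyLabels = {
  en = { language = 'en', value = 'Douglas Adams' },
  fr = { language = 'fr', value = 'Douglas Adams' },
}

local snak = compactSnak(legacySnak)
local labels = compactTerms(legacyLabels)
print(snak.property, snak.type, snak.value)
print(labels.en)
```

In this sketch the snak goes from two tables to one, and the labels go from one table per language plus the container to the container alone.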

I've made tests already, and the number of tables for a full entity (even without filtering any languages or sitelinks) is divided by 3, the total number of strings is divided by 4 to 5, sometimes more, and using it in Lua modules is much faster and simpler (with far fewer table lookups, so it saves CPU time as well)! This is significant for highly populated entities (like the "United States" or international sport competitions).

Also, check that Scribunto uses the 32-bit version of Lua and not the 64-bit version. Given the memory constraint of ~50MB for Lua scripts in Scribunto, the 64-bit version just wastes memory on the many pointers inside tables and on pointers to strings; it offers no benefit, as we will never allow scripts used in Scribunto to use more than 3 gigabytes.

Finally, Lua's implementation of tables currently suffers from poor management of collision lists: large tables used in several modules are modified a lot, and all slots merge into the same collision list, so that tables no longer behave as "fast hashed lookups" but turn into slow full scans of long lists! This is caused by several factors:

  • poor hashing function for strings longer than 12 bytes
  • poor scanning method using chained nodes, which also use too much memory per node in tables; no chaining pointer is needed in each node, as nodes can be chained by simple linear steps, modulo the table size: this saves about 33% memory per table.
  • tables can only be sized with a number of nodes that is an exact power of two (this is a problem for tables with more than 16 keys: they waste too much memory): the current table size growth factor of 2 is too large once you have many distinct keys: you'll enlarge the table with too many free slots.
  • tables cannot be preallocated for a known target number of keys (e.g. when PHP converts the JSON data to create and load a Lua table): they are reallocated multiple times if you load many keys, and all keys are rehashed and all collision lists have to be recomputed for each reallocation. It's possible to allocate the table once, load all data without any rehashing, and then restore the normal growth/shrink factor (which should be about 1.25 and not 2 as it is today).
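The growth-factor argument in the last two points can be illustrated with a deliberately simplified model. It is not how Lua's real rehash works (which counts keys to size the array and hash parts); it only counts how many reallocations and how many free slots result from inserting `n` keys one by one. The function `simulateGrowth` and its parameters are hypothetical.

```lua
-- Simplified model: insert n keys one at a time into a table that starts with
-- `initial` slots and grows by `factor` whenever it is full.
-- Returns final capacity, number of reallocations, and free slots left over.
local function simulateGrowth(n, factor, initial)
  local capacity = initial or 1
  local reallocations = 0
  for used = 1, n do
    if used > capacity then
      -- Always grow by at least one slot so small tables make progress.
      capacity = math.max(capacity + 1, math.ceil(capacity * factor))
      reallocations = reallocations + 1
    end
  end
  return capacity, reallocations, capacity - n
end

local n = 1025
local cap2, re2, free2 = simulateGrowth(n, 2)          -- doubling
local capS, reS, freeS = simulateGrowth(n, 1.25)       -- slower growth
local capP, reP, freeP = simulateGrowth(n, 1.25, n)    -- preallocated to n

print('factor 2:     ', cap2, re2, free2)
print('factor 1.25:  ', capS, reS, freeS)
print('preallocated: ', capP, reP, freeP)
```

In this model a smaller growth factor leaves fewer free slots but triggers more reallocations, which is why the comment pairs it with allocating once for a known number of keys.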

Such a change currently requires "hacking" into the C code implementing Lua before compiling it (to 32-bit, not 64-bit, for use in Scribunto!).

The server, however, may have a 64-bit version of Lua for server-side maintenance scripts running on the Wikidata server, where Lua is not used via Scribunto in normal wiki pages and which may not have the 50MB limit (such maintenance scripts run by admins could be allowed to use gigabytes).

To make sure I understand: You'd like to have a way in Lua to request for a given Item which Properties it uses for Statements but you're not interested in the value associated with it at this point.
Please excuse my Lua ignorance: why don't you just try to get the statement and handle the reply depending on if it returns a statement or not? Is this based on the assumption that this is cheaper overall? (ccing @hoo to confirm)
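The alternative asked about here might look roughly like the sketch below: probe each property of interest and keep only those that return statements. `mw.wikibase.getBestStatements(entityId, propertyId)` is an existing Wikibase client call; the stub `mw` table, the helper `propertiesWithValues`, and the sample data are assumptions so the snippet runs outside Scribunto.

```lua
-- Hypothetical stand-in for the Scribunto environment so the sketch runs in
-- plain Lua. Inside Scribunto the real `mw` global is used instead.
local mw = mw or {
  wikibase = {
    getBestStatements = function(entityId, propertyId)
      -- Illustrative data: only P31 has a statement on Q42.
      local fake = { Q42 = { P31 = { { rank = 'normal' } } } }
      return (fake[entityId] or {})[propertyId] or {}
    end,
  },
}

-- Return the subset of `candidates` that has at least one best statement.
local function propertiesWithValues(itemId, candidates)
  local found = {}
  for _, propertyId in ipairs(candidates) do
    local statements = mw.wikibase.getBestStatements(itemId, propertyId)
    if statements and #statements > 0 then
      found[#found + 1] = propertyId
    end
  end
  return found
end

local present = propertiesWithValues('Q42', { 'P31', 'P569', 'P18' })
print(table.concat(present, ', '))
```

This needs one call per candidate property, so whether it is cheaper overall than a single list call is exactly the open question in this comment.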

@Verdy_p Could you open a separate ticket for this? It seems quite different from the initial request so should be handled in a different ticket to not get lost.