Wikidata Entities are getting too big
Open, Needs Triage, Public

Description

Some Items are at the size limit that we defined in configuration. Based on the exceptions we're seeing, this started ca. August 2022:

[Image: image.png]

(source) The fact that these exceptions show up in our logs is an additional, but separate, problem; see T320945.

Metrics on Entity Size: https://grafana.wikimedia.org/d/000000167/wikidata-datamodel?orgId=1&refresh=30m&viewPanel=2&from=now-1y&to=now

This is causing a bunch of problems:

Notes:

Event Timeline

There are different sizes calculated for Items/Entities in different places

Can you please elaborate on this? I only know of two places where this comparison is made: once when encoding entity content (which throws this exception) and once when decoding it (which results in a different error), both of which are in Wikibase\Lib\Store\EntityContentDataCodec. At least those are the comparisons made against the maxSerializedEntitySize configuration. Unless you mean there are other configurations to take into account here?
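
The two checks described above can be sketched roughly as follows. This is a minimal Python illustration, not the Wikibase code: the class `SizeLimitedCodec`, its methods, and the exception `EntityTooBigError` are hypothetical names, the limit is assumed to be given in bytes, and only the idea of comparing the serialized blob against a configured maximum on both encode and decode comes from the comment.

```python
import json


class EntityTooBigError(Exception):
    """Hypothetical error raised when a serialized blob exceeds the limit."""


class SizeLimitedCodec:
    """Toy codec that enforces one size limit on both encode and decode."""

    def __init__(self, max_serialized_size: int):
        # Limit in bytes (assumption); stands in for a setting such as
        # maxSerializedEntitySize.
        self.max_serialized_size = max_serialized_size

    def _check(self, blob: bytes) -> None:
        # Compare the size of the serialized blob against the limit.
        if len(blob) > self.max_serialized_size:
            raise EntityTooBigError(
                f"blob is {len(blob)} bytes, limit is {self.max_serialized_size}"
            )

    def encode(self, entity: dict) -> bytes:
        # Serialize first, then measure: the limit applies to the blob.
        blob = json.dumps(entity, separators=(",", ":")).encode("utf-8")
        self._check(blob)
        return blob

    def decode(self, blob: bytes) -> dict:
        # The same comparison is made before deserializing.
        self._check(blob)
        return json.loads(blob.decode("utf-8"))
```

For brevity the sketch raises the same exception type in both directions, whereas the comment notes that decoding results in a different error than encoding.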

No, I mean that those sizes produce different numbers than what is visible in the revision history of an Item.

For example, per the error added as a screenshot in the description, the maximum Entity size is 2.93 MB, or a bit over 3,000,000 bytes. But the revision history of the biggest Item gives its size as 4,413,904 bytes.
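
To make the comparison concrete, the following sketch converts both figures into the same unit. It assumes that the "MB" in the error message is a binary megabyte (1024 × 1024 bytes), which the thread does not confirm; the variable names are illustrative only.

```python
BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes

limit_mb = 2.93             # maximum Entity size from the error message
history_bytes = 4_413_904   # size shown in the revision history

# Express each figure in the other one's unit.
limit_bytes = limit_mb * BYTES_PER_MB
history_mb = history_bytes / BYTES_PER_MB

print(f"limit:   {limit_bytes:,.0f} bytes")
print(f"history: {history_mb:.2f} MB")
print(f"ratio:   {history_bytes / limit_bytes:.2f}x the limit")
```

Under that assumption the limit is roughly 3.07 million bytes, while the revision-history figure is about 4.21 MB, or around 1.44 times the limit.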

Similarly, in Grafana we seem to record the size of the largest Entity the way the revision history does, but have set the alert / red area to just about 3 MB: https://grafana.wikimedia.org/d/000000167/wikidata-datamodel?orgId=1&refresh=30m&viewPanel=2&from=now-1y&to=now

Haven't looked into it yet, but wouldn't this be a result of calculating the size of a serialized entity for the comparison vs. displaying the size of the full page content? Either way, maybe this should be a separate ticket. @Lydia_Pintscher which size do you reckon should actually be displayed in the revision history?

While this does not solve the immediate issue, I have some thoughts about the underlying problem (based on looking through Special:LongPages):

  • most of the size seems to come from ordered lists
  • most of the longest Items seem to be
    • lists of authors (scientific papers)
    • all kinds of lists of data points

Don't we have something like a tabular data type that is more suited to lists of data points? Conceptually, externalizing these into tables seems like a good solution for lists of data points. It's likely just a question of making external tables more queryable etc. to make this work in practice as well.
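
As a rough illustration of why a table can be more compact for lists of data points, the sketch below serializes the same made-up series in two ways and compares the sizes. Everything in it is an assumption for demonstration: the data is invented, the property names `P_population` and `P_point_in_time` are placeholders, and neither layout is the actual Wikibase JSON or tabular data format.

```python
import json

# Hypothetical data: 30 yearly population figures for one place.
points = [(1990 + i, 100_000 + 1_500 * i) for i in range(30)]

# Layout A: one statement-like object per data point (invented structure,
# loosely imitating a value plus a qualifier).
statements = [
    {
        "property": "P_population",
        "value": {"amount": population, "unit": "1"},
        "qualifiers": {"P_point_in_time": {"year": year}},
        "rank": "normal",
    }
    for year, population in points
]

# Layout B: a single table with a header and one compact row per point.
table = {"columns": ["year", "population"], "rows": [list(p) for p in points]}


def serialized_size(obj) -> int:
    """Size in bytes of the compact UTF-8 JSON serialization."""
    return len(json.dumps(obj, separators=(",", ":")).encode("utf-8"))


size_statements = serialized_size(statements)
size_table = serialized_size(table)
print(size_statements, size_table)
```

In this toy case the table is several times smaller because the per-point keys are stated once in the header instead of being repeated for every data point. How large the saving would be for real Items depends on the actual formats.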

This is different for authors and other connected Items. If we put them in a list, we would likely lose some information in the graph. But maybe there is still a more efficient way to do this. As this seems relevant for only relatively few Items, I guess I could also live with a workaround as a solution here (e.g. having to split up big Item lists into separate Items).