Don't store actual EntityUsage objects in ParserOutput, but just minimal identifier strings
Closed, Resolved · Public

Description

Via ParserOutputUsageAccumulator we currently store a list of EntityUsage objects in a client page's ParserOutput. Given that these ParserOutput objects themselves are saved in full (serialized and gzipped) in ParserCache (which is read and unserialized for various purposes), we should try to avoid attaching too much cruft to them.

When testing this locally, I was able to save about 23 bytes per EntityUsage (after gzip), by storing an array (identity string -> null) instead of (identity string -> EntityUsage). Given that all information in EntityUsage objects is part of the identity strings, storing just these will suffice.
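The idea can be sketched outside of PHP as follows. This is only an illustration, not the Wikibase code: the `Usage` class is a hypothetical stand-in for EntityUsage, the identity format `<entity id>#<aspect>[.<modifier>]` is an assumption made for this sketch, and `pickle` plus `zlib` stand in for PHP's serialization and gzip step. The byte counts it produces are not comparable to the 23 bytes measured above.

```python
import pickle
import zlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Usage:
    """Hypothetical stand-in for an EntityUsage (entity ID, aspect, optional modifier)."""
    entity_id: str
    aspect: str
    modifier: Optional[str] = None

    def identity(self) -> str:
        # Assumed identity format: "<entity id>#<aspect>[.<modifier>]"
        key = self.aspect if self.modifier is None else f"{self.aspect}.{self.modifier}"
        return f"{self.entity_id}#{key}"

    @classmethod
    def from_identity(cls, identity: str) -> "Usage":
        # Everything needed to rebuild the object is contained in the string.
        entity_id, key = identity.split("#", 1)
        aspect, _, modifier = key.partition(".")
        return cls(entity_id, aspect, modifier or None)


def compressed_size(value) -> int:
    """Length in bytes of the serialized and compressed value."""
    return len(zlib.compress(pickle.dumps(value), 9))


usages = [Usage(f"Q{n}", "L", "en") for n in range(1, 201)]

# Current layout: identity string -> full object
with_objects = {u.identity(): u for u in usages}
# Proposed layout: identity string -> null
keys_only = {u.identity(): None for u in usages}

# The objects can be recovered from the keys alone.
restored = [Usage.from_identity(key) for key in keys_only]

print(compressed_size(with_objects), compressed_size(keys_only))
```

Because the keys already carry all of the information, the values only add redundant bytes that the compressor cannot fully remove.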

Event Timeline

hoo created this task. Oct 28 2019, 11:06 PM
Restricted Application added a project: Wikidata. Oct 28 2019, 11:06 PM
Restricted Application added a subscriber: Aklapper.
hoo updated the task description. Oct 28 2019, 11:26 PM

Change 546763 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[mediawiki/extensions/Wikibase@master] Don't store EntityUsage objects in ParserOutput, but minimal identifiers

https://gerrit.wikimedia.org/r/546763

Change 546763 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Don't store EntityUsage objects in ParserOutput, but minimal identifiers

https://gerrit.wikimedia.org/r/546763

hoo added a comment. Nov 6 2019, 1:24 AM

I collected some data about the parser cache sizes of the ruwiki pages with the most hits this month (as per stats.wikimedia.org).

From that data I found that, across all of these articles, currently:

  1. The average PC size is 103986.95
  2. The median PC size is 73419.5
  3. The minimum PC size is 13313
  4. The maximum PC size is 399637

For that I took the ParserCache objects, serialized and gzdeflated them, like SqlBagOStuff does. I also looked into using gzcompress with level 9, like MemcachedClient does, but that doesn't seem to have much of an impact.
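A rough sketch of that measurement, written in Python rather than PHP, could look like the following. It is not the script from P9535: `pickle` is only a stand-in for PHP's `serialize()`, the `sample` value is made up, and the helper names are hypothetical. What does carry over is the difference between the two formats: PHP's `gzdeflate` emits a raw DEFLATE stream, while `gzcompress` emits the same kind of stream wrapped in a zlib header and checksum.

```python
import pickle
import zlib


def raw_deflate(data: bytes, level: int = -1) -> bytes:
    """Raw DEFLATE stream without header or checksum (the gzdeflate format)."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, -zlib.MAX_WBITS)
    return compressor.compress(data) + compressor.flush()


def zlib_compress(data: bytes, level: int = 9) -> bytes:
    """DEFLATE stream wrapped in a zlib header and checksum (the gzcompress format)."""
    return zlib.compress(data, level)


def stored_sizes(value) -> dict:
    """Sizes in bytes of a value after serialization and each compression variant."""
    blob = pickle.dumps(value)  # stand-in for PHP serialize()
    return {
        "serialized": len(blob),
        "gzdeflate": len(raw_deflate(blob)),
        "gzcompress_9": len(zlib_compress(blob, 9)),
    }


# Made-up sample standing in for a ParserCache entry.
sample = {
    "html": "<p>Example parser output</p>" * 500,
    "usages": {f"Q{n}#L.en": None for n in range(200)},
}
sizes = stored_sizes(sample)
print(sizes)
```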

With this data as a baseline, we can roughly estimate the impact this change will have on highly visited pages.

The scripts I used can be found in P9535.

hoo added a comment. Mon, Nov 11, 9:59 AM

I purged all pages from above (which haven't been re-parsed since) and gathered new numbers:

  1. The average PC size is 101219.66 (down ~2.7%)
  2. The median PC size is 68137 (down ~7.2%)
  3. The minimum PC size is 12911 (down ~3.0%)
  4. The maximum PC size is 393856 (down ~1.4%)
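
The relative reductions can be recomputed from the two sets of numbers, taking the earlier measurement as the baseline. A minimal sketch (the function and variable names are only illustrative):

```python
def percent_reduction(before: float, after: float) -> float:
    """Reduction from `before` to `after`, as a percentage of `before`."""
    return (before - after) / before * 100


# (size on Nov 6, size after purging) for each statistic quoted above
measurements = {
    "average": (103986.95, 101219.66),
    "median": (73419.5, 68137),
    "minimum": (13313, 12911),
    "maximum": (399637, 393856),
}

for name, (before, after) in measurements.items():
    print(f"{name}: down ~{percent_reduction(before, after):.1f}%")
```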