
[SPIKE] Estimate growth in demand for Parser Cache storage
Open, MediumPublic

Description

This task represents the work of estimating how the demand for Parser Cache storage could change over time.

The need for such an estimate emerged in the 1 July 2021 meeting between the Editing, Data Persistence, Performance, and Parsing Teams.

Open question(s)

  • 1. How do we estimate how the storage demand being placed on the parser cache will change over time? Note: the changes in said "storage demands" will be driven by parameters such as edit rate, data retention time, etc.

Done

  • Answers to all Open question(s) are documented within this task's description

Event Timeline

ppelberg edited projects, added Editing-team (Tracking); removed Editing-team.
ppelberg moved this task from Backlog to External on the Editing-team (Tracking) board.
Krinkle triaged this task as High priority.Jul 19 2021, 6:04 PM
Krinkle changed the task status from Open to Stalled.Dec 20 2021, 7:24 PM
Krinkle lowered the priority of this task from High to Medium.

Marking as stalled for now pending parent task's being unblocked. I might still squeeze this in earlier, but it's not urgent right now.

I spent some time on this. I sampled roughly 1.2% (3/256) of the data, and here are some notes.

One big takeaway here is that PC size is not correlated with the size of wikis or their reads; it is mostly about Commons and Wikidata. The other half is other wikis. When it comes to records, 29.8% of all ParserCache entries are from Commons. To explain how massive that is: it is bigger than the next eleven wikis combined (excluding Wikidata, I'll explain below why).

Something like this could give a very rough estimate of the number of records in the PC. The current total is 481M entries:

1.33A + B + 0.4C + 0.77D

Where

  • A is the total number of pages on Commons. Keep in mind that this is not just images; it is all pages.
  • B is the total number of pages edited on Wikidata in the last month. Currently around 7M.
  • C is the total number of pages on Wikidata. All pages.
  • D is the total number of pages on all wikis except Wikidata and Commons.

The 0.77 factor is pretty consistent across the wikis I checked. But A could be broken down into the number of images on Commons and the number of other pages (outside ns6). It would be something like 1.4 for files and 1.1 for the rest.
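As a minimal sketch of the rule of thumb above, the estimate can be written as a small function. The function and parameter names are hypothetical, and the coefficients are just the rough values quoted in this comment, not measured constants.

```python
def estimate_pc_entries(commons_pages, wikidata_edited_last_month,
                        wikidata_pages, other_wiki_pages):
    """Very rough ParserCache entry estimate: 1.33A + B + 0.4C + 0.77D."""
    return (1.33 * commons_pages
            + wikidata_edited_last_month
            + 0.4 * wikidata_pages
            + 0.77 * other_wiki_pages)


def estimate_commons_entries(commons_files, commons_non_file_pages):
    """Finer-grained Commons term: about 1.4 per file page (ns6) and
    about 1.1 per page in any other namespace."""
    return 1.4 * commons_files + 1.1 * commons_non_file_pages
```

Feeding in projected page counts and edit rates gives a projected number of entries; the result is only as good as the coefficients, which should be re-derived after each new sampling.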

Commons

Commons is a big mess. We actually improved this with the action=render cleanup, but it didn't have as good an effect as I hoped, for a funny reason. Each wiki is using uselang=, making it fragmented by language. So most of the action=render fragmentation has now turned into language fragmentation. It is good to have it anyway.

If you look at size instead of records, the percentage actually increases to 31%, which means the average Commons PC entry is slightly bigger than the average entry in ParserCache, which I think is a bit surprising.

This seems to be a combination of multiple factors leading to this massive storage:

  • A lot of refreshlinks jobs are being triggered.
    • Commons has a lot of templates that are widely used, and any change to them causes a massive increase. One really good way to tackle this is to move the documentation of templates to its own slot.
    • A lot of refreshlinks jobs are also being triggered from Wikidata, which reinforces the fact that we shouldn't enable rc injection for Commons (T179010), but we can look at this a bit harder:
      • Revisit and check the Lua modules for Wikidata to see if there is anything that we can improve.
      • Make the tracking more granular (I'm not sure if that would help; the statement tracking is already disabled, but maybe that still triggers a job?). This needs a bit of investigation.
  • (I had some ideas about images but I forgot)

Wikidata

The number of entries that should be in PC for Wikidata should be pretty low. We have made changes that avoid parsing and storing pages when a page is edited (assuming most of them are bot edits, which don't need the parsed page), and yet 10% of all PC records are still from Wikidata. In fact, even if each edit still created a record, that would account for only 15% of those rows. Something else is in play here :/ Can it be dumpers? Can it be someone hitting the HTML of our pages? I will dig into this.

In order to do a proper investigation of this (and possible improvements), I suggest we add traceback logging to parsing (sampled 1:128 maybe?) and dive into the data. How does that sound to you, @Krinkle?
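To illustrate what sampled traceback logging could look like, here is a language-neutral sketch in Python. It is not MediaWiki code: the logger name, function names, and the stack depth limit are all assumptions made for the example; only the 1:128 sampling rate comes from the suggestion above.

```python
import logging
import random
import traceback

logger = logging.getLogger("parse_trace")  # hypothetical logger name

SAMPLE_RATE = 128  # log roughly 1 in 128 parses


def should_sample(rate=SAMPLE_RATE, rng=random):
    """Return True for about 1 in `rate` calls."""
    return rng.randrange(rate) == 0


def log_parse_origin(page_title, rate=SAMPLE_RATE, rng=random):
    """Log the call stack that led to a parse, for a sampled subset of parses.

    Returns True if this call was sampled and logged, False otherwise.
    """
    if not should_sample(rate, rng):
        return False
    # Capture at most 20 frames of the current call stack (arbitrary limit).
    stack = "".join(traceback.format_stack(limit=20))
    logger.info("parse of %s triggered from:\n%s", page_title, stack)
    return True
```

Grouping the logged stacks by their entry point would then show which callers (jobs, API requests, dump generation and so on) are responsible for most parses.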

Random notes:

  • Each refreshlinks job parses the page twice.
  • We can also just accept the fact that things are like this and add another section in the future. I personally would like to pick the low-hanging fruit, but we shouldn't spend too much time on it.
  • Another way to turn the knob is to actually look at hit data and cache age. If it turns out, for example, that 99% of cache hits go to entries that are at most one week old (and we have a long tail), we can probably just reduce the expiry time without much loss. Having a graph of it would help me do some math on it (finding a local optimum, I guess).
Ladsgroup moved this task from Triage to In progress on the DBA board.

Change 761492 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] ContentHandler: Pass parsing timestamp to ParserCache::save

https://gerrit.wikimedia.org/r/761492

Change 761419 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.38.0-wmf.21] ContentHandler: Avoding saving in ParserCache in search index jobs

https://gerrit.wikimedia.org/r/761419

Change 761420 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.38.0-wmf.20] ContentHandler: Avoding saving in ParserCache in search index jobs

https://gerrit.wikimedia.org/r/761420

Change 761492 merged by jenkins-bot:

[mediawiki/core@master] ContentHandler: Avoding saving in ParserCache in search index jobs

https://gerrit.wikimedia.org/r/761492

Change 761420 merged by jenkins-bot:

[mediawiki/core@wmf/1.38.0-wmf.20] ContentHandler: Avoding saving in ParserCache in search index jobs

https://gerrit.wikimedia.org/r/761420

Mentioned in SAL (#wikimedia-operations) [2022-02-10T18:40:19Z] <ladsgroup@deploy1002> Synchronized php-1.38.0-wmf.20/includes/content/ContentHandler.php: Backport: [[gerrit:761420|ContentHandler: Avoding saving in ParserCache in search index jobs (T285993)]] (duration: 00m 50s)

Change 761419 merged by jenkins-bot:

[mediawiki/core@wmf/1.38.0-wmf.21] ContentHandler: Avoding saving in ParserCache in search index jobs

https://gerrit.wikimedia.org/r/761419

Mentioned in SAL (#wikimedia-operations) [2022-02-10T18:42:39Z] <ladsgroup@deploy1002> Synchronized php-1.38.0-wmf.21/includes/content/ContentHandler.php: Backport: [[gerrit:761419|ContentHandler: Avoding saving in ParserCache in search index jobs (T285993)]] (duration: 00m 50s)

So with these fixes, I can confirm that reads, writes and storage on ParserCache have dropped drastically (this is just from the binlog getting smaller):
https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1&from=1643987600396&to=1644591181042

image.png (544×1 px, 147 KB)

https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-job=All&var-server=pc1013&var-port=9104&from=1643986467835&to=1644591267835

image.png (538×884 px, 92 KB)

This means we need to wait around a month and re-analyze everything.

Krinkle changed the task status from Stalled to Open.Feb 14 2022, 2:00 PM
Krinkle moved this task from Untriaged to Consider write up on the Performance-Team-publish board.

Twenty days after the CirrusSearch change, I did a number check: the number of entries in ParserCache has dropped by 5.6%, which is around 27M entries. There is more work that can be done there, and I will continue looking into it and working on it.

Change 775390 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] ParserOutputAccess: Allow calling getPO with option of not saving in PC

https://gerrit.wikimedia.org/r/775390

Change 775390 merged by jenkins-bot:

[mediawiki/core@master] ParserOutputAccess: Allow calling getPO with option of not saving in PC

https://gerrit.wikimedia.org/r/775390

Change 777388 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.39.0-wmf.5] ParserOutputAccess: Allow calling getPO with option of not saving in PC

https://gerrit.wikimedia.org/r/777388

Change 777388 merged by jenkins-bot:

[mediawiki/core@wmf/1.39.0-wmf.5] ParserOutputAccess: Allow calling getPO with option of not saving in PC

https://gerrit.wikimedia.org/r/777388

Mentioned in SAL (#wikimedia-operations) [2022-04-05T15:42:04Z] <ladsgroup@deploy1002> Synchronized php-1.39.0-wmf.5/includes: Backport: [[gerrit:777388|ParserOutputAccess: Allow calling getPO with option of not saving in PC (T285993)]] (duration: 01m 00s)

We have a decent reduction in PC saves. Let's see where it ends up in a week or so:

image.png (995×1 px, 165 KB)

Mentioned in SAL (#wikimedia-operations) [2022-04-19T01:02:16Z] <Amir1> turning on general logging in pc1012 (pc2) (T285993)

Mentioned in SAL (#wikimedia-operations) [2022-04-19T01:03:39Z] <Amir1> turning off general logging in pc1012 (pc2) (T285993)

I sampled read queries to ParserCache for a short while, which gave ~36K read queries. Given that I had the rough timestamp of each query, I replayed them and collected the exptimes. The resulting distribution of exptime (per hour) is here:

res.png (480×640 px, 12 KB)

The first local optimum shows up at around 40 hours (13%), so as an extreme measure, you could set the expiry to forty hours and reduce the size of PC to 8% while still keeping 13% of hits.

One really interesting observation is the solid hit rate for expiry between 400 and 500 hours. While it's technically 20% of the PC, it provides 34% of the hits. This probably means we should increase the TTL to thirty days and revisit the value with new sampling. The optimum seems to be somewhere above 21 days (but hopefully somewhere below 30 days).
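The trade-off between TTL and retained hits can be sketched as follows. This is only an illustration, under the assumption that each sampled hit can be converted into the age of the entry at read time (current TTL minus the hours remaining until its exptime); the function names and inputs are hypothetical and not part of the sampling scripts used here.

```python
from bisect import bisect_right


def age_from_exptime(hours_until_expiry, current_ttl_hours):
    """Age of an entry at read time, assuming it was saved with the
    current TTL and no other expiry adjustments were applied."""
    return current_ttl_hours - hours_until_expiry


def hit_share_within(ages_hours, candidate_ttl_hours):
    """Fraction of sampled hits that landed on entries no older than
    the candidate TTL, i.e. hits that would survive the shorter TTL."""
    if not ages_hours:
        raise ValueError("no sampled hits")
    ordered = sorted(ages_hours)
    return bisect_right(ordered, candidate_ttl_hours) / len(ordered)
```

Evaluating `hit_share_within` over a range of candidate TTLs, next to the share of stored entries at each age, would give the curve needed to pick an optimum rather than reading it off the histogram by eye.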