[SPIKE] Estimate growth in demand for Parser Cache storage
Closed, ResolvedPublic

Assigned To

Authored By

	ppelberg
	Jul 1 2021, 11:08 PM

Description

This task covers the work of estimating how the demand for Parser Cache storage could change over time.

The need for such an estimate emerged in the 1 July 2021 meeting between the Editing, Data Persistence, Performance, and Parsing Teams.

Open question(s)

1. How do we estimate how the storage demand placed on the parser cache will change over time? Note: changes in this storage demand will be driven by parameters such as edit rate, data retention time, etc.

Done

Answers to all Open question(s) are documented within this task's description

Details

Subject	Repo	Branch	Lines +/-
ParserOutputAccess: Allow calling getPO with option of not saving in PC	mediawiki/core	wmf/1.39.0-wmf.5	+15 -11
ParserOutputAccess: Allow calling getPO with option of not saving in PC	mediawiki/core	master	+15 -11
ContentHandler: Avoding saving in ParserCache in search index jobs	mediawiki/core	wmf/1.38.0-wmf.21	+0 -3
ContentHandler: Avoding saving in ParserCache in search index jobs	mediawiki/core	master	+0 -3
ContentHandler: Avoding saving in ParserCache in search index jobs	mediawiki/core	wmf/1.38.0-wmf.20	+0 -3


Related Objects

Status	Assigned	Task
Open	pmiazga	T302623 FY2022-2023: Improve Backend Pageview Timing
Resolved	Marostegui	T280604 Post-deployment: (partly) ramp parser cache retention back up
Resolved	Ladsgroup	T285993 [SPIKE] Estimate growth in demand for Parser Cache storage

Event Timeline

ppelberg created this task.Jul 1 2021, 11:08 PM

ppelberg mentioned this in T280599: Reduce DiscussionTools' usage of the parser cache.Jul 1 2021, 11:24 PM

ppelberg assigned this task to Krinkle.Jul 1 2021, 11:30 PM

ppelberg added subscribers: DAbad, marcella.

Krinkle edited projects, added Performance-Team; removed Performance-Team (Radar).Jul 2 2021, 4:43 PM

ppelberg moved this task from Backlog to Triaged on the DiscussionTools board.Jul 2 2021, 4:44 PM

ppelberg edited projects, added Editing-team (Tracking); removed Editing-team.

ppelberg moved this task from Backlog to External on the Editing-team (Tracking) board.

Krinkle edited parent tasks, added: T280604: Post-deployment: (partly) ramp parser cache retention back up ; removed: T280599: Reduce DiscussionTools' usage of the parser cache.Jul 12 2021, 5:57 PM

dpifke moved this task from Inbox, needs triage to Backlog: Maintenance, non-prioritized on the Performance-Team board.Jul 12 2021, 6:38 PM

Krinkle triaged this task as High priority.Jul 19 2021, 6:04 PM

Marking as stalled for now pending parent task's being unblocked. I might still squeeze this in earlier, but it's not urgent right now.

Krinkle reassigned this task from Krinkle to Ladsgroup.Feb 1 2022, 7:01 PM

Restricted Application added a project: User-Ladsgroup.Feb 1 2022, 7:01 PM

I spent some time on this. I took a sample of about 1.2% (3/256), and here are some notes.

One big takeaway here is that PC size is not correlated with the size of wikis or their reads; it is mostly about Commons and Wikidata. The other half is the other wikis. In terms of records, 29.8% of all ParserCache entries are from Commons. To show how massive that is: it is bigger than the next eleven wikis combined (excluding Wikidata; I'll explain why below).

Something like the following gives a very rough estimate of the number of records in the PC. The current total is 481M entries:

1.33A + B + 0.4C + 0.77D

Where

A is the total number of pages on Commons. Keep in mind that this is all pages, not just images.
B is the total number of pages edited on Wikidata in the last month, currently around 7M.
C is the total number of pages on Wikidata. All pages.
D is the total number of pages on all wikis except Wikidata and Commons.

The 0.77 factor is pretty consistent across the wikis I checked. The A term could be broken down into the number of images on Commons (ns6) and the number of other pages (outside ns6). The factors would be something like 1.4 for files and 1.1 for the rest.
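As a rough illustration, the estimate above can be written as a small function. This is only a sketch of the formula from this comment: the function and parameter names are made up for the example, and the page counts in the usage lines (other than the ~7M figure for B) are hypothetical placeholders, not measured values.

```python
def estimate_pc_entries(commons_pages, wikidata_edited_last_month,
                        wikidata_pages, other_wiki_pages):
    """Very rough ParserCache record estimate: 1.33A + B + 0.4C + 0.77D."""
    return (1.33 * commons_pages
            + wikidata_edited_last_month
            + 0.4 * wikidata_pages
            + 0.77 * other_wiki_pages)


def estimate_pc_entries_split(commons_files, commons_non_files,
                              wikidata_edited_last_month,
                              wikidata_pages, other_wiki_pages):
    """Same estimate, with the Commons term split into files (ns6) and the rest."""
    commons_term = 1.4 * commons_files + 1.1 * commons_non_files
    return (commons_term
            + wikidata_edited_last_month
            + 0.4 * wikidata_pages
            + 0.77 * other_wiki_pages)


# Hypothetical inputs for illustration only; B (~7M) is the one figure
# taken from the comment above.
example = estimate_pc_entries(
    commons_pages=100_000_000,             # placeholder
    wikidata_edited_last_month=7_000_000,  # "currently around 7M"
    wikidata_pages=100_000_000,            # placeholder
    other_wiki_pages=300_000_000,          # placeholder
)
print(f"estimated entries: {example / 1e6:.0f}M")
```

Plugging in real page counts and comparing the result against the measured 481M total would show how well the coefficients hold up over time.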

Commons

Commons is a big mess. We actually improved this with the action=render cleanup, but it didn't have as good an effect as I hoped, for a funny reason: each wiki uses uselang=, which fragments the cache by language. So most of the action=render fragmentation has now turned into language fragmentation. The cleanup is good to have anyway.

If you look at size instead of records, the percentage actually increases to 31%. That means the average Commons PC entry is slightly bigger than the average ParserCache entry, which I think is a bit surprising.

This seems to be a combination of multiple factors leading to this massive storage:

- A lot of refreshlinks jobs are being triggered.
  - Commons has a lot of widely used templates, and any change to them causes a massive increase. One really good way to tackle this is to move the documentation of templates to its own slot.
  - A lot of refreshlinks jobs are also triggered from Wikidata, which reinforces the point that we shouldn't enable RC injection for Commons (T179010), but we can look at this a bit harder:
    - Revisit and check the Lua modules for Wikidata to see if there is anything we can improve.
    - Make the tracking more granular. (I'm not sure that would help; statement tracking is already disabled, but maybe it still triggers a job?) This needs a bit of investigation.
- (I had some ideas about images but I forgot.)

Wikidata

The number of entries that should be in the PC for Wikidata should be pretty low. We have made changes that avoid parsing and storing pages when a page is edited (assuming most edits are bot edits, which don't need the parsed page), and yet 10% of all PC records are still from Wikidata. In fact, even if each edit still created a record, that would account for only 15% of rows. Something else is in play here :/ Could it be dumpers? Could it be someone hitting the HTML of our pages? I will dig into this.

To investigate this properly (and find possible improvements), I suggest we add traceback logging to parsing (sampled 1:128, maybe?) and dive into the data. How does that sound to you, @Krinkle?
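To make the sampling idea concrete, here is a minimal sketch of 1:128 sampled traceback logging. It is written in Python purely for illustration and is not MediaWiki code; the function name, logger name, and the injectable `rng` parameter are all assumptions made for the example.

```python
import logging
import random
import traceback

SAMPLE_RATE = 128  # log roughly 1 in 128 parses, as suggested above

logger = logging.getLogger("parse_origin_sketch")  # hypothetical logger name


def maybe_log_parse_origin(page_title, rng=random):
    """Log the current call stack for roughly 1 in SAMPLE_RATE calls.

    Returns True if this call was sampled, False otherwise.
    """
    if rng.randrange(SAMPLE_RATE) != 0:
        return False
    # format_stack() captures where the parse was triggered from.
    stack = "".join(traceback.format_stack(limit=20))
    logger.info("sampled parse of %s\n%s", page_title, stack)
    return True
```

Grouping the logged stacks by their top frames would then show which code paths account for most parses.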

Random notes:

Each refreshlinks job parses the page twice.
We can also just accept that things are like this and add another section in the future. I personally would like to pick the low-hanging fruit, but we shouldn't spend too much time on it.
Another way to turn the knob is to look at hit data and cache age. If it turns out, for example, that 99% of cache hits go to entries that are at most one week old (and we have a long tail), we can probably just reduce the retention time without much loss. Having a graph of it would help me do some math on it (finding a local optimum, I guess).
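The hit-data idea in the last note can be sketched as follows. This assumes we already have, for a sample of cache hits, the age in hours of the entry that was hit; how that data would be collected is not covered here, and the function names are hypothetical.

```python
import math


def hit_share_within(ages_hours, ttl_hours):
    """Fraction of sampled hits whose entry was at most ttl_hours old."""
    if not ages_hours:
        return 0.0
    covered = sum(1 for age in ages_hours if age <= ttl_hours)
    return covered / len(ages_hours)


def smallest_ttl_for(ages_hours, target_share):
    """Smallest observed entry age that covers at least target_share of hits."""
    if not ages_hours:
        raise ValueError("need at least one sampled hit")
    ordered = sorted(ages_hours)
    # Index of the first element at which the covered share reaches the target.
    index = max(0, math.ceil(target_share * len(ordered)) - 1)
    return ordered[index]
```

For example, if `smallest_ttl_for(ages, 0.99)` came back at around 168 hours, that would match the "99% of hits within one week" scenario described above.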

Ladsgroup added a project: DBA.Feb 9 2022, 6:32 PM

Ladsgroup moved this task from Triage to In progress on the DBA board.

Change 761492 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] ContentHandler: Pass parsing timestamp to ParserCache::save

https://gerrit.wikimedia.org/r/761492

gerritbot added a project: Patch-For-Review.Feb 9 2022, 11:15 PM

Change 761419 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.38.0-wmf.21] ContentHandler: Avoding saving in ParserCache in search index jobs

https://gerrit.wikimedia.org/r/761419

Change 761420 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.38.0-wmf.20] ContentHandler: Avoding saving in ParserCache in search index jobs

https://gerrit.wikimedia.org/r/761420

Change 761492 merged by jenkins-bot:

[mediawiki/core@master] ContentHandler: Avoding saving in ParserCache in search index jobs

https://gerrit.wikimedia.org/r/761492

Change 761420 merged by jenkins-bot:

[mediawiki/core@wmf/1.38.0-wmf.20] ContentHandler: Avoding saving in ParserCache in search index jobs

https://gerrit.wikimedia.org/r/761420

Mentioned in SAL (#wikimedia-operations) [2022-02-10T18:40:19Z] <ladsgroup@deploy1002> Synchronized php-1.38.0-wmf.20/includes/content/ContentHandler.php: Backport: [[gerrit:761420|ContentHandler: Avoding saving in ParserCache in search index jobs (T285993)]] (duration: 00m 50s)

Change 761419 merged by jenkins-bot:

[mediawiki/core@wmf/1.38.0-wmf.21] ContentHandler: Avoding saving in ParserCache in search index jobs

https://gerrit.wikimedia.org/r/761419

Mentioned in SAL (#wikimedia-operations) [2022-02-10T18:42:39Z] <ladsgroup@deploy1002> Synchronized php-1.38.0-wmf.21/includes/content/ContentHandler.php: Backport: [[gerrit:761419|ContentHandler: Avoding saving in ParserCache in search index jobs (T285993)]] (duration: 00m 50s)

ReleaseTaggerBot added a project: MW-1.38-notes (1.38.0-wmf.20; 2022-01-31).Feb 10 2022, 7:00 PM

Maintenance_bot removed a project: Patch-For-Review.Feb 10 2022, 7:12 PM

So with these fixes, I can confirm that reads, writes and storage on ParserCache have dropped drastically (this is just from the binlog getting smaller):
https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1&from=1643987600396&to=1644591181042

https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-job=All&var-server=pc1013&var-port=9104&from=1643986467835&to=1644591267835

This means we need to wait around a month and re-analyze everything.

Krinkle changed the task status from Stalled to Open.Feb 14 2022, 2:00 PM

Krinkle added a project: Wikimedia-Performance-publish.

Krinkle moved this task from Untriaged to Ready for write-up on the Wikimedia-Performance-publish board.

EBernhardson subscribed.Feb 25 2022, 8:38 PM

Twenty days after the CirrusSearch change, I did a number check: the number of entries in ParserCache has dropped by 5.6%, which is around 27M entries. There is more work that can be done there, and I will continue to look into it and work on it.
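As a quick sanity check on those figures, assuming the 481M total quoted earlier in this task is the baseline for the 5.6% drop (the comment does not state the baseline explicitly):

```python
# Assumption: the baseline is the 481M total quoted earlier in this task.
TOTAL_ENTRIES_BEFORE = 481_000_000
RELATIVE_DROP = 0.056  # 5.6%

dropped = TOTAL_ENTRIES_BEFORE * RELATIVE_DROP
remaining = TOTAL_ENTRIES_BEFORE - dropped

print(f"dropped:   {dropped / 1e6:.1f}M entries")
print(f"remaining: {remaining / 1e6:.1f}M entries")
```

Under that assumption the drop works out to roughly 27M entries, which is consistent with the number above.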

Ladsgroup moved this task from Blocked to In progress on the DBA board.Mar 7 2022, 10:09 AM

Change 775390 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] ParserOutputAccess: Allow calling getPO with option of not saving in PC

https://gerrit.wikimedia.org/r/775390

gerritbot added a project: Patch-For-Review.Mar 30 2022, 8:24 PM

Change 775390 merged by jenkins-bot:

[mediawiki/core@master] ParserOutputAccess: Allow calling getPO with option of not saving in PC

https://gerrit.wikimedia.org/r/775390

ReleaseTaggerBot added a project: MW-1.39-notes (1.39.0-wmf.6; 2022-04-04).Apr 1 2022, 3:00 PM

Esanders unsubscribed.Apr 5 2022, 1:52 PM

Change 777388 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.39.0-wmf.5] ParserOutputAccess: Allow calling getPO with option of not saving in PC

https://gerrit.wikimedia.org/r/777388

Change 777388 merged by jenkins-bot:

[mediawiki/core@wmf/1.39.0-wmf.5] ParserOutputAccess: Allow calling getPO with option of not saving in PC

https://gerrit.wikimedia.org/r/777388

Mentioned in SAL (#wikimedia-operations) [2022-04-05T15:42:04Z] <ladsgroup@deploy1002> Synchronized php-1.39.0-wmf.5/includes: Backport: [[gerrit:777388|ParserOutputAccess: Allow calling getPO with option of not saving in PC (T285993)]] (duration: 01m 00s)

ReleaseTaggerBot edited projects, added MW-1.39-notes (1.39.0-wmf.5; 2022-03-28); removed MW-1.39-notes (1.39.0-wmf.6; 2022-04-04).Apr 5 2022, 4:00 PM

We have a decent reduction in PC saves. Let's see how it ends up in a week or so:

Mentioned in SAL (#wikimedia-operations) [2022-04-19T01:02:16Z] <Amir1> turning on general logging in pc1012 (pc2) (T285993)

Mentioned in SAL (#wikimedia-operations) [2022-04-19T01:03:39Z] <Amir1> turning off general logging in pc1012 (pc2) (T285993)

I took a sample of read queries to ParserCache for a bit, which gave ~36K read queries. Since I had the rough timestamp of each query, I replayed them and collected the exptimes. The resulting distribution of exptime (per hour) is here:

The first local optimum shows up at around 40 hours (13%), so as an extreme measure, you could set the expiry to forty hours and reduce the size of the PC to 8% while still keeping 13% of hits.

One really interesting observation is the solid hit rate for expiry between 400 and 500 hours. While it is technically 20% of the PC, it provides 34% of the hits. This probably means we should increase the TTL to thirty days and revisit the value with new sampling. The optimum seems to be somewhere above 21 days (but hopefully somewhere below 30 days).
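For reference, the per-hour bucketing behind this kind of analysis could look roughly like the sketch below. It is not the script that was actually used. It assumes each sample is a pair of Unix timestamps (query time, exptime of the row that was read) and that the quantity of interest is the number of hours left until expiry at read time; both the data layout and the function names are assumptions.

```python
from collections import Counter


def hours_to_expiry_histogram(samples):
    """Count sampled reads per whole hour remaining until the entry expires.

    samples: iterable of (query_ts, exptime_ts) pairs in Unix seconds.
    """
    histogram = Counter()
    for query_ts, exptime_ts in samples:
        remaining_seconds = exptime_ts - query_ts
        if remaining_seconds < 0:
            continue  # entry had already expired when it was read
        histogram[int(remaining_seconds // 3600)] += 1
    return histogram


def share_in_range(histogram, low_hour, high_hour):
    """Share of sampled reads whose bucket falls in [low_hour, high_hour)."""
    total = sum(histogram.values())
    if total == 0:
        return 0.0
    in_range = sum(count for hour, count in histogram.items()
                   if low_hour <= hour < high_hour)
    return in_range / total
```

With real samples, `share_in_range(histogram, 400, 500)` would give the hit share for the 400 to 500 hour band discussed above.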

DAlangi_WMF subscribed.May 17 2022, 3:15 PM

Krinkle mentioned this in T301371: Preemptively warm caches for Parsoid output.May 19 2022, 4:47 PM

Boldly closing, as this seems adequate for the task at hand. Per the conclusion of your last comment, let's continue at T280604: Post-deployment: (partly) ramp parser cache retention back up.

Krinkle mentioned this in T280604: Post-deployment: (partly) ramp parser cache retention back up .Jun 7 2022, 4:55 PM

Maintenance_bot moved this task from In progress to Done on the DBA board.Jun 7 2022, 5:29 PM

	F35057482: res.png
	Apr 19 2022, 2:24 AM

	F35040914: image.png
	Apr 6 2022, 9:31 AM

	F34948514: image.png
	Feb 11 2022, 2:56 PM

	F34948524: image.png
	Feb 11 2022, 2:56 PM
