
Significant ParserCache space increase after 2021-08-12 (1.37.0-wmf.18 regression)
Closed, ResolvedPublic

Description

https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1&from=1626553020741&to=1629145020742

Last 30 days
Screenshot 2021-08-16 at 21.17.08.png (1×808 px, 145 KB)
Last week
Screenshot 2021-08-16 at 21.16.10.png (1×2 px, 432 KB)
12 Aug 2021
Screenshot 2021-08-16 at 21.32.18.png (1×2 px, 369 KB)

See also:


cc @ppelberg and the Editing team. I don't suspect DiscussionTools at this time, but this is for your general awareness, as you are a concerned third party around ParserCache at the moment (T280599, T280606). Though if your team does know of a potential cause — one of your features consuming more parser cache space as of last week — let us know here!


Event Timeline

Krinkle updated the task description.
Marostegui renamed this task from Significant ParserCache space increase after 2021-08-21 (1.37.0-wmf.18 regression) to Significant ParserCache space increase after 2021-07-21 (1.37.0-wmf.18 regression).Aug 17 2021, 4:16 AM
Marostegui renamed this task from Significant ParserCache space increase after 2021-07-21 (1.37.0-wmf.18 regression) to Significant ParserCache space increase after 2021-08-12 (1.37.0-wmf.18 regression).Aug 17 2021, 4:26 AM

I can see a new pattern in the writes on these hosts after the 12th:

Captura de pantalla 2021-08-17 a las 6.29.27.png (737×1 px, 174 KB)

Captura de pantalla 2021-08-17 a las 6.28.47.png (686×1 px, 166 KB)

Those spikes in INSERTs sort of match the MW deployments seen at https://sal.toolforge.org/production?p=0&q=&d=2021-08-12
Unfortunately I can no longer check what changed in the binlogs that day, as we only store 24h of binlogs (due to disk space constraints).

What looks very strange is that we no longer have REPLACE statements in the binlogs, only INSERTs now; this is definitely new.
My guess is that instead of replacing rows we are simply adding new ones, hence the increase.

This is what the current inserts look like:

INSERT /* SqlBagOStuff::modifyTableSpecificBlobsForSet  */ INTO `pc012` (keyname,value,exptime) VALUES ('wikidatawiki:pcache:idoptions:103391394','E<8C><B1>\n<C2>0^TE<FF><E5><CE>^Y<F2><92>@c^\<C5><C1>I^Q<9D>K<B5>^O\rhESC<92>T-<A5><FF>n<C0><82><DB><E5>\\<CE>I<8E><CC><CA>a¡
<89><89>ω<DB>}Ⱦ<EF>^R܄!q|4<DD>\r.ǁ^Eޗߚ^E6<CD><F5><CE><DB>O<F0>q<84>#K<C6>H<B9>Г^?2^\<94>T$-U<D2>h\"<89><E5><<F2>˧<D2>ߵE3Vi])\"<81>:<8F><81><EB><A2><FD>^S3<D6>_','20210907043112') ON DUPLICATE KEY UPDATE value='E<8C><B1>\n<C2>0^TE<FF><E5><CE>^Y<F2><92>@c^\<C5><C1>I^Q<9D>K
<B5>^O\rhESC<92>T-<A5><FF>n<C0><82><DB><E5>\\<CE>I<8E><CC><CA>a¡<89><89>ω<DB>}Ⱦ<EF>^R܄!q|4<DD>\r.ǁ^Eޗߚ^E6<CD><F5><CE><DB>O<F0>q<84>#K<C6>H<B9>Г^?2^\<94>T$-U<D2>h\"<89><E5><<F2>˧<D2>ߵE3Vi])\"<81>:<8F><81><EB><A2><FD>^S3<D6>_',exptime='20210907043112'
/*!*/;

And looking at an old binlog we had stored somewhere (it is from 2018, but the format/statements haven't changed much since then):

REPLACE /* SqlBagOStuff::setMulti  */ INTO `pc072` (keyname,value,exptime) VALUES ('commonswiki:pcache:idoptions:16041943','E<8E><C1>\n<C2>0^PD<FF>e?@<B2>iR<EB><F6>(^^<^UD<BD>W<BB><98><80><AD><92>mE-<FE><BB>IU<BC>\ro<86><99><A9>hA<B0><AC><8F><8E><B7><BE>e K<A3>^Pj<82>v\'
<DC>T<D7><DE>_:<81><B2>&C<A3>\'U\n^U^D<83>p8<D7><DD>     JO^XQ<AC><E8><DD><D0>^^<C4>?91^]^Yf^D<D2>^O<87><DE>^E^Vw97<C9>Ȓ<BA>^X<E6><E0><B8>n|w<8A>����<DD>s<90><B8>^G1e      p<96><CF>LҨ<A2><F7><FF><F8><ED><D0>\n^KD̔չ<B2>^Sտ<DC><EA>~<F5><E1>1-<E6>*<9D><C6><FC>gm<F8><E6><D3><C8>

Have we changed from REPLACE to INSERT for any reason?
Looking at that changelog, the only parsercache-related change I can see is: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/709812/3/includes/parser/PPFrame_Hash.php
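For reference, the two statement forms above differ in write behavior even though both are upserts: REPLACE deletes the conflicting row and inserts a fresh one, while INSERT ... ON DUPLICATE KEY UPDATE modifies the existing row in place. A minimal sketch of the difference (not the MediaWiki code; table and key names are made up), using SQLite's equivalent syntax since it is easy to run locally:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pc (keyname TEXT PRIMARY KEY, value TEXT, exptime TEXT)")
conn.execute("INSERT INTO pc VALUES ('k1', 'old', '20210101000000')")

# REPLACE semantics: the conflicting row is deleted, then a new row is inserted.
conn.execute("INSERT OR REPLACE INTO pc VALUES ('k1', 'new-a', '20210201000000')")

# Upsert semantics: the existing row is updated in place on key conflict
# (SQLite's ON CONFLICT ... DO UPDATE, equivalent to MySQL's
# INSERT ... ON DUPLICATE KEY UPDATE).
conn.execute(
    "INSERT INTO pc (keyname, value, exptime) VALUES ('k1', 'new-b', '20210301000000') "
    "ON CONFLICT(keyname) DO UPDATE SET value = excluded.value, exptime = excluded.exptime"
)

rows = conn.execute("SELECT keyname, value FROM pc").fetchall()
print(rows)  # [('k1', 'new-b')]
```

Either way the table ends up with a single row per key, so the upsert by itself would not grow the table's row count; the two forms differ in how the write is performed and logged.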

Marostegui moved this task from Triage to In progress on the DBA board.

This is the graph of pc2009, which is not affected by any of the new-host changes: the new host is only serving pc1, and pc2009 belongs to pc3.

I have zoomed in on the traffic change in the graph: https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=2&var-server=pc2009&var-port=9104&from=1628692800103&to=1628822904589

Captura de pantalla 2021-08-17 a las 9.56.30.png (798×1 px, 226 KB)

The first spike doesn't seem to match any code push according to https://sal.toolforge.org/production?p=0&q=&d=2021-08-12: it seems to happen at 19:30, and the first push is at 19:34, so unless there's a time mismatch there...
The second one, though, matches almost perfectly with this:

21:30	<twentyafterfour@deploy1002>	rebuilt and synchronized wikiversions files: all wikis to 1.37.0-wmf.18 refs T281159

I am pretty sure it is https://phabricator.wikimedia.org/rMWd56a686f83eb7d19abe9aced0af1985f25025901, which the perf team designed :-) (according to the history, @aaron wrote it and @Krinkle merged it).

That changed what used to be replace() calls into upsert() calls, which just do INSERTs. It could be an underlying bug in upsert() or in the parsercache tables, but I would restore the REPLACEs, as this breaks the agreed multi-DC logic.

Change 713364 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[mediawiki/core@master] Revert "objectcache: make use of new `modtoken` field in SqlBagOStuff"

https://gerrit.wikimedia.org/r/713364

I'm planning to merge the revert and backport it ASAP if DBAs are fine with that.

Not a DBA, but please note above I was suggesting restoring the replace() behavior, not necessarily a full revert; maybe there is a chance for a smaller change instead? I don't have a strong opinion, just wanted to clarify my previous comment. I'd prefer if the perf team could weigh in on it, if possible.

> I'm planning to merge the revert and backport it ASAP if DBAs are fine with that.

I would suggest talking to Aaron or Timo first; if nothing has been found, or the fix would take time and won't be ready before the weekend, then revert before Friday.

Timo suggested reverting for now; it seems they are quite busy at the moment.

Change 713365 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[mediawiki/core@wmf/1.37.0-wmf.19] Revert "objectcache: make use of new `modtoken` field in SqlBagOStuff"

https://gerrit.wikimedia.org/r/713365

Change 713506 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[mediawiki/core@wmf/1.37.0-wmf.18] Revert "objectcache: make use of new `modtoken` field in SqlBagOStuff"

https://gerrit.wikimedia.org/r/713506

Yep, full revert please. This is a 1000-line refactor for T212129 that mainly revolves around changing which queries we do. There is no simpler version that would still use replace().

Change 713364 merged by jenkins-bot:

[mediawiki/core@master] Revert "objectcache: make use of new `modtoken` field in SqlBagOStuff"

https://gerrit.wikimedia.org/r/713364

Change 713365 merged by jenkins-bot:

[mediawiki/core@wmf/1.37.0-wmf.19] Revert "objectcache: make use of new `modtoken` field in SqlBagOStuff"

https://gerrit.wikimedia.org/r/713365

Change 713506 merged by jenkins-bot:

[mediawiki/core@wmf/1.37.0-wmf.18] Revert "objectcache: make use of new `modtoken` field in SqlBagOStuff"

https://gerrit.wikimedia.org/r/713506

Mentioned in SAL (#wikimedia-operations) [2021-08-17T17:39:11Z] <ladsgroup@deploy1002> Synchronized php-1.37.0-wmf.19/includes/: Backport: [[gerrit:713365|Revert "objectcache: make use of new modtoken field in SqlBagOStuff" (T288998)]] (duration: 01m 14s)

Mentioned in SAL (#wikimedia-operations) [2021-08-17T17:44:00Z] <ladsgroup@deploy1002> Synchronized php-1.37.0-wmf.18/includes/: Backport: [[gerrit:713506|Revert "objectcache: make use of new modtoken field in SqlBagOStuff" (T288998)]] (duration: 01m 13s)

After the revert the disk space has gone back to stable values and started to decrease again (I assume after the purges complete):
https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1&from=1629089764998&to=1629262564998

Captura de pantalla 2021-08-18 a las 6.57.18.png (567×1 px, 106 KB)

The REPLACE vs INSERT traffic pattern went back to normal after the revert too.

I would suggest waiting until Monday to see if the disk usage pattern stays this way, and if so, closing this as fixed.

I created T289110 as a follow-up code-troubleshooting task so we can close this one. @Krinkle feel free to merge it as a duplicate if you already have one.

I'll retry the patch based on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/574101 (which would let me avoid repeating the blob in the ON DUPLICATE KEY UPDATE clause). That should avoid the binlog bloat.
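The bloat comes from the upsert statement containing the full blob literal twice: once in VALUES and once more in the ON DUPLICATE KEY UPDATE clause, so each statement-based binlog entry is roughly twice the size of the old REPLACE. A rough illustration with made-up sizes (in MySQL the second copy can be replaced by a VALUES(col) reference; that is my reading of the approach, not a quote of the actual patch):

```python
# Stand-in for a compressed parser-cache blob (size is hypothetical).
blob = "B" * 32 * 1024

# Current form: the blob literal appears twice in the statement.
repeated = (
    f"INSERT INTO pc012 (keyname,value,exptime) VALUES ('k','{blob}','e') "
    f"ON DUPLICATE KEY UPDATE value='{blob}',exptime='e'"
)

# Referencing form: MySQL's VALUES(col) points back at the inserted value,
# so the blob literal appears only once.
referenced = (
    f"INSERT INTO pc012 (keyname,value,exptime) VALUES ('k','{blob}','e') "
    "ON DUPLICATE KEY UPDATE value=VALUES(value),exptime=VALUES(exptime)"
)

print(len(repeated), len(referenced))  # the repeated form is roughly twice as long
```

With statement-based binlogs, the statement text itself is what gets written to disk, so halving the statement size roughly halves the binlog growth per parser-cache write.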

@aaron Can we hold off on deploying this patch until we are done with T280599 (the DiscussionTools deployment, not necessarily the retention increase)? The Editing team has been waiting for their turn for quite a while now, and I'd rather avoid introducing additional variables into the mix.

Krinkle claimed this task.

Space went back down after 24h in the graphs, which confirms the binlog theory, and it has stayed down since then. Marking as resolved.