RevisionStore::newRevisionSlots() needs a cache
Closed, Resolved · Public

Description

Parsing the Alan Turing article on English Wikipedia currently causes 325 queries (list of them), of which 207 are from MediaWiki\Revision\RevisionStore::loadSlotRecords. The function that calls it (RevisionStore::newRevisionSlots) even has this note:

// XXX: do we need the same kind of caching here

It seems we do indeed need some caching here, probably two layers: one WAN (long TTL) and one APCu (short TTL).
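
For illustration, a minimal sketch of what that two-layer wrapping could look like, assuming a WANObjectCache plus the local-server (APCu) BagOStuff. This is not the merged patch; the method name, cache key parts, and TTLs here are made up, and the loadSlotRecords() signature is approximated (a later follow-up change injects the local cache into RevisionStore rather than fetching it from the service locator):

```php
use MediaWiki\MediaWikiServices;

// Sketch of a RevisionStore method: a per-server APCu layer with a short TTL
// in front of the shared WAN cache with a long TTL; the database is only hit
// on a double miss.
private function getSlotRowsCached( $revId, $queryFlags ) {
	$srvCache = MediaWikiServices::getInstance()->getLocalServerObjectCache();
	$wanCache = MediaWikiServices::getInstance()->getMainWANObjectCache();

	return $srvCache->getWithSetCallback(
		$srvCache->makeKey( 'revision-slots', $revId ),
		$srvCache::TTL_HOUR, // short TTL: APCu is per-server and cheap to refill
		function () use ( $wanCache, $revId, $queryFlags ) {
			return $wanCache->getWithSetCallback(
				$wanCache->makeKey( 'revision-slots', $revId ),
				WANObjectCache::TTL_WEEK, // long TTL: shared across the cluster
				function () use ( $revId, $queryFlags ) {
					// Double miss: fall back to the existing DB lookup
					return $this->loadSlotRecords( $revId, $queryFlags );
				}
			);
		}
	);
}
```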

Event Timeline

Change 744097 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] Revision: Wrap loading slot info with two layers of cache

https://gerrit.wikimedia.org/r/744097

Ladsgroup triaged this task as Medium priority. Dec 6 2021, 9:23 PM
Ladsgroup moved this task from Triage to In progress on the DBA board.

I will test this tomorrow ^

This area has been refactored a lot over the past 1-2 years. It may be worth checking what this code looked like two years ago in Git, to see whether there was caching around here, or perhaps higher up in the stack in a more efficient way rather than exactly here.

The reason I suspect there was caching here is that this is quite a common use case, core to how MW works, with seemingly high impact. As such, it seems likely that these queries are one of the contributing factors to the major save timing regression we've seen.

I have no objection to adding caching here, but I think it is worth understanding how it worked before, as the loss of caching may not be the only thing lost in the refactor. There are likely other pieces of information, considerations, or functional problems that were part of the same loss; you'd want to know about those, and they could lead to a better understanding and/or outcome.

Honestly, given that no one actually looks at queries made in requests, this could have been like this since the mcr refactor and we didn't notice it until now. I have seen similar issues everywhere.

I did not consciously remove any caching during the refactor, other than perhaps fields on the Revision object that used to be lazy-loaded and cached, and are not present on RevisionRecord. But I can't think of any.

That doesn't mean of course that I didn't inadvertently remove some caching, or perhaps I'm misremembering. Just sharing my recollection fwiw.

This function (from my understanding) returns slot information for a given revision. It didn't need caching before the MCR refactors.

I think something is still wrong in the function: private function newRevisionSlots takes $revisionRow but doesn't use it at all.

And to be clear: I'm not criticizing the work. It was a massive endeavor, and things always fall through the cracks in changes that large. It's 100% fine. My problem is a general one: no one audits our systems on common requests to see if they are making fishy db queries, too many network calls, etc.

I think I sort of found the underlying problem. getRevisionByTitle, which is widely used, merges everything it needs into one query but doesn't add the slots, which triggers an extra query before we can even look at the content from cache. I'll dig more.

I take that back; I don't think it's possible to fix this without a major refactor of the RevisionStore class, which I would be more than happy to do once we can simplify its logic by removing the actor and comment migration bits and pieces. For future reference, the problem boils down to newRevisionFromRowAndSlots only getting the $slots parameter set in the dumper; in all other cases it's just null, and callers would have to load the revision row in the same query as the slots. But that's for the future.
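
As a hypothetical illustration of that "load the revision row in the same query as the slots" idea (table and column names are from the MCR schema; the real RevisionStore query construction differs and goes through its own query-info helpers, and $dbr here is just a replica IDatabase handle):

```php
// Hypothetical: fetch the revision row together with its slot and content rows
// in a single SELECT, so the slot rows could be handed to
// newRevisionFromRowAndSlots() instead of leaving $slots null.
$rows = $dbr->select(
	[ 'revision', 'slots', 'content' ],
	[
		'rev_id', 'rev_page', 'rev_timestamp',
		'slot_role_id', 'slot_origin',
		'content_id', 'content_size', 'content_sha1', 'content_model', 'content_address',
	],
	[ 'rev_id' => $revId ],
	__METHOD__,
	[],
	[
		'slots' => [ 'JOIN', 'slot_revision_id = rev_id' ],
		'content' => [ 'JOIN', 'content_id = slot_content_id' ],
	]
);
// One result row per slot; the revision fields repeat on each row.
```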

Tested the patch on mwdebug1002 by purging Alan Turing. The count went from 451 queries (other caches cold) / 437 (other caches warm) down to 232 / 228. With the number of purges we have, this translates to at least 1M fewer queries per second.

The 207 calls are needed once per template/module to get the main slot content (91 modules + 114 templates on the page Alan Turing). They go through Parser::statelessFetchRevisionRecord, which uses RevisionStore::getKnownCurrentRevision. There is no slot support in that function.

But that could only work for the main slot. When there are more slots on a page, the result would have more rows, which are not collected by the IDatabase::selectRow being used. RevisionStore would not lazy-load more slots if requested. RevisionRenderer requests all slots from the RevisionRecord used in the Parser, so in an MCR world the select of all slots cannot be avoided.

The Parser could instead prefetch the transcluded templates with all their main slots, but batching the Parser would be harder work than everything else.

Or RevisionStore::getKnownCurrentRevision could also cache the slots (in RevisionStore::loadSlotRecords) for a week, like the fetched revision. But that is already done in your patch set.
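
To make those per-template lookups concrete, a simplified sketch of how the slot loading is deferred, based on a reading of RevisionStore::newRevisionSlots (exact signatures are approximated):

```php
// Inside RevisionStore (sketch): the slot records are wrapped in a lazy
// callback, so loadSlotRecords() only runs when something actually reads the
// slot content. During a parse that happens once per transcluded
// template/module, hence the ~200 extra queries without a cache in front of it.
$slots = new RevisionSlots( function () use ( $revId, $queryFlags ) {
	return $this->loadSlotRecords( $revId, $queryFlags );
} );
```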

That's really complicated, in the sense that we don't know what the transcluded pages might bring, and most of the time they bring more pages to be transcluded.

So you think my patch is good to go? :D

Honestly, given that no one actually looks at queries made in requests, this could have been like this since the mcr refactor and we didn't notice it until now. […]

I agree it's quite likely that we would not have noticed DB queries like you did. However, save timing has definitely regressed during the various core refactors after that, which we did notice, so it seems plausible that some undocumented optimisation or caching was lost along the way, possibly around the area being described in this ticket.

As such, I think it may be useful to look at how this worked before all that, to perhaps uncover a logical requirement or a more optimal way to cache this that may have been lost, rather than starting from scratch. I suspect that if there was some subtle reason for not doing it this way, we'd likely not find out immediately; but it also seems complex enough that it could easily cause corruption in some way that we'd not notice for a good while. I'd just like to rule it out.

Of course, it's very possible we'll find nothing, that it won't cause any subtle issue, and that the perf regression(s) are caused by other changes. The change, at a glance, LGTM.

That is what I did after your comment. I went through the code in depth; at first I thought we could fix it without adding a cache, but then I realized it's not possible without a major refactor of the function: T297147#7563670

Change 747699 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.38.0-wmf.13] Revision: Add two caching layers to loadSlotRecords for template pages

https://gerrit.wikimedia.org/r/747699

Change 747700 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.38.0-wmf.12] Revision: Add two caching layers to loadSlotRecords for template pages

https://gerrit.wikimedia.org/r/747700

Change 744097 merged by jenkins-bot:

[mediawiki/core@master] Revision: Add two caching layers to loadSlotRecords for template pages

https://gerrit.wikimedia.org/r/744097

Change 747699 merged by jenkins-bot:

[mediawiki/core@wmf/1.38.0-wmf.13] Revision: Add two caching layers to loadSlotRecords for template pages

https://gerrit.wikimedia.org/r/747699

Change 747700 merged by jenkins-bot:

[mediawiki/core@wmf/1.38.0-wmf.12] Revision: Add two caching layers to loadSlotRecords for template pages

https://gerrit.wikimedia.org/r/747700

Mentioned in SAL (#wikimedia-operations) [2021-12-16T16:55:02Z] <ladsgroup@deploy1002> Synchronized php-1.38.0-wmf.13/includes/Revision/RevisionStore.php: Backport: [[gerrit:747699|Revision: Add two caching layers to loadSlotRecords for template pages (T297147)]] (duration: 01m 06s)

Mentioned in SAL (#wikimedia-operations) [2021-12-16T16:56:52Z] <ladsgroup@deploy1002> Synchronized php-1.38.0-wmf.12/includes/Revision/RevisionStore.php: Backport: [[gerrit:747700|Revision: Add two caching layers to loadSlotRecords for template pages (T297147)]] (duration: 01m 06s)

Measurement time:
The markers show the time the patches got deployed

image.png (968×1 px, 120 KB)

and
image.png (968×1 px, 256 KB)

The impact is noticeable: 50 MB/s removed from DB traffic, and between 20K and 50K ops/s shaved off as well. It won't stay like this; it will keep dropping over the next week. The WAN cache hit rate is already at 70%, and given the TTL of a week, I assume it'll reach a much higher hit rate and reduce even more load.

It's a bit hard to measure APCu cache impact but I'm trying.

And the APCu impact is also good, showing a reduction in the latency of saving edits:

image.png (968×1 px, 205 KB)

Right after the deployment, the current value goes below last week's value, while it was the opposite until the deployment.

Change 748180 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] Revision: Inject local cache to RevisionStore

https://gerrit.wikimedia.org/r/748180

Change 748180 merged by jenkins-bot:

[mediawiki/core@master] Revision: Inject local cache to RevisionStore

https://gerrit.wikimedia.org/r/748180

Ladsgroup moved this task from In progress to Done on the DBA board.

I don't think there is anything missing here; we should create tickets for the follow-ups.