RevisionStore::newRevisionSlots() needs a cache
Closed, Resolved · Public

Description

Parsing the Alan Turing article on English Wikipedia currently causes 325 queries (list of them), of which 207 are from MediaWiki\Revision\RevisionStore::loadSlotRecords. The function that calls it (RevisionStore::newRevisionSlots) even has this note:

// XXX: do we need the same kind of caching here

It seems we do indeed need some caching here, probably two layers: one WAN (long TTL) and one APCu (short TTL).
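
For illustration, a minimal sketch of what that two-layer wrapping could look like, assuming a WANObjectCache plus the local-server (APCu) BagOStuff. This is not the merged patch; the method name, cache key parts, and TTLs here are made up, and the loadSlotRecords() signature is approximated (a later follow-up change injects the local cache into RevisionStore rather than fetching it from the service locator):

```php
use MediaWiki\MediaWikiServices;

// Sketch of a RevisionStore method: a per-server APCu layer with a short TTL
// in front of the shared WAN cache with a long TTL; the database is only hit
// on a double miss.
private function getSlotRowsCached( $revId, $queryFlags ) {
	$srvCache = MediaWikiServices::getInstance()->getLocalServerObjectCache();
	$wanCache = MediaWikiServices::getInstance()->getMainWANObjectCache();

	return $srvCache->getWithSetCallback(
		$srvCache->makeKey( 'revision-slots', $revId ),
		$srvCache::TTL_HOUR, // short TTL: APCu is per-server and cheap to refill
		function () use ( $wanCache, $revId, $queryFlags ) {
			return $wanCache->getWithSetCallback(
				$wanCache->makeKey( 'revision-slots', $revId ),
				WANObjectCache::TTL_WEEK, // long TTL: shared across the cluster
				function () use ( $revId, $queryFlags ) {
					// Double miss: fall back to the existing DB lookup
					return $this->loadSlotRecords( $revId, $queryFlags );
				}
			);
		}
	);
}
```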

Event Timeline

Change 744097 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] Revision: Wrap loading slot info with two layers of cache

https://gerrit.wikimedia.org/r/744097

Ladsgroup triaged this task as Medium priority. Dec 6 2021, 9:23 PM
Ladsgroup moved this task from Triage to In progress on the DBA board.

I will test this tomorrow ^

This area has been refactored a lot over the past 1-2 years. It may be worth checking what this code looked like two years ago in Git, to see whether there was caching around here, or perhaps higher up in the stack in a more efficient way rather than exactly here.

The reason I suspect there was caching here is that this is quite a common use case, core to how MW works, with seemingly high impact. As such, it seems likely that these queries are one of the contributing factors to the major save timing regression we've seen.

I have no objection to adding caching here, but I think it is worth understanding how it worked before, as the loss of caching may not be the only thing lost in the refactor. There are likely other pieces of information, considerations, or functional problems that were part of the same loss; you'd want to know about those, and they could lead to a better understanding and/or outcome.

Honestly, given that no one actually looks at queries made in requests, this could have been like this since the mcr refactor and we didn't notice it until now. I have seen similar issues everywhere.

I did not consciously remove any caching during the refactor, other than perhaps fields on the Revision object that used to be lazy-loaded and cached, and are not present on RevisionRecord. But I can't think of any.

That doesn't mean of course that I didn't inadvertently remove some caching, or perhaps I'm misremembering. Just sharing my recollection fwiw.

This function (from my understanding) returns slot information for a given revision. It didn't need caching before the MCR refactors.

I think something is still wrong in the function: private function newRevisionSlots takes $revisionRow but doesn't use it at all.

And to be clear: I'm not criticizing the work. It was a massive endeavor, and things always fall through the cracks in changes that large. It's 100% fine. My problem is a general one: no one audits our systems on common requests to see if they are making fishy db queries, too many network calls, etc.

I think I sort of found the underlying problem. getRevisionByTitle, which is widely used, merges everything it needs into one query but doesn't add the slots, which triggers an extra query before we can even look at the content from cache. I'll dig more.

I take that back; I don't think it's possible to fix this without a major refactor of the RevisionStore class, which I would be more than happy to do once we can simplify its logic by removing the actor and comment migration bits and pieces. For future reference, the problem boils down to newRevisionFromRowAndSlots only getting the $slots parameter set in the dumper; in all other cases it's just null, and callers would have to load the revision row in the same query as the slots. But that's for the future.
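
As a hypothetical illustration of that "load the revision row in the same query as the slots" idea (table and column names are from the MCR schema; the real RevisionStore query construction differs and goes through its own query-info helpers, and $dbr here is just a replica IDatabase handle):

```php
// Hypothetical: fetch the revision row together with its slot and content rows
// in a single SELECT, so the slot rows could be handed to
// newRevisionFromRowAndSlots() instead of leaving $slots null.
$rows = $dbr->select(
	[ 'revision', 'slots', 'content' ],
	[
		'rev_id', 'rev_page', 'rev_timestamp',
		'slot_role_id', 'slot_origin',
		'content_id', 'content_size', 'content_sha1', 'content_model', 'content_address',
	],
	[ 'rev_id' => $revId ],
	__METHOD__,
	[],
	[
		'slots' => [ 'JOIN', 'slot_revision_id = rev_id' ],
		'content' => [ 'JOIN', 'content_id = slot_content_id' ],
	]
);
// One result row per slot; the revision fields repeat on each row.
```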

Tested the patch on mwdebug1002 by purging Alan Turing. The count went from 451 queries (other caches cold) / 437 (other caches warm) down to 232 / 228. With the number of purges we have, this translates to at least 1M fewer queries per second.

The 207 calls are needed once per template/module to get the main slot content (91 modules + 114 templates on the page Alan Turing). They go through Parser::statelessFetchRevisionRecord, which uses RevisionStore::getKnownCurrentRevision. There is no slot support in that function.

But that could only work for the main slot. When there are more slots on a page, the result would have more rows, which are not collected by the IDatabase::selectRow being used. RevisionStore would not lazy-load more slots if requested. RevisionRenderer requests all slots from the RevisionRecord used in the Parser, so in an MCR world the select of all slots cannot be avoided.

The Parser could instead prefetch the transcluded templates with all their main slots, but batching the Parser would be harder work than everything else.

Or RevisionStore::getKnownCurrentRevision could also cache the slots (in RevisionStore::loadSlotRecords) for a week, like the fetched revision. But that is already done in your patch set.
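
To make those per-template lookups concrete, a simplified sketch of how the slot loading is deferred, based on a reading of RevisionStore::newRevisionSlots (exact signatures are approximated):

```php
// Inside RevisionStore (sketch): the slot records are wrapped in a lazy
// callback, so loadSlotRecords() only runs when something actually reads the
// slot content. During a parse that happens once per transcluded
// template/module, hence the ~200 extra queries without a cache in front of it.
$slots = new RevisionSlots( function () use ( $revId, $queryFlags ) {
	return $this->loadSlotRecords( $revId, $queryFlags );
} );
```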

That's really complicated, in the sense that we don't know what the transcluded pages might bring, and most of the time they bring more pages to be transcluded.

So you think my patch is good to go? :D

Honestly, given that no one actually looks at queries made in requests, this could have been like this since the mcr refactor and we didn't notice it until now. […]

I agree it's quite likely that we would not have noticed DB queries like you did. However, save timing has definitely regressed during the various core refactors after that, which we did notice, so it seems plausible that some undocumented optimisation or caching was lost along the way, possibly around the area being described in this ticket.

As such, I think it may be useful to look at how this worked before all that, to perhaps uncover a logical requirement or a more optimal way to cache this that may have been lost, rather than starting from scratch. I suspect that if there was some subtle reason for not doing it this way, we'd likely not find out immediately; but it also seems complex enough that it could easily cause corruption in some way that we'd not notice for a good while. I'd just like to rule it out.

Of course, it's very possible we'll find nothing, that it won't cause any subtle issue, and that the perf regression(s) are caused by other changes. The change, at a glance, LGTM.

That is what I did after your comment. I went through the code in depth; at first I thought we could fix it without adding a cache, but then I realized it's not possible without a major refactor of the function: T297147#7563670

Change 747699 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.38.0-wmf.13] Revision: Add two caching layers to loadSlotRecords for template pages

https://gerrit.wikimedia.org/r/747699

Change 747700 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.38.0-wmf.12] Revision: Add two caching layers to loadSlotRecords for template pages

https://gerrit.wikimedia.org/r/747700

Change 744097 merged by jenkins-bot:

[mediawiki/core@master] Revision: Add two caching layers to loadSlotRecords for template pages

https://gerrit.wikimedia.org/r/744097

Change 747699 merged by jenkins-bot:

[mediawiki/core@wmf/1.38.0-wmf.13] Revision: Add two caching layers to loadSlotRecords for template pages

https://gerrit.wikimedia.org/r/747699

Change 747700 merged by jenkins-bot:

[mediawiki/core@wmf/1.38.0-wmf.12] Revision: Add two caching layers to loadSlotRecords for template pages

https://gerrit.wikimedia.org/r/747700

Mentioned in SAL (#wikimedia-operations) [2021-12-16T16:55:02Z] <ladsgroup@deploy1002> Synchronized php-1.38.0-wmf.13/includes/Revision/RevisionStore.php: Backport: [[gerrit:747699|Revision: Add two caching layers to loadSlotRecords for template pages (T297147)]] (duration: 01m 06s)

Mentioned in SAL (#wikimedia-operations) [2021-12-16T16:56:52Z] <ladsgroup@deploy1002> Synchronized php-1.38.0-wmf.12/includes/Revision/RevisionStore.php: Backport: [[gerrit:747700|Revision: Add two caching layers to loadSlotRecords for template pages (T297147)]] (duration: 01m 06s)

Measurement time:
The markers show the time the patches got deployed

image.png (968×1 px, 120 KB)

and
image.png (968×1 px, 256 KB)

The impact is noticeable: 50 MB/s removed from DB traffic, and between 20K and 50K ops/s shaved off as well. It won't stay like this; it will keep dropping over the next week. The WAN cache hit rate is already at 70%, and given the TTL of a week, I assume it'll reach a much higher hit rate and reduce even more load.

It's a bit hard to measure APCu cache impact but I'm trying.

And the APCu impact is also good, showing a reduction in the latency of saving edits:

image.png (968×1 px, 205 KB)

Right after the deployment, the current value goes below last week's value, while it was the opposite until the deployment.

Change 748180 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] Revision: Inject local cache to RevisionStore

https://gerrit.wikimedia.org/r/748180

Change 748180 merged by jenkins-bot:

[mediawiki/core@master] Revision: Inject local cache to RevisionStore

https://gerrit.wikimedia.org/r/748180

Ladsgroup moved this task from In progress to Done on the DBA board.

I don't think there is anything missing here; we should create tickets for the follow-ups.