
Strategy for storing parser output for "old revision" (Popular diffs and permalinks)
Closed, Resolved · Public

Description

Page https://en.wikipedia.org/w/index.php?title=European_Union&type=revision&diff=938561921&oldid=938557616 became quite popular due to a Reddit comment, and we have noticed that it takes about 15s to render.

Profiling data: https://phabricator.wikimedia.org/P10297

Both are running 1.35.0-wmf.16 at the moment.

Event Timeline


Links to old (non-current) versions do not use the parser cache. This means that rendering will always require a full parse.

[CUT]

Some sort of parser caching could be considered for old links that get high traffic. We do not want to waste space on the pcXXXX MariaDB servers, nor flood memcached's LRU with large blobs, so there would have to be some "hotness" estimation logic, like a hit counter.

Instead of caching, we should just rate-limit parsing of old revisions to N concurrent revisions per user or IP, probably via poolcounter.

Note that even old version parses for such page views go through PoolWorkArticleView. They just don't use memcached.

What about tuning HTTP frontend caching? The latest revision is very dynamic, but a hardcoded diff could perhaps return stale results more often, as I can only see it changing on deletion or page move?

Note that even old version parses for such page views go through PoolWorkArticleView. They just don't use memcached.

Sure, they do, but IIRC that limits the number of times we do parse that revision of that article in concurrency. What I am proposing is limiting the number of re-parses any single user is allowed to perform too, in order to rate-limit scrapers that generate expensive operations.

aaron removed aaron as the assignee of this task. Feb 10 2020, 9:28 PM
aaron added a project: Platform Engineering.
aaron moved this task from Doing (old) to Radar on the Performance-Team board.
aaron edited projects, added Performance-Team (Radar); removed Performance-Team.
aaron subscribed.
Anomie subscribed.

Sure, they do, but IIRC that limits the number of times we do parse that revision of that article in concurrency. What I am proposing is limiting the number of re-parses any single user is allowed to perform too, in order to rate-limit scrapers that generate expensive operations.

While that seems a valid request in general, I don't think it would have any effect on this task. This is about many users/IPs hitting something, not just one.

Correct - we have two issues we've seen and I coalesced the two. We've had scrapers requesting a lot of diffs all at once.

I've gone through dozens of very old diffs from unpopular pages (hoping for a cache miss). From what I can tell, the slowdown is mainly from the page rendering of the old revision below the diff. The majority of time is not spent in diff logic, and indeed I was unable to reproduce a timeout on any additional obscure diffs I viewed once I started using diffonly=1 (also available in the user preferences).

This narrows the issue down to wikitext parsing of non-current revisions, which affects any permalink (e.g. with an oldid query) regardless of whether it has a diff on top.

As an example:

Some further brainstorming:

ParserCache?
  • Do we allow old revisions to be stored in ParserCache?

Allowing this in general might be risky, as it could grow uncontrollably, e.g. when every revision ever is being crawled (afaik we currently don't have LRU in ParserCache; we only evict stuff based on a 30-day TTL).

Memcache? Separate ParserCache?
  • If not storing in ParserCache generally, then when, how and where?

Memcached doesn't seem suitable for storing thousands/millions of large multi-megabyte blobs that have a high chance of never being used again. It would fill up the backends and evict more important stuff.

Perhaps an isolated "mini" ParserCache cluster could be used for this purpose, keeping it separate from the main cluster. Worst case, it becomes full until entries expire, at which point additional old revisions can't be cached for a while.

Popularity counter and per-user rate limit

Perhaps cache them only after a certain popularity threshold. If so, that would likely require storing counters somewhere, which would surface the same issue as before, but on a smaller scale and with shorter TTLs. Even if it's just a small integer, allowing users to trigger the creation of such a counter for every old revision ever, of every page and wiki, could still cause major churn on Memcached.

Maybe it can be contained if the TTL is very low, e.g. a few minutes at most. This would likely need to go hand-in-hand with a new $wgRateLimits entry that controls how many uncached old revisions can be generated by a given user/IP. We'd fall back to a localised error message under a 4xx error, asking the user to slow down and try again later.
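For illustration, a minimal sketch of what such a limit could look like with the existing $wgRateLimits mechanism. The 'renderoldrevision' action name and the numbers are invented here, not an existing limit, and $user is assumed to be the requesting User:

```php
// Hypothetical action name and limits, for illustration only.
// Allow e.g. 20 uncached old-revision parses per 10 minutes, per IP and per user.
$wgRateLimits['renderoldrevision'] = [
	'ip'   => [ 20, 600 ],
	'user' => [ 20, 600 ],
];

// At the point where we are about to parse an uncached old revision:
if ( $user->pingLimiter( 'renderoldrevision' ) ) {
	// Localised message with a 4xx status asking the client to slow down.
	throw new HttpError( 429, wfMessage( 'actionthrottledtext' ) );
}
```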

Fixed-size popularity counter

Another way could be to limit the popularity counter not by time but by quantity. E.g. we could have a single per-wiki Memcached key that contains a list of up to N revision IDs and how often they were accessed. This would effectively enforce LRU on the counter itself, by never tracking the counts for more than a fixed number of old revs. Then, if the rev ID is in the list and above our popularity threshold, we'd store the result and let the counter drop off.
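A minimal sketch of that idea. The key name, sizes and threshold are invented, $revId is the old revision being viewed, and the get/set below is not atomic, so a real version would need merge()/CAS or similar:

```php
$cache = ObjectCache::getLocalClusterInstance();
$key = $cache->makeKey( 'old-rev-popularity' ); // one small key per wiki
$maxTracked = 1000;  // never track more than N old revisions at once
$hotThreshold = 5;   // cache the rendering once a revision has been seen this often

$counts = $cache->get( $key ) ?: [];
$counts[$revId] = ( $counts[$revId] ?? 0 ) + 1;

if ( count( $counts ) > $maxTracked ) {
	// Keep only the most-viewed entries so the structure stays a fixed size.
	asort( $counts );
	$counts = array_slice( $counts, -$maxTracked, null, true );
}
$cache->set( $key, $counts, 300 ); // short TTL, e.g. five minutes

$shouldCacheRendering = ( $counts[$revId] ?? 0 ) >= $hotThreshold;
```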

In this idea, we would not need a rate limit to control the growth of stored renderings of old revs, and might not need a separate ParserCache pool either. The impact would be sufficiently limited that we might be able to just store them in the regular ParserCache cluster. Maybe with a reduced TTL (e.g. 1 day instead of 30 days) to allow the space occupied by old revs to be re-used more quickly.

However, while this might not need a rate limit to control storage growth, we might still want a rate limit here to reduce the workload on the app servers generating these old revision views.

PoolCounter

If we do allow old revs to be stored under some conditions, we'll also need to make sure PoolCounter works here to avoid multiple app servers rendering the same revision once they reach the decision to allow it to be stored. This hopefully just works automatically once we hit that code path, but it'd be worth double-checking that this is indeed the case.
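For reference, PoolCounterWorkViaCallback is the usual way to get that deduplication. The key format and the two helper functions below are hypothetical, and $dbName/$revId/$revision are assumed to be in scope:

```php
$work = new PoolCounterWorkViaCallback(
	'ArticleView', // existing pool type; a dedicated pool could also be configured
	"old-rev-parse:$dbName:$revId", // illustrative key format
	[
		'doWork' => function () use ( $revision ) {
			// Only one server at a time gets here: parse and (if eligible) store.
			return parseAndMaybeCacheOldRevision( $revision ); // hypothetical helper
		},
		'doCachedWork' => function () use ( $revId ) {
			// Another server finished first; reuse whatever it stored.
			return fetchCachedOldRevisionOutput( $revId ); // hypothetical helper
		},
		'error' => function ( $status ) {
			return $status; // e.g. pool full or timeout; let the caller handle it
		},
	]
);
$result = $work->execute();
```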

Krinkle renamed this task from Wiki diffs take over 15s to load to Strategy for storing parser output for "old revision" (Popular diffs and permalinks). Mar 10 2020, 6:17 PM
Krinkle removed Krinkle as the assignee of this task.
Krinkle edited projects, added Wikimedia-Incident; removed Wikimedia-production-error.
Krinkle moved this task from Doing (old) to Radar on the Performance-Team board.
Krinkle edited projects, added Performance-Team (Radar); removed Performance-Team.
Krinkle subscribed.

I just want to point out something here quickly:

The case that caused this bug to be raised was "a humorous diff went viral on social media".

In such a case, a parsercache entry with a TTL of hours or even minutes would have helped here. Would such a TTL ameliorate the unbounded data growth concerns?

To answer that, we would need to know how much storage that would consume if old revisions were stored unconditionally.

Also, note that our parser caches are MySQL-backed, which means "TTL" is only a run-time evaluation. There is no actual eviction based on TTL or storage concerns; there is only a daily cronjob that removes rows older than 30 days. How quickly and how often we want to run that script will influence how much we would actually end up storing.

To know how much that would be, we'd probably want to analyse all web requests to oldid/permalink URLs from logged-in and logged-out users. See how many there are at most within a single day, over say a month. Then multiply that by the average size of a ParserCache entry, and then multiply that by, say, 10?
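As a purely illustrative back-of-the-envelope calculation of that estimate (every number below is a placeholder, not a measurement; real values would come from webrequest analysis):

```php
$oldidViewsPerDay = 2000000;    // hypothetical peak daily oldid/permalink views
$avgEntryBytes    = 100 * 1024; // hypothetical average ParserCache entry size
$retentionFactor  = 10;         // the "multiply by 10" fudge factor above

$estimatedBytes = $oldidViewsPerDay * $avgEntryBytes * $retentionFactor;
echo round( $estimatedBytes / ( 1024 ** 3 ) ) . " GiB\n"; // ~1907 GiB with these placeholders
```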

daniel subscribed.

Taking this off the clinic duty board. This needs system design / strategy. I'm tagging TechCom for attention, and putting this into the inbox of cpt and performance. We need to decide who should come up with a solution, and in what timeframe.

Putting this on the (long-term!) radar of the parsing team. Since we are hoping to see ParserCache/RestBase integration work done in the next ~year by CPT, this would be a good desideratum to add to that CPT design work. Perhaps a similar time-limited cache mechanism could be used for stashing...

If this gets to the point where there's a plan for the system to identify revisions that need caching, feel free to poke it back into the Platform Engineering inbox for further discussion. At the moment it seems someone has to figure out the appropriate heuristics first.

Notes from TechCom meeting:

  • @tstarling mentioned that we can also look into using a better eviction algo in Memcached, something that isn't plain LRU. With that in place, or perhaps even as-is, maybe "just" putting it in Memcached will be fine.
  • I mentioned that we already store various things in Memcached that have a huge number of variants (e.g. by user ID and by page ID), and while that can indeed push stuff out that it shouldn't, in practice it seems to be fine. On the other hand, those values are relatively tiny compared to something like a ParserOutput object.

Another thought from the TechCom meeting: we could just have the CDN cache output for old revisions and diffs for a short time (5 minutes?)

Invalidation and rev_deleted (or regular deletion) get trickier. Something like xkeys would maybe help (as long as only the top X revs are cacheable to avoid bloat in the data structure under the object hash). Without that, it depends on whether 5 min is "good enough" generally.
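For context, on the MediaWiki side that short CDN TTL would amount to roughly the following. $out is assumed to be the relevant OutputPage; the placement and the cacheability checks (logged-in views etc.) are simplified:

```php
$request = $out->getRequest();
$isOldRevOrDiff = $request->getIntOrNull( 'oldid' ) !== null
	|| $request->getVal( 'diff' ) !== null;

if ( $isOldRevOrDiff ) {
	// Let the CDN keep the response for 5 minutes; rev_deleted or page
	// deletion could then be served stale for at most that long, unless
	// xkey-style purging is added on top.
	$out->setCdnMaxage( 300 );
}
```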

  • @tstarling mentioned that we can also look into using a better eviction algo in Memcached, something that isn't plain LRU. With that in place, or perhaps even as-is, maybe "just" putting it in Memcached will be fine.

Not necessarily in memcached. In 2010 I prototyped this idea in a standalone cache server written in C++ called maxcache. It used the Greedy-Dual-Size-Frequency algorithm.

The algorithm requires the client to specify a cost (typically miss service time), which is used to tune eviction, in addition to size and hit frequency.
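For reference, Greedy-Dual-Size-Frequency as usually described assigns each object a priority H = clock + frequency × cost / size and evicts the lowest-priority object, inflating the clock to that value. A toy sketch of that rule (the maxcache prototype may of course differ in detail):

```php
// Priority of a cached object: more hits and higher miss cost raise it, size lowers it.
function gdsfPriority( float $clock, int $hits, float $costMs, int $sizeBytes ): float {
	return $clock + $hits * $costMs / $sizeBytes;
}

// Evict the lowest-priority entry and advance the clock to its priority,
// so objects that stop being hit eventually age out even if once expensive.
function gdsfEvictOne( array &$priorities, float &$clock ): ?string {
	if ( !$priorities ) {
		return null;
	}
	asort( $priorities );                     // key => priority, ascending
	$victim = array_key_first( $priorities );
	$clock = $priorities[$victim];
	unset( $priorities[$victim] );
	return $victim;
}
```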

From the TechCom meeting: we could just put the parser output into memcached with a short expiry time.

Assigning to Holger, per yesterday's meeting.

daniel added a project: Platform Engineering.
daniel removed a project: Platform Engineering.
daniel added a subscriber: holger.knust.

Looking into this some more, we came across a number of issues, namely:

  • Diffs and permalinks don't share a code path for getting the ParserOutput for the old revision. DifferenceEngine::getParserOutput uses WikiPage::getParserOutput, but Article::view() does its own thing.
  • Sharing code between page views and diff views would be nice. Do we want a PageOutputRender service?
  • We would probably want to apply short term caching for old revisions pretty much in the same places we are applying long-term caching for the current revision, via ParserCache.
  • We would need to split the short term cache in the same way we split the ParserCache. So should this logic be *in* ParserCache? Or do we pull the key generation logic out of ParserCache?
  • The ParserCache gets written by PoolWorkArticleView, but it gets read by different bits of code. This asymmetry makes it hard to add to the logic.
  • Should the cache be shared across data centers, like the diff cache?

With T263583: Introduce a ParserCacheFactory coming up, perhaps we should use a special ParserCache instance for old revisions, configured with a low TTL and a cluster-local cache. That seems simple enough. The tricky bit is the "should we use the parser cache" logic, which needs to become "should we use the parser cache, and if yes, which one". We have this logic in at least two places, probably more. It would be nice to better encapsulate this.
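A rough sketch of what that decision could look like once a factory exists; the method name, instance names and helper below are hypothetical, not the actual T263583 API:

```php
// Hypothetical: pick a cache based on whether the revision is current.
function parserCacheFor( RevisionRecord $rev, ParserCacheFactory $factory ): ParserCache {
	if ( $rev->isCurrent() ) {
		return $factory->getParserCache( 'main' ); // long TTL, replicated as today
	}
	// Old revisions: separate instance with a short TTL and a cluster-local
	// backend, so runaway crawling can at worst fill this pool, not the main one.
	return $factory->getParserCache( 'old-revisions' );
}
```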

Change 640927 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Clean up PoolWorkArticleView

https://gerrit.wikimedia.org/r/640927

Change 640927 merged by jenkins-bot:
[mediawiki/core@master] Clean up PoolWorkArticleView

https://gerrit.wikimedia.org/r/640927

Pchelolo claimed this task.
Pchelolo subscribed.

Now old revision views are cached in WANObjectCache for 1 hour. Please reopen if needed.

Thanks for working on this. Let's monitor parsercache usage and check if we need to increase purging on the disk-based parsercache component as a consequence of low-TTL entries that otherwise wouldn't be deleted until UTC night (probably not needed, but worth mentioning it, or maybe not necessary if it only hits memcache, haven't looked at details?).

Thanks again for working on this!

maybe not necessary if it only hits memcache, haven't looked at details?

The old revisions cache is backed by WANObjectCache (memcached/mcrouter only) with no persistent store component. We figured that we don't need a MySQL component for this cache.
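In outline, the final shape is roughly the following (key name simplified, $parserOptionsHash stands in for the real cache-key parts, and parseOldRevision() is a hypothetical stand-in for the actual render path):

```php
$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();

$parserOutput = $cache->getWithSetCallback(
	$cache->makeKey( 'old-revision-output', $revId, $parserOptionsHash ),
	$cache::TTL_HOUR, // one hour, no SQL-backed tier involved
	function () use ( $revision ) {
		// Only runs on a cache miss: do the expensive parse once per hour.
		return parseOldRevision( $revision ); // hypothetical helper
	}
);
```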

Ah, cool then. One less thing to worry about, but I was going to offer to look at it :-).