
Strategy for storing parser output for "old revision" (Popular diffs and permalinks)
Closed, Resolved · Public

Description

Page https://en.wikipedia.org/w/index.php?title=European_Union&type=revision&diff=938561921&oldid=938557616 became quite popular due to a Reddit comment, and we have noticed that it takes about 15s to render.

Profiling data: https://phabricator.wikimedia.org/P10297

Both are running 1.35.0-wmf.16 at the moment.

Event Timeline


Links to old (non-current) versions do not use the parser cache. This means that rendering will always require a full parse.

[CUT]

Some sort of parser caching could be considered for old links that get high traffic. We do not want to waste space on the pcXXXX MariaDB servers, nor flood memcached's LRU with large blobs, so there would have to be some "hotness" estimation logic, like a hit counter.

Instead of caching, we should just rate-limit parsing of old revisions to N concurrent revisions per user or IP, probably via poolcounter.

Note that even old version parses for such page views go through PoolWorkArticleView. They just don't use memcached.

What about tuning HTTP frontend caching? The latest revision is very dynamic, but a hardcoded diff could perhaps return stale results more often, as I can only see it changing on deletion or page move?

Note that even old version parses for such page views go through PoolWorkArticleView. They just don't use memcached.

Sure, they do, but IIRC that limits the number of times we do parse that revision of that article in concurrency. What I am proposing is limiting the number of re-parses any single user is allowed to perform too, in order to rate-limit scrapers that generate expensive operations.

aaron removed aaron as the assignee of this task. Feb 10 2020, 9:28 PM
aaron added a project: Platform Engineering.
aaron moved this task from Doing (old) to Radar on the Performance-Team board.
aaron edited projects, added Performance-Team (Radar); removed Performance-Team.
aaron subscribed.
Anomie subscribed.

Sure, they do, but IIRC that limits the number of times we do parse that revision of that article in concurrency. What I am proposing is limiting the number of re-parses any single user is allowed to perform too, in order to rate-limit scrapers that generate expensive operations.

While that seems a valid request in general, I don't think it would have any effect on this task. This is about many users/IPs hitting something, not just one.

Correct - we have two issues we've seen and I coalesced the two. We've had scrapers requesting a lot of diffs all at once.

I've gone through dozens of very old diffs from unpopular pages (hoping for a cache miss). From what I can tell, the slowdown is mainly from the page rendering of the old revision below the diff. The majority of time is not spent in diff logic, and indeed I was unable to reproduce a timeout on any additional obscure diffs I viewed once I started using diffonly=1 (also available in the user preferences).

This narrows the issue down to wikitext parsing of non-current revisions, which affects any permalink (e.g. with an oldid query) regardless of whether it has a diff on top.

As an example:

Some further brainstorming:

ParserCache?
  • Do we allow old revisions to be stored in ParserCache?

Allowing this in general might be risky, as it could grow uncontrollably, e.g. when every revision ever is being crawled (afaik we currently don't have LRU in ParserCache; we only evict stuff based on a 30-day TTL).

Memcache? Separate ParserCache?
  • If not storing in ParserCache generally, then when, how and where?

Memcached doesn't seem suitable for storing thousands/millions of large multi-megabyte blobs that have a high chance of never being used again. It would fill up the backends and evict more important stuff.

Perhaps an isolated "mini" ParserCache cluster could be used for this purpose, keeping it separate from the main cluster. Worst case, it becomes full until entries expire, at which point additional old revisions can't be cached for a while.

Popularity counter and per-user rate limit

Perhaps cache them only after a certain popularity threshold. If so, that would likely require storing counters somewhere, which would surface the same issue as before, but on a smaller scale and with shorter TTLs. Even if it's just a small integer, allowing users to trigger the creation of such a counter for every old revision ever, of every page and wiki, could still cause major churn on Memcached.

Maybe it can be contained if the TTL is very low, e.g. a few minutes at most. This would likely need to go hand-in-hand with a new $wgRateLimits entry that controls how many uncached old revisions can be generated by a given user/IP. We'd fall back to a localised error message under a 4xx error, asking the user to slow down and try again later.
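For illustration, a minimal sketch of what such a limit could look like with the existing $wgRateLimits mechanism. The 'renderoldrevision' action name and the numbers are invented here, not an existing limit, and $user is assumed to be the requesting User:

```php
// Hypothetical action name and limits, for illustration only.
// Allow e.g. 20 uncached old-revision parses per 10 minutes, per IP and per user.
$wgRateLimits['renderoldrevision'] = [
	'ip'   => [ 20, 600 ],
	'user' => [ 20, 600 ],
];

// At the point where we are about to parse an uncached old revision:
if ( $user->pingLimiter( 'renderoldrevision' ) ) {
	// Localised message with a 4xx status asking the client to slow down.
	throw new HttpError( 429, wfMessage( 'actionthrottledtext' ) );
}
```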

Fixed-size popularity counter

Another way could be to limit the popularity counter not by time but by quantity. E.g. we could have a single per-wiki Memcached key that contains a list of up to N revision IDs and how often they were accessed. This would effectively enforce LRU on the counter itself, by never tracking the counts for more than a fixed number of old revs. Then, if the rev ID is in the list and above our popularity threshold, we'd store the result and let the counter drop off.
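A minimal sketch of that idea. The key name, sizes and threshold are invented, $revId is the old revision being viewed, and the get/set below is not atomic, so a real version would need merge()/CAS or similar:

```php
$cache = ObjectCache::getLocalClusterInstance();
$key = $cache->makeKey( 'old-rev-popularity' ); // one small key per wiki
$maxTracked = 1000;  // never track more than N old revisions at once
$hotThreshold = 5;   // cache the rendering once a revision has been seen this often

$counts = $cache->get( $key ) ?: [];
$counts[$revId] = ( $counts[$revId] ?? 0 ) + 1;

if ( count( $counts ) > $maxTracked ) {
	// Keep only the most-viewed entries so the structure stays a fixed size.
	asort( $counts );
	$counts = array_slice( $counts, -$maxTracked, null, true );
}
$cache->set( $key, $counts, 300 ); // short TTL, e.g. five minutes

$shouldCacheRendering = ( $counts[$revId] ?? 0 ) >= $hotThreshold;
```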

In this idea, we would not need a rate limit to control the growth of stored renderings of old revs, and might not need a separate ParserCache pool either. The impact would be sufficiently limited that we might be able to just store them in the regular ParserCache cluster. Maybe with a reduced TTL (e.g. 1 day instead of 30 days) to allow the space occupied by old revs to be re-used more quickly.

However, while this might not need a rate limit to control storage growth, we might still want a rate limit here to reduce the workload on the app servers generating these old revision views.

PoolCounter

If we do allow old revs to be stored under some conditions, we'll also need to make sure PoolCounter works here to avoid multiple app servers rendering the same revision once they reach the decision to allow it to be stored. This hopefully just works automatically once we hit that code path, but it'd be worth double-checking that this is indeed the case.
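For reference, PoolCounterWorkViaCallback is the usual way to get that deduplication. The key format and the two helper functions below are hypothetical, and $dbName/$revId/$revision are assumed to be in scope:

```php
$work = new PoolCounterWorkViaCallback(
	'ArticleView', // existing pool type; a dedicated pool could also be configured
	"old-rev-parse:$dbName:$revId", // illustrative key format
	[
		'doWork' => function () use ( $revision ) {
			// Only one server at a time gets here: parse and (if eligible) store.
			return parseAndMaybeCacheOldRevision( $revision ); // hypothetical helper
		},
		'doCachedWork' => function () use ( $revId ) {
			// Another server finished first; reuse whatever it stored.
			return fetchCachedOldRevisionOutput( $revId ); // hypothetical helper
		},
		'error' => function ( $status ) {
			return $status; // e.g. pool full or timeout; let the caller handle it
		},
	]
);
$result = $work->execute();
```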

Krinkle renamed this task from Wiki diffs take over 15s to load to Strategy for storing parser output for "old revision" (Popular diffs and permalinks). Mar 10 2020, 6:17 PM
Krinkle removed Krinkle as the assignee of this task.
Krinkle edited projects, added Wikimedia-Incident; removed Wikimedia-production-error.
Krinkle moved this task from Doing (old) to Radar on the Performance-Team board.
Krinkle edited projects, added Performance-Team (Radar); removed Performance-Team.
Krinkle subscribed.

I just want to point out something here quickly:

The case that caused this bug to be raised was "a humorous diff went viral on social media".

In such a case, a parsercache entry with a TTL of hours or even minutes would have helped here. Would such a TTL ameliorate the unbounded data growth concerns?

To answer that, we would need to know how much storage that would consume if old revisions were stored unconditionally.

Also, note that our parser caches are MySQL-backed, which means "TTL" is only a run-time evaluation. There is no actual eviction based on TTL or storage concerns; there is only a daily cronjob that removes rows older than 30 days. How quickly and how often we want to run that script will influence how much we would actually end up storing.

To know how much that would be, we'd probably want to analyse all web requests to oldid/permalink URLs from logged-in and logged-out users. See how many there are at most within a single day, over say a month. Then multiply that by the average size of a ParserCache entry, and then multiply that by, say, 10?
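As a purely illustrative back-of-the-envelope calculation of that estimate (every number below is a placeholder, not a measurement; real values would come from webrequest analysis):

```php
$oldidViewsPerDay = 2000000;    // hypothetical peak daily oldid/permalink views
$avgEntryBytes    = 100 * 1024; // hypothetical average ParserCache entry size
$retentionFactor  = 10;         // the "multiply by 10" fudge factor above

$estimatedBytes = $oldidViewsPerDay * $avgEntryBytes * $retentionFactor;
echo round( $estimatedBytes / ( 1024 ** 3 ) ) . " GiB\n"; // ~1907 GiB with these placeholders
```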

daniel subscribed.

Taking this off the clinic duty board. This needs system design / strategy. I'm tagging TechCom for attention, and putting this into the inbox of cpt and performance. We need to decide who should come up with a solution, and in what timeframe.

Putting this on the (long-term!) radar of the parsing team. Since we are hoping to see ParserCache/RestBase integration work done in the next ~year by CPT, this would be a good desideratum to add to that CPT design work. Perhaps a similar time-limited cache mechanism could be used for stashing...

If this gets to the point where there's a plan for the system to identify revisions that need caching, feel free to poke it back into the Platform Engineering inbox for further discussion. At the moment it seems someone has to figure out the appropriate heuristics first.

Notes from TechCom meeting:

  • @tstarling mentioned that we can also look into using a better eviction algo in Memcached, something that isn't plain LRU. With that in place, or perhaps even as-is, maybe "just" putting it in Memcached will be fine.
  • I mentioned that we already store various things in Memcached that have a huge number of variants (e.g. by user ID and by page ID), and while that can indeed push stuff out that it shouldn't, in practice it seems to be fine. On the other hand, those values are relatively tiny compared to something like a ParserOutput object.

Another thought from the TechCom meeting: we could just have the CDN cache output for old revisions and diffs for a short time (5 minutes?)

Invalidation and rev_deleted (or regular deletion) get trickier. Something like xkeys would maybe help (as long as only the top X revs are cacheable to avoid bloat in the data structure under the object hash). Without that, it depends on whether 5 min is "good enough" generally.
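For context, on the MediaWiki side that short CDN TTL would amount to roughly the following. $out is assumed to be the relevant OutputPage; the placement and the cacheability checks (logged-in views etc.) are simplified:

```php
$request = $out->getRequest();
$isOldRevOrDiff = $request->getIntOrNull( 'oldid' ) !== null
	|| $request->getVal( 'diff' ) !== null;

if ( $isOldRevOrDiff ) {
	// Let the CDN keep the response for 5 minutes; rev_deleted or page
	// deletion could then be served stale for at most that long, unless
	// xkey-style purging is added on top.
	$out->setCdnMaxage( 300 );
}
```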

  • @tstarling mentioned that we can also look into using a better eviction algo in Memcached, something that isn't plain LRU. With that in place, or perhaps even as-is, maybe "just" putting it in Memcached will be fine.

Not necessarily in memcached. In 2010 I prototyped this idea in a standalone cache server written in C++ called maxcache. It used the Greedy-Dual-Size-Frequency algorithm.

The algorithm requires the client to specify a cost (typically miss service time), which is used to tune eviction, in addition to size and hit frequency.
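For reference, Greedy-Dual-Size-Frequency as usually described assigns each object a priority H = clock + frequency × cost / size and evicts the lowest-priority object, inflating the clock to that value. A toy sketch of that rule (the maxcache prototype may of course differ in detail):

```php
// Priority of a cached object: more hits and higher miss cost raise it, size lowers it.
function gdsfPriority( float $clock, int $hits, float $costMs, int $sizeBytes ): float {
	return $clock + $hits * $costMs / $sizeBytes;
}

// Evict the lowest-priority entry and advance the clock to its priority,
// so objects that stop being hit eventually age out even if once expensive.
function gdsfEvictOne( array &$priorities, float &$clock ): ?string {
	if ( !$priorities ) {
		return null;
	}
	asort( $priorities );                     // key => priority, ascending
	$victim = array_key_first( $priorities );
	$clock = $priorities[$victim];
	unset( $priorities[$victim] );
	return $victim;
}
```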

From the TechCom meeting: we could just put the parser output into memcached with a short expiry time.

Assigning to Holger, per yesterday's meeting.

daniel added a project: Platform Engineering.
daniel removed a project: Platform Engineering.
daniel added a subscriber: holger.knust.

Looking into this some more, we came across a number of issues, namely:

  • Diffs and permalinks don't share a code path for getting the ParserOutput for the old revision. DifferenceEngine::getParserOutput uses WikiPage::getParserOutput, but Article::view() does its own thing.
  • Sharing code between page views and diff views would be nice. Do we want a PageOutputRender service?
  • We would probably want to apply short term caching for old revisions pretty much in the same places we are applying long-term caching for the current revision, via ParserCache.
  • We would need to split the short term cache in the same way we split the ParserCache. So should this logic be *in* ParserCache? Or do we pull the key generation logic out of ParserCache?
  • The ParserCache gets written by PoolWorkArticleView, but it gets read by different bits of code. This asymmetry makes it hard to add to the logic.
  • Should the cache be shared across data centers, like the diff cache?

With T263583: Introduce a ParserCacheFactory coming up, perhaps we should use a special ParserCache instance for old revisions, configured with a low TTL and a cluster-local cache. That seems simple enough. The tricky bit is the "should we use the parser cache" logic, which needs to become "should we use the parser cache, and if yes, which one". We have this logic in at least two places, probably more. It would be nice to better encapsulate this.
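A rough sketch of what that decision could look like once a factory exists; the method name, instance names and helper below are hypothetical, not the actual T263583 API:

```php
// Hypothetical: pick a cache based on whether the revision is current.
function parserCacheFor( RevisionRecord $rev, ParserCacheFactory $factory ): ParserCache {
	if ( $rev->isCurrent() ) {
		return $factory->getParserCache( 'main' ); // long TTL, replicated as today
	}
	// Old revisions: separate instance with a short TTL and a cluster-local
	// backend, so runaway crawling can at worst fill this pool, not the main one.
	return $factory->getParserCache( 'old-revisions' );
}
```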

Change 640927 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Clean up PoolWorkArticleView

https://gerrit.wikimedia.org/r/640927

Change 640927 merged by jenkins-bot:
[mediawiki/core@master] Clean up PoolWorkArticleView

https://gerrit.wikimedia.org/r/640927

Pchelolo claimed this task.
Pchelolo subscribed.

Now old revision views are cached in WANObjectCache for 1 hour. Please reopen if needed.

Thanks for working on this. Let's monitor parsercache usage and check if we need to increase purging on the disk-based parsercache component as a consequence of low-TTL entries that otherwise wouldn't be deleted until UTC night (probably not needed, but worth mentioning it, or maybe not necessary if it only hits memcache, haven't looked at details?).

Thanks again for working on this!

maybe not necessary if it only hits memcache, haven't looked at details?

The old revisions cache is backed by WANObjectCache (memcached/mcrouter only) with no persistent store component. We figured that we don't need a MySQL component for this cache.
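In outline, the final shape is roughly the following (key name simplified, $parserOptionsHash stands in for the real cache-key parts, and parseOldRevision() is a hypothetical stand-in for the actual render path):

```php
$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();

$parserOutput = $cache->getWithSetCallback(
	$cache->makeKey( 'old-revision-output', $revId, $parserOptionsHash ),
	$cache::TTL_HOUR, // one hour, no SQL-backed tier involved
	function () use ( $revision ) {
		// Only runs on a cache miss: do the expensive parse once per hour.
		return parseOldRevision( $revision ); // hypothetical helper
	}
);
```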

Ah, cool then. One less thing to worry about, but I was going to offer to look at it :-).