Remove long term caching and active purging for Parsoid endpoints in RESTBase
Closed, Resolved, Public

Description

As part of T334238 we want to move traffic from the Parsoid endpoints to core endpoints. This would be a lot easier if we didn't need to implement active purging for the new URLs. It looks like we don't have to, and we can get away with a low TTL (a couple of minutes) instead. This is based on the observation that the cache hit rate for endpoints under /api/rest_v1/page/html is relatively low (around 20% per https://w.wiki/AALE).

This would allow us to greatly reduce the complexity of the system architecture. We can turn off storage and pre-generation of Parsoid output in RESTBase (pending T330036), and we can stop emitting purge events from RESTBase. RESTBase can become a simple proxy for Parsoid.

Perhaps we don't even have to make any changes to Parsoid in RESTBase: we could just re-route /api/rest_v1/page/html and friends directly to the Parsoid endpoints in MW.
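
To make the re-routing idea concrete, here is a minimal sketch of the path mapping. It is illustrative only: the task does not spell out the target paths, so the `/w/rest.php/v1/...` templates below are an assumption, and the `reroute` helper is hypothetical. The real routing would live in the edge or RESTBase configuration, not in Python.

```python
from typing import Optional

LEGACY_PREFIX = "/api/rest_v1/page/html/"

# ASSUMPTION: target layouts for the MediaWiki-side endpoints.
# The task itself does not name the new paths.
CORE_PAGE_HTML = "/w/rest.php/v1/page/{title}/html"
CORE_REVISION_HTML = "/w/rest.php/v1/revision/{rev}/html"


def reroute(path: str) -> Optional[str]:
    """Map a legacy RESTBase HTML path to an assumed MW REST path.

    Returns None for paths this sketch does not handle.
    """
    if not path.startswith(LEGACY_PREFIX):
        return None
    parts = path[len(LEGACY_PREFIX):].split("/")
    # Expect "<title>" or "<title>/<revision>"; the title stays
    # percent-encoded exactly as it arrived.
    if not parts[0] or len(parts) > 2:
        return None
    if len(parts) == 2 and parts[1]:
        if not parts[1].isdigit():
            return None
        return CORE_REVISION_HTML.format(rev=parts[1])
    return CORE_PAGE_HTML.format(title=parts[0])


if __name__ == "__main__":
    print(reroute("/api/rest_v1/page/html/Earth"))
    print(reroute("/api/rest_v1/page/html/Earth/12345"))
```

The "and friends" endpoints would need their own mappings, which this sketch leaves out.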

Event Timeline

I'm a little leery of dropping the TTL really-short. I get the argument for the normal case, but we also have to consider the possibility that something out there on the Internet could cause traffic surges to some of these URLs and we'd lose some amount of caching defenses against it with a short TTL (esp if we're also no longer pregenerating them, making such traffic more-expensive on the inside). Re-routing sounds better? Or perhaps even-better would be a full-on redirect to the new parsoid URL paths?

Re-routing sounds better? Or perhaps even-better would be a full-on redirect to the new parsoid URL paths?

The idea is that we don't implement long term caching with active purging for the new endpoints at all. If we don't need it for the old endpoints, we don't need it for the new endpoints. This would make the architecture and the migration a whole lot simpler.

It seems to me like a couple of minutes would be good enough at least for organic spikes. Do you think the edge cache is effective protection against DDoS?
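
A rough back-of-the-envelope model of why a short TTL may be enough for an organic spike on a single URL. This is a sketch under stated assumptions, not a measurement: it assumes each cache node coalesces concurrent misses and refetches a hot URL at most once per TTL, and all the numbers (node count, request rate, TTL) are hypothetical.

```python
def origin_fetch_rate(cache_nodes: int, ttl_seconds: float) -> float:
    """Upper bound on origin fetches per second for ONE hot URL.

    ASSUMPTION: every cache node coalesces concurrent misses, so it
    fetches the URL from the origin at most once per TTL.
    """
    if cache_nodes <= 0 or ttl_seconds <= 0:
        raise ValueError("cache_nodes and ttl_seconds must be positive")
    return cache_nodes / ttl_seconds


def spike_hit_rate(requests_per_second: float, cache_nodes: int,
                   ttl_seconds: float) -> float:
    """Approximate edge hit rate for one hot URL over a single TTL window."""
    total = requests_per_second * ttl_seconds
    if total <= 0:
        raise ValueError("no requests in the window")
    # At most one miss per node per TTL window under the assumption above.
    misses = min(total, cache_nodes)
    return 1 - misses / total


if __name__ == "__main__":
    # Hypothetical numbers: 20 cache nodes, 120 s TTL, 5000 req/s spike.
    print(origin_fetch_rate(20, 120))
    print(spike_hit_rate(5000, 20, 120))
```

Under this model the origin load for one hot URL depends only on the node count and the TTL, not on the size of the spike. It does not cover a surge spread across many distinct URLs, where a short TTL helps much less.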

Krinkle renamed this task from Remove long term caching and active purging for Parsoid endpoints in RESTbase to Remove long term caching and active purging for Parsoid endpoints in RESTBase. May 22 2024, 5:33 PM
Krinkle updated the task description.

I think I'm lost in some confusion here, as I initially thought this was only about cutting TTLs/purges and/or redirecting for the legacy restbase URI paths, not the new parsoid ones.

Either way, in general: yes, we prefer long-caching if we can get away with it. Our default is to try for 24h when applayer TTL allows for it. All kinds of things happen on the internet (not just DDoS, but yes, it helps with some forms of those), and there's operational concerns as well (e.g. how long we can keep a node or site offline for maintenance or emergencies and not come back to an empty cache). If there's a short-TTL (or uncacheable) dynamic result for anon traffic that's expensive to serve, eventually the Internet will find a way to make it a problem for us.

I think I'm lost in some confusion here, as I initially thought this was only about cutting TTLs/purges and/or redirecting for the legacy restbase URI paths, not the new parsoid ones.

This ticket is just about the old endpoints, yes. But if that approach is successful for the old endpoints, we will also use it for the new endpoints. The point is to try it out in a way we can easily revert.

Either way, in general: yes, we prefer long-caching if we can get away with it. Our default is to try for 24h when applayer TTL allows for it. All kinds of things happen on the internet (not just DDoS, but yes, it helps with some forms of those), and there's operational concerns as well (e.g. how long we can keep a node or site offline for maintenance or emergencies and not come back to an empty cache). If there's a short-TTL (or uncacheable) dynamic result for anon traffic that's expensive to serve, eventually the Internet will find a way to make it a problem for us.

Yes, I can see that problem, but the alternative requires active purging, which itself can pose performance challenges and can act as a DoS vector; see T353876.

My impression is that the approach we have been taking so far for HTML served from the REST API, namely long TTL and active purging, has caused more trouble than good. I would like to experiment with going the other way. If we find that we need to go back, we always can. But it seems like a good idea to try the cheap and simple alternative before building the complex and expensive one.

Do we have a number for how long it is acceptable for vandalism to remain visible? In my mind, 5 minutes sounds like the max. Considering multiple layers of caching, that probably means max-age=60 or so.
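
The arithmetic behind that estimate can be sketched as follows. This is a pessimistic model: it assumes every cache layer may hold a response for its full TTL after fetching it from the layer below, so the delays add up. Caches that honor the upstream Age header would do better. The layer count of five is hypothetical, picked only to show how a 5-minute budget could lead to a figure like max-age=60.

```python
from typing import Sequence


def worst_case_staleness(layer_ttls: Sequence[int]) -> int:
    """Worst-case seconds before an edit is visible through all layers.

    ASSUMPTION: each layer restarts the clock when it fetches from the
    layer below, so the per-layer TTLs simply add up.
    """
    return sum(layer_ttls)


def max_age_for_budget(budget_seconds: int, layers: int) -> int:
    """Largest uniform per-layer TTL that keeps staleness within budget."""
    if layers <= 0:
        raise ValueError("layers must be positive")
    return budget_seconds // layers


if __name__ == "__main__":
    budget = 5 * 60   # 5 minutes, the maximum suggested above
    layers = 5        # hypothetical number of caching layers
    ttl = max_age_for_budget(budget, layers)
    print(ttl, worst_case_staleness([ttl] * layers))
```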

Relaying what I said in a meeting: given the caching numbers, I think it makes sense to remove long-term caching at the edge and just keep a shortish-TTL one.

In terms of the transition, I'd do something like:

  • reduce the TTL
  • wait for the old TTL to expire on the urls we already have in storage
  • remove explicit url purging for restbase urls
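
The timing of the steps above can be sketched like this. The point is that purging can only stop once every object cached under the old TTL has aged out. The date and the 24-hour old TTL are hypothetical placeholders, not values taken from the RESTBase configuration.

```python
from datetime import datetime, timedelta, timezone


def earliest_purge_removal(ttl_reduced_at: datetime,
                           old_ttl: timedelta) -> datetime:
    """Earliest time at which explicit purging can safely be dropped.

    Anything cached just before the TTL was reduced may live for the
    full old TTL, so purges must keep flowing until then.
    """
    return ttl_reduced_at + old_ttl


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    reduced_at = datetime(2024, 6, 13, 12, 0, tzinfo=timezone.utc)
    old_ttl = timedelta(hours=24)
    print(earliest_purge_removal(reduced_at, old_ttl).isoformat())
```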

Change #1042361 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@master] REST: Document currents cache duration for page content.

https://gerrit.wikimedia.org/r/1042361

FJoseph-WMF changed the task status from Open to In Progress. Jun 13 2024, 3:35 PM

Change #1042361 merged by jenkins-bot:

[mediawiki/core@master] REST: Document currents cache duration for page content.

https://gerrit.wikimedia.org/r/1042361

BPirkle claimed this task.
BPirkle moved this task from In Progress to Done on the MW-Interfaces-Team board.