Page MenuHomePhabricator

Remove lag protection from WANObjectCache
Open, Needs TriagePublic

Description

The WANObjectCache is mammoth of a class with complexity so high that barely anyone can understand and/or contribute to. It is also extremely critical so bugs can lead to major issues.

A big portion of this complexity is keeping track of replication lag and reducing TTL of keys if the lag is too high. I believe this should be removed because:

  • High replag used to be a much bigger problem when databases were using HDDs but since now we use SSD, the replag has been almost all the time around 0.2s or below.
  • MediaWiki automatically avoids reading from replicas that have high lag and practically depools them.
  • Due to above, the only case these code paths get triggered are when all replicas are lagged in which it's an incident and we have bigger problems than some stale data.
    • During the incident, it could make things worse by reducing TTL and forcing more db reads.
  • It makes WAN code unpredictable and harder to test under such conditions.
  • There has been always the assumption that data in WAN is stale, at least for a bit. So I'm failing to see this fixing an actual problem that is hurting users or integrity of the data.
  • It couples the cache infrastructure and the database infrastructure breaking encapsulation on two pretty large and complex components of mediawiki and the infrastructure.

It is such a big part of the class that I think I have to do this gradually in multiple patches. Every time I started the clean up, it sprawled out of control.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Change #1278474 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[mediawiki/core@master] WANObjectCache: First round of removal of lag protection logic

https://gerrit.wikimedia.org/r/1278474

Due to above, the only case these code paths get triggered are when all replicas are lagged in which it's an incident and we have bigger problems than some stale data.

According to T113204, the feature was introduced as an incident mitigation measure. The concern was that partition of the two DCs would lead to stale values being cached for weeks, prolonging the visible effects of a network outage. It was just theoretical, it wasn't a response to an actual incident. It was introduced years before we actually deployed multi-DC.

There has been always the assumption that data in WAN is stale, at least for a bit.

The purpose of the code is to avoid stale data persisting in the cache for more than a bit. It aims to limit the durability of stale data.

Yes it's complexity, and it leaks out of the interface, it's not just internal complexity. I can see why you want to get rid of it. But ideally you should understand what it's for before you delete it. Maybe we could have some alternative plan for dealing with the scenario described in T113204, like a rolling restart of memcached servers.

High replag used to be a much bigger problem when databases were using HDDs but since now we use SSD, the replag has been almost all the time around 0.2s or below

Yes, this does seem better these days. The rolling restart idea could be used for incidents.

It couples the cache infrastructure and the database infrastructure breaking encapsulation on two pretty large and complex components of mediawiki and the infrastructure.

Part of Database::getCacheSetOptions() is to minimize the chance saving uncommitted info that no other session should see and that may end up getting rolled back anyway. If regeneration callbacks are strict about using DB_REPLICA, it should rarely happen on production (given zero load for the primaries). For dev/vanilla installs with one DB server, the closest thing would be using a different connection for DB_PRIMARY/DB_REPLICA to the same mysql/postgres server (that could not even be done for sqlite). It's tricky since cache asset callback code doesn't "own" the current transaction (if any) and DB_REPLICA actually uses the primary. I've tried to discourage anything using "rollback" for "business as usual" errors like permission checks, but they will still happen, and lots of stuff uses the WANCache.

Regarding coupling..right now, WANCache does not strictly depend on Rdbms but rather it just wants something that meets the 'lag'/'since'/'pending' criteria from the WANCache documentation blocks. It could be moved to use an interface. It could be redone as an injected object, of a certain interface, that checks the tracked primary DB connections for this info and gets called before invoking regeneration callbacks (so they don't have to get the info themselves). There are ways to deal with the programmatic coupling. However, I assume that the more underlying conceptual and infrastructure coupling is more of focus here. Removing replication lag detection would largely deal with that. The "pending" aspects is more of a conceptual hard coupling that's to avoid given the nature of transactions.

Due to above, the only case these code paths get triggered are when all replicas are lagged in which it's an incident and we have bigger problems than some stale data.

According to T113204, the feature was introduced as an incident mitigation measure. The concern was that partition of the two DCs would lead to stale values being cached for weeks, prolonging the visible effects of a network outage. It was just theoretical, it wasn't a response to an actual incident. It was introduced years before we actually deployed multi-DC.

There has been always the assumption that data in WAN is stale, at least for a bit.

The purpose of the code is to avoid stale data persisting in the cache for more than a bit. It aims to limit the durability of stale data.

Yes it's complexity, and it leaks out of the interface, it's not just internal complexity. I can see why you want to get rid of it. But ideally you should understand what it's for before you delete it. Maybe we could have some alternative plan for dealing with the scenario described in T113204, like a rolling restart of memcached servers.

Thanks for looking it up. I missed that. I can talk to serviceops to make sure that would be the case. I'd say even if we change our minds in the future and try to use something different. We can re-add it with a much simpler logic.

High replag used to be a much bigger problem when databases were using HDDs but since now we use SSD, the replag has been almost all the time around 0.2s or below

Yes, this does seem better these days. The rolling restart idea could be used for incidents.

It couples the cache infrastructure and the database infrastructure breaking encapsulation on two pretty large and complex components of mediawiki and the infrastructure.

Part of Database::getCacheSetOptions() is to minimize the chance saving uncommitted info that no other session should see and that may end up getting rolled back anyway. If regeneration callbacks are strict about using DB_REPLICA, it should rarely happen on production (given zero load for the primaries). For dev/vanilla installs with one DB server, the closest thing would be using a different connection for DB_PRIMARY/DB_REPLICA to the same mysql/postgres server (that could not even be done for sqlite). It's tricky since cache asset callback code doesn't "own" the current transaction (if any) and DB_REPLICA actually uses the primary. I've tried to discourage anything using "rollback" for "business as usual" errors like permission checks, but they will still happen, and lots of stuff uses the WANCache.

I understand but is it something that actually happens? For example, did we have a case of a user staying admin longer because the regeneration logic read from a replica and now we are stuck with it for weeks. Second is that we are still at the mercy of devs to know this and pass the setOptions which actually many cases (even involving reading a replica) doesn't happen and so if it's a safety mechanism, it's not really doing what is supposed to protect us from.

The context for adding serviceops: We want to know if that'd be okay for serviceops team to be able to run a rolling reboot (or purge) of memcached hosts in rare cases that dcs split for extended period of time and we might end up with stale cached data on mecached. If that's something serviceops can do on such rare occasions, I think it's prudent to simplify the logic in WAN code.

The context for adding serviceops: We want to know if that'd be okay for serviceops team to be able to run a rolling reboot (or purge) of memcached hosts in rare cases that dcs split for extended period of time and we might end up with stale cached data on mecached. If that's something serviceops can do on such rare occasions, I think it's prudent to simplify the logic in WAN code.

The answer is, we have some questions before committing to this.

If my understanding is correct (and please correct me if not), we are removing complex code (which is great), and replacing it with a "please restart 18 (x2?) memcached servers as soon as possible". It is rare, sure, but:

  • how do we detect that state?
    • if the answer is "a human", then we are replacing code with not one, but two humans
  • who makes the call to trigger it?
  • how soon should this happen after it is detected?
  • how fast should this happen?
  • if this is something we'd expect the person on-call to perform?

The context for adding serviceops: We want to know if that'd be okay for serviceops team to be able to run a rolling reboot (or purge) of memcached hosts in rare cases that dcs split for extended period of time and we might end up with stale cached data on mecached. If that's something serviceops can do on such rare occasions, I think it's prudent to simplify the logic in WAN code.

The answer is, we have some questions before committing to this.

Sounds good. If you want to, the original ticket has a lot more context of why it was implemented: T113204: Smart caching logic for handling cross-DC network outages

If my understanding is correct (and please correct me if not), we are removing complex code (which is great), and replacing it with a "please restart 18 (x2?) memcached servers as soon as possible".

Only one dc would be needed. Not both.

It is rare,

To my understanding brain split of dcs happened only once or twice. So it's quite rare.

sure, but:

  • how do we detect that state?
    • if the answer is "a human", then we are replacing code with not one, but two humans

The case we are talking about is that when DCs split and we have quite large replication lag in the secondary dc. I'm sure our alerting would let us know that we have this and work would be needed to re-connect DCs back. It's one of those situations that when it happens, we have way bigger problems but also making sure this doesn't get lost among them.

  • who makes the call to trigger it?

When that happens, it'd be already an incident, so it can be part of the runbook afterwards.

  • how soon should this happen after it is detected?

My stance is that it's not that important. Anything in mw that is sensitive to replag usually has a decent mechanism to make sure it's not affected and the current state is even done randomly by devs and applied arbitrarily and nothing has exploded. I'd say it would be low prio but if we missed something and users noticed something important staying stale after the incident, we should fasten it.

  • how fast should this happen?

I don't think it needs to happen too fast. Per above. These are just worries that things might break but there hasn't been a case reported that anything actually got broken (despite many WAN cases not using this feature)

  • if this is something we'd expect the person on-call to perform?

Nah, I don't think so.

The context for adding serviceops: We want to know if that'd be okay for serviceops team to be able to run a rolling reboot (or purge) of memcached hosts in rare cases that dcs split for extended period of time and we might end up with stale cached data on mecached. If that's something serviceops can do on such rare occasions, I think it's prudent to simplify the logic in WAN code.

The answer is, we have some questions before committing to this.

Sounds good. If you want to, the original ticket has a lot more context of why it was implemented: T113204: Smart caching logic for handling cross-DC network outages

If my understanding is correct (and please correct me if not), we are removing complex code (which is great), and replacing it with a "please restart 18 (x2?) memcached servers as soon as possible".

Only one dc would be needed. Not both.

It is rare,

To my understanding brain split of dcs happened only once or twice. So it's quite rare.

sure, but:

  • how do we detect that state?
    • if the answer is "a human", then we are replacing code with not one, but two humans

The case we are talking about is that when DCs split and we have quite large replication lag in the secondary dc. I'm sure our alerting would let us know that we have this and work would be needed to re-connect DCs back. It's one of those situations that when it happens, we have way bigger problems but also making sure this doesn't get lost among them.

  • who makes the call to trigger it?

When that happens, it'd be already an incident, so it can be part of the runbook afterwards.

  • how soon should this happen after it is detected?

My stance is that it's not that important. Anything in mw that is sensitive to replag usually has a decent mechanism to make sure it's not affected and the current state is even done randomly by devs and applied arbitrarily and nothing has exploded. I'd say it would be low prio but if we missed something and users noticed something important staying stale after the incident, we should fasten it.

  • how fast should this happen?

I don't think it needs to happen too fast. Per above. These are just worries that things might break but there hasn't been a case reported that anything actually got broken (despite many WAN cases not using this feature)

  • if this is something we'd expect the person on-call to perform?

Nah, I don't think so.

Before removing said code, we should have in place something like this:

large replication lag between DCs -> alert+Related Runbook -> Dashboard with link to Runbook -> Runbook has a section "Secondary DC Memcached restart"

I understand it is a drag, but what it reads here, is yet another piece of tribal knowledge in the making.

There has been always the assumption that data in WAN is stale, at least for a bit. So I'm failing to see this fixing an actual problem that is hurting users or integrity of the data.

Stale data on read is fine indeed, because we treat reads from WANCache the same as reads from replica DBs, callers must tolerate a certain amount of lag, and otherwise require an uncached primary DB read (i.e. to inform writes during POST requests). There should generally not be a case where a read query can be satisfied by replica DB but not WANCache. The idea is that caching can be used without adding additional concens to callers.

The issue with stale DB reads is not the effect on the HTTP response (which will have a shorted CDN expiry), but the effect on future callers. Once stale data is placed into the cache, it does not self-correct.

I think as a first step, I'd like to simplify and automate this, rather than remove it. We can measure the impact with a Prometheus counter.

This would accomplish the same as your patch:

  • Remove the WANObjectCache lag and since options from the public API.
  • Remove Database::getCacheSetOptions method and need for developers to call it as part of boilerplate.

Exept, it would retain the protection in an automatic way, such that we automatically shorten the TTL for any WANObjectCache::getWithSet calls when a replica DB was used that is lagged outside acceptable bounds. This is similar to what we already do:

  • shorten HTTP response cache to $wgCdnMaxageLagged, if lagged replica is used.
  • shorten HTTP response cache to $wgCdnMaxageStale, if known-stale ParserOutput is used via lost PoolCounter lock.
  • [this task] shorten Memcached expiry to 30s, if there we involved a lagged replica DB.
  • shorten Memcached expiry to 1s, during the 10s hold-off period after a purge (via interim key).
  • shorten Memcached expiry to 1s, if using TSE-lock option, value is absent, and lost the lock (via interim value).
  • shorten Memcached expiry to 10s, if the hashed host is TKO (via mcrouter gutter pool).

I've just found an arguably even more important reason to automate this, which is the pending option that prevents writing uncommitted data to a shared cache. Depending on how we solve that without public options and boilerplate, this should be easy to solve the same way.

The current implementation handles these together so removing one without the other does not really change what developers need to do (i.e. call Database::getCacheSetOptions, which invisibly sets one or more options).

There has been always the assumption that data in WAN is stale, at least for a bit. So I'm failing to see this fixing an actual problem that is hurting users or integrity of the data.

Stale data on read is fine indeed, because we treat reads from WANCache the same as reads from replica DBs, callers must tolerate a certain amount of lag, and otherwise require an uncached primary DB read (i.e. to inform writes during POST requests). There should generally not be a case where a read query can be satisfied by replica DB but not WANCache. The idea is that caching can be used without adding additional concens to callers.

The issue with stale DB reads is not the effect on the HTTP response (which will have a shorted CDN expiry), but the effect on future callers. Once stale data is placed into the cache, it does not self-correct.

I think as a first step, I'd like to simplify and automate this, rather than remove it. We can measure the impact with a Prometheus counter.

This would accomplish the same as your patch:

  • Remove the WANObjectCache lag and since options from the public API.
  • Remove Database::getCacheSetOptions method and need for developers to call it as part of boilerplate.

Exept, it would retain the protection in an automatic way, such that we automatically shorten the TTL for any WANObjectCache::getWithSet calls when a replica DB was used that is lagged outside acceptable bounds. This is similar to what we already do:

  • shorten HTTP response cache to $wgCdnMaxageLagged, if lagged replica is used.
  • shorten HTTP response cache to $wgCdnMaxageStale, if known-stale ParserOutput is used via lost PoolCounter lock.
  • [this task] shorten Memcached expiry to 30s, if there we involved a lagged replica DB.
  • shorten Memcached expiry to 1s, during the 10s hold-off period after a purge (via interim key).
  • shorten Memcached expiry to 1s, if using TSE-lock option, value is absent, and lost the lock (via interim value).
  • shorten Memcached expiry to 10s, if the hashed host is TKO (via mcrouter gutter pool).

I understand where are you coming from but I rather first remove this and if it's actually needed add something afresh on a blank slate. I haven't seen a concrete example of lagged read and stored in cache being a problem. Devs should pick a short TTL for WAN if they think the data is sensitive and updates need to be reflected as soon as possible. And that's a good hygiene because deletion, purge, etc could fail in a distributed system.

I've just found an arguably even more important reason to automate this, which is the pending option that prevents writing uncommitted data to a shared cache. Depending on how we solve that without public options and boilerplate, this should be easy to solve the same way.

The current implementation handles these together so removing one without the other does not really change what developers need to do (i.e. call Database::getCacheSetOptions, which invisibly sets one or more options).

I would keep pending right now but I disagree on "does not really change what developers need to do". For a simple reason, 'pending' only applies if the db provided is a primary database and it gets skipped if the db is replica. And with one or two exceptions here and there, only replicas are being passed:

image.png (1,031×765 px, 251 KB)

That means we can simply remove that basically everywhere.

The context for adding serviceops: We want to know if that'd be okay for serviceops team to be able to run a rolling reboot (or purge) of memcached hosts in rare cases that dcs split for extended period of time and we might end up with stale cached data on mecached. If that's something serviceops can do on such rare occasions, I think it's prudent to simplify the logic in WAN code.

Just noting that the restarts would be for any case where replag in any DC is longer than 6 or so seconds (basically enough for the delete() tombstones to no longer be effective. In any case, this should rarely come up.

I would keep pending right now but I disagree on "does not really change what developers need to do". For a simple reason, 'pending' only applies if the db provided is a primary database and it gets skipped if the db is replica. And with one or two exceptions here and there, only replicas are being passed:

image.png (1,031×765 px, 251 KB)

That means we can simply remove that basically everywhere.

Few (preferably zero) callers use DB_PRIMARY in the value callback, which is good, though for vanilla installs, using DB_REPLICA does not mean that you don't end up with the primary DB sometimes (perhaps always) anyway.

Yeah. It was brought up here and I asked the question there. Why do we need to build such protection? Has there been a case of uncommitted data ending up in WAN in vanilla setups? The patch that introduced it is not connected to any tickets. Also since WAN relies on callers actually passing the db object which in many cases it doesn't happen so I'd argue it hasn't done the job of that protection well but nothing ever got reported on it so I'd say it was never needed in the first place.