Page MenuHomePhabricator

Audit *all* datastores to see how DR failover is handled
Open, NormalPublic

Description

We have a good mix of storage systems (mariadb primaries, mariadb external stores, swift, redis (main stash), cassandra.

For stores that are strongly consistent over DCs (we don't have many), there isn't much to do. Though such stores would only be in two DCs, which is odd for an HA system.

For last-write-wins stores like cassandra, we need to know:
a) How much data might be lost on emergency switch-over? (LOCAL writes in cassandra for example, can we get a sense of the lag of getting data to remote DCs?)
b) If "newer" data at the time that was lost is visible again and then replicates, does that hurt anything?
c) Are any functional data dependencies assumed between these stores and other stores that matter (e.g. mariadb rows that have corresponding restbase entries)
d) If QUORUM reads/writes are used, they might fail if 1 of 2 DCs are down...that implies a need for switch over...
e) How do we go read only and "wait for things to catch up" with scripts?

For eventually consistent stores like swift + swiftrepl, we need to know:
a) How much data might be lost on switch over (can we have a sense of "lag times", hard to do since there is no replication log)
b) Are any functional data dependencies assumed between these stores and other stores that matter (e.g. mariadb rows that have corresponding restbase entries)
c) What do we need for fast master/slave cluster switchover?
d) How do reconcile swift DELETEs safely without tombstones? Should MW use temp 404 tombstones? Would SHA1 original paths help?
e) How do we go read only and "wait for things to catch up" with scripts?

For replicated stores like mariadb slaves, we need to know:
a) How much data might be lost on switch over (how robust is "too lagged; read only" mode?)
b) Are any functional data dependencies assumed between these stores and other stores that matter (e.g. mariadb rows that have corresponding restbase/ES entries)
c) How do we go read only and "wait for things to catch up" with scripts?

For caches (memcached/CDN), we need to make sure:
a) they are not used to determine writes
b) if failover causes some recent data to be lost, then bogus entries might matter; should WAN cache support generation time range blacklisting? do we also do CDN bans based on timestamp?
c) what kind of bogus entries we are willing to tolerate

Since some stores have several users ("calling code"), we need to check each. For example, users of FileBackend::doQuickOperations() can tolerate sloppy LWW logic and the occasional DELETE due to swift-repl changing direction. For FileBackend::doOperations() calls, like originals, we want to be more careful (like use 404 tombstones to know a missing file was deleted and not just missing due to switchover).

A general issue is also that of config. MediaWiki has lots of config deciding what DB/swift to talk to. If stale config is still running in prod due to network partitions (breaking scap and so on), we can still have problems. One option is to have config switches around "what DC is active". The active DC name could be pulled from a file, managed by ecd. We could have apaches go read only if that file is out of contact with ecd. I'll spin this off mostly to another task: T114273.

Some notes archived at https://etherpad.wikimedia.org/p/multi-dc-mediawiki

Event Timeline

aaron created this task.Sep 30 2015, 5:31 PM
aaron claimed this task.
aaron raised the priority of this task from to Normal.
aaron updated the task description. (Show Details)
aaron added subscribers: Krinkle, jcrespo, Glaisher and 13 others.
aaron added a comment.Sep 30 2015, 5:36 PM

I can take on most of the auditing (since it is mostly MW code). gwicke/marco will need to look at some restbase bits.

Some of the questions are operational though, like measuring/bounding the lag of datastores and making sync-switchover scripts.

aaron updated the task description. (Show Details)Sep 30 2015, 5:40 PM
aaron set Security to None.
GWicke added a comment.EditedSep 30 2015, 10:39 PM

Here is a first pass for Cassandra:

For last-write-wins stores like cassandra, we need to know:
a) How much data might be lost on emergency switch-over? (LOCAL writes in cassandra for example, can we get a sense of the lag of getting data to remote DCs?)

Short answer: In most cases / the longer run, approximately none.

Longer answer: All writes are sent to local and remote nodes at once, so the inconsistency window is fairly small, typically a small multiple of the network latency (see this handy calculator). When the broken DC comes back, eventual consistency & MVCC will fix the remaining inconsistencies. However, if there are writes based on inconsistent data right after fail-over, then there could be some forms of corruption. If the former primary DC burnt down, then it is possible that some writes that didn't reach the remote DC are permanently lost.

b) If "newer" data at the time that was lost is visible again and then replicates, does that hurt anything?

Again, this depends on how Cassandra is used. In RESTBase's data model it should mostly not matter (MVCC), but it could potentially matter for CAS use cases, and if we did distributed transactions. For the edit use case, it is for example possible that we'd miss an edit conflict if a save only reached the broken DC, but not (yet) the new "primary". Assuming that the broken DC eventually comes back, no data should however be lost permanently.

c) Are any functional data dependencies assumed between these stores and other stores that matter (e.g. mariadb rows that have corresponding restbase entries)

Currently, RESTBase is strictly a slave to MySQL data. If MySQL loses data about a revision in a switch-over that RESTBase has already stored, then this information would indeed be available from RESTBase without having a corresponding MySQL entry. Not a catastrophe, but something to keep in mind.

d) If QUORUM reads/writes are used, they might fail if 1 of 2 DCs are down...that implies a need for switch over...

We generally use localQuorum, which works on a quorum in the local DC. There is no concept of primary / secondary at the Cassandra layer.

e) How do we go read only and "wait for things to catch up" with scripts?

For RESTBase as it is right now, I see a couple of options:

  1. revoke write right rights at the Cassandra user level: safe and general, but a big hammer
  2. Set a flag in RESTBase that rejects all write (PUT / POST) accesses at the cassandra table layer. This might be generally useful ("Anterograde amnesia mode"), and shouldn't be much more than a couple of lines. Mainly need to think about how to signal this.
  3. Rely on PHP API requests erroring on read. If those reads are temporarily inconsistent, this could however result in things like corrupted Parsoid parses, which we don't want to store.

the inconsistency window is fairly small

So you provide a theoretical model, but not actual measurements. Even something as simple as an fsync() can be very unpredictable under load. Measure, measure, measure is the key to performance. (which is the point of this ticket)

Currently, RESTBase is strictly a slave to MySQL data. If MySQL loses data about a revision in a switch-over that RESTBase has already stored, then this information would indeed be available from RESTBase without having a corresponding MySQL entry. Not a catastrophe, but something to keep in mind.

I strongly disagree with this. (Not a catastrophe). Not deleting entries that were deleted on MySQL lead to security issues in the past.

If "RESTBase is strictly a slave to MySQL data", (I am not saying that is a good thing, but assuming that) shouldn't you have 2 separate clusters, one on each datacenter? Also, (question) at which exact point in the stack execution does data get inserted on Cassandra?

We should be very critical and skeptic with *all our data storages*. If someone is so optimistic about Availability and consistency is because he hasn't suffered enough. :-)

@aaron I do not understand which is the model that you want to audit:

  • A split brain scenario
  • A controled failover (switchover)
  • An emergency failover with or without some resources still running on the original datacenters
  • Recover from failover, being in read only mode or in read-write on a secondary datacenter

We are definitely not ready for some of that (e.g. there is not replication setup -yet- from a seconday datacenter back into the primary datacenter), and we should analyze them separately.

aaron added a comment.EditedOct 1 2015, 8:06 AM

@jcrespo: Sorry to be a bit all over the place. Indeed there multiple major failure cases that are interesting. We will end up having to look at several to have decent fail over plans. Phab subtasks will be useful here :-)

IMO, the #1 case we should work out first is totally controlled switch-over. Say the master DC has poor connectivity for lots users and we want to switch traffic the codfw. In that case, we have no unexpected crashes or partitions (assume we can ssh in just fine) and it's a matter of running scripts to:
a) go read only
b) wait for the slave DC to catch up (mariadb shines here due to replication using a serial log per channel, so it's well defined)
c) switch codfw to being the authoritative data source (e.g. master promotion, CDN rules)
d) go off read only

Getting that case down is a good starting point, and it should be the easiest. It would also be what we test once we think we have the scripts working.

mariadb shines here due to replication using a serial log per channel, so it's well defined

Same thing I said to Gwicke, let's not get so optimistic so early :-) (for example, there are 7 shards + 2 rw es servers, each with a replication channel).

Let me gather some thoughts and I will try to help define all pain points.

IMO, the #1 case we should work out first is totally controlled switch-over.

+1. Tearing down the problem into smaller pieces and going step by step is valuable in this case for other, future (catastrophic) cases.

Same thing I said to Gwicke, let's not get so optimistic so early :-)

I hear your concerns about suffering pain (and you are right), but let's not overreact and remember that currently we are suffering the most (as in, eqiad goes down and we are really down). Just an optimistic thought to get us through the process :)

So you provide a theoretical model, but not actual measurements. Even something as simple as an fsync() can be very unpredictable under load. Measure, measure, measure is the key to performance. (which is the point of this ticket)

Just a reminder that Cassandra is eventually-consistent so not propagating a write to the other DC can lead to minimal inconsistencies (factoring in RESTBase's data model). AFAIK, the same is true for the Maps service and the upcoming Analytics RESTBase cluster.

I strongly disagree with this. (Not a catastrophe). Not deleting entries that were deleted on MySQL lead to security issues in the past.

Right, but I think at this point these are just edge cases that ought to be kept in mind. For the initial assessment the overall picture we are drawing is that we can live with a couple of writes/deletes missing.

If "RESTBase is strictly a slave to MySQL data", (I am not saying that is a good thing, but assuming that) shouldn't you have 2 separate clusters, one on each datacenter?

As @GWicke noted, there is no concept of primary/secondary/tertiary DC. Also, we do not replicate MySQL data in the slave sense, writes and updates are induced by clients (see below the answer to your next question).

Also, (question) at which exact point in the stack execution does data get inserted on Cassandra?

There are two scenarios were data is written to Cassandra: (a) a client requests something that hasn't been stored yet; and (b) a request with the Cache-Control: no-cache header is received. The latter trick is used to track updates: we have a MW extension that hooks into relevant page updates (create, edit, delete, undelete, revdelete).

When the data is not found (or it's a no-cache request), a call to the MW API is placed for the revision info, and a request goes out to Parsoid for rendering. When these come back, they are stored in Cassandra and returned to the client.

GWicke added a comment.EditedOct 1 2015, 4:10 PM

I strongly disagree with this. (Not a catastrophe). Not deleting entries that were deleted on MySQL lead to security issues in the past.

This is not what this is about. The case I'm referencing here is a write that happened in MySQL & was replicated in Cassandra (no replication lag), but didn't make it to the second DC before switch-over at the MySQL layer due to replication lag. It's data loss, not a delete.

but didn't make it to the second DC at the MySQL layer due to replication lag

And that data loss can be a delete not replicated at the MySQL layer...

GWicke added a comment.EditedOct 1 2015, 4:14 PM

but didn't make it to the second DC at the MySQL layer due to replication lag

And that data loss can be a delete not replicated at the MySQL layer...

Right, in which case Cassandra might have the delete, but MySQL might not. The argument is about head-of-line replication blocking in MySQL vs. parallel execution in Cassandra's case.

That said, in RESTBase's case there are other aspects to consider in how changes propagate. RESTBase is updated asynchronously, based on the job queue (currently), and soon the Kafka event bus. The latter will be replicated as well, and should have reasonably low latency. RESTBase updates are idempotent, so we could consider re-running updates from the event bus after switch-over, to remove inconsistencies due to update propagation delays just before fail-over.

The upside of parallel replication is reasonable throughput and low replication latency, which minimizes the chances for data loss from replication lag. The downside of parallel replication is the chance for inconsistency between systems, especially when replication latencies differ. This can be partially addressed by making dependent updates idempotent, but it'll be difficult to completely eliminate the chance of some inconsistency without investing significantly into tracking causality through the system.

jcrespo added a comment.EditedOct 1 2015, 4:36 PM

Sorry, Cassandra is perfect,has no lag, and cannot lose data- I forgot about that.

While you discuss, I will work to make sure that MySQL's parallel replication is well configured, semi-sync is tuned for datacenter failover, while we still run on 3-year old spining disks.

(Check the history of the comment I am answering to)

GWicke added a comment.EditedOct 1 2015, 5:22 PM

Sorry, Cassandra is perfect,has no lag, and cannot lose data- I forgot about that.

@jcrespo, you seem to be arguing against a strawman. Nobody is saying that Cassandra is perfect & MySQL is not. They are two systems making different trade-offs (and I said as much in T114271#1693696), that's all.

GWicke added a comment.EditedOct 1 2015, 5:31 PM

I think it would make sense to more clearly define the scenarios we are interested in. Here is an attempt:

  1. Planned fail-over: Both DCs are up, we control when we switch to read-only mode, and when the final switch happens. To make testing this in production realistic, impact on users should be minimal. What this means exactly is up for discussion, but more than a couple minutes of read-only mode is probably something we'd want to avoid.
  1. Temporary fail-stop: Complete power loss or network cut in the former primary DC, but power / network eventually comes back.
  1. Temporary high packet loss: Network issues cause high packet loss to the primary DC. Some requests still get through. Depending on the source of the issue, replication might be affected (?).
  1. Permanent fail-stop: Old primary DC is hit by comet / burns down, hardware is permanently lost.
aaron added a comment.Oct 1 2015, 6:14 PM

As I said above, the planned fail-over is the first case to look at. Some questions I have for Cassandra are:

  • a) Since we don't do WAN quorum rights (full quorum as opposed to local quorum), how far behind could the remote DC nodes get before Cassandra stops accepting new writes? Forever? I get that it tries to write in parallel, but even if it is configured to make sure one of the writes is in the remote DC, it may fail and timeout. It may be fine if no network problems happen, but we can't assume that.
  • b) Are there Cassandra command scripts to "go read only" and "wait for all replicas for all logical shards to be in sync"? In the controlled switch case, we want to *know* that everything is in both places. We can do that with mariadb and master/slave position commands. We will want counterpart commands for Cassandra.
aaron added a comment.Oct 1 2015, 6:35 PM

I spun off T114398 for planned DC switch case, which is mostly about sync/promotion/config scripts.

Given that , I suppose this task can focus on the unplanned DR aspects. From the list @jrespo had, we can focus on:
a) "An emergency failover with or without some resources still running on the original datacenters".
b) "A split brain scenario" where usual switch over scripts might not work because the DCs can't talk to each other. Also when the partition ends, we don't want to stores thinking they a "leader" or something.

Things like "Temporary fail-stop". "Temporary high packet loss", temporary split-brain all are implicitly coupled to these because how each store reacts to those decides how much data is lost or mismatched when we do the emergency switch-over. I don't think we can look at them completely separately.

GWicke added a comment.EditedOct 2 2015, 12:50 AM

As I said above, the planned fail-over is the first case to look at. Some questions I have for Cassandra are:

  • a) Since we don't do WAN quorum rights (full quorum as opposed to local quorum), how far behind could the remote DC nodes get before Cassandra stops accepting new writes? Forever?

Yes, forever, if we don't look at alerts, and the network connection is permanently down. This is by choice, as implied by using localQuorum rather than quorum.

I get that it tries to write in parallel, but even if it is configured to make sure one of the writes is in the remote DC, it may fail and timeout. It may be fine if no network problems happen, but we can't assume that.

There are mechanisms like hints that will retry individual failed requests within a configurable time frame (several hours by default). The number of outstanding hints is one of the metrics we have alerts on. If the network is partitioned longer than the configured hint window, then a repair needs to be initiated to get both clusters in sync. We are planning to run those repairs at regular intervals, ideally once every 24 hours.

  • b) Are there Cassandra command scripts to "go read only"

Revoking write permissions would enforce this. See T114271#1691351 for other options.

and "wait for all replicas for all logical shards to be in sync"? In the controlled switch case, we want to *know* that everything is in both places.

A repair makes sure that this is the case even if there are long partitions. In the short term, having no writes and no outstanding hints is a very strong indication that everything is caught up. In normal operation, this is the case seconds after stopping writes. In any case, there is no master / slave replication relationship to switch, so there is no need to wait for everything to catch up in order to avoid data loss. Writing in both DCs simultaneously is fully supported, and outstanding write requests from one DC will finish normally when switching all app write operations from one DC to another.

jcrespo added a comment.EditedOct 5 2015, 10:41 AM

@aaron I am answering some of these questions

For replicated stores like mariadb slaves, we need to know:
a) How much data might be lost on switch over (how robust is "too lagged; read only" mode?)

Right now, the mysql model uses semi-sync replication. That means that, at least, in order to return COMMIT as successful, data has to be acknowledged on at least another server. However, semi-sync is not a fully-synchronous replication, and that is good for performance reasons, but has the following problems:

  • Only 1 server has to acknowledge it, which means the others can be lagged. In particular, normally, the server that acks will be closer, and not on a remote datacenter
  • Acknowledge doesn't mean apply (it is an "eventually consistent model"), which means that slaves can still see older data during some period. The lag measures the time the queries are commited, not when they arrive remotely, and can be applied no matter the master state
  • There is a 1 second timeout until any of the servers respond. If one doesn't within that second, it goes back to regular async mode for performance reasons.

With this model we can take that losing even 1 transaction while on normal state(more on this later) that has been acknowledged to the client within the same datacenter is pretty small, and it would require a full power down of different racks at the same time. However, the synchronous model is not be carried away to a different datacenter, as the roundtrip would be too large for it to work. I think that the expectations are, in the case of a datacenter-wide emergency, to be able to lose some data, with maybe manual recovery after the fact.

I mentioned "normal state", and the main issue why there can be lag is not purely technical- Master load is very low, compared to read data. Serialization is not an issue by itself (replication can be done in parallel, and out of order, but it is still serialized on the binary logs), However, the application, at times, performs multi-second transactions that, by its own nature, will take the same amount of time of being written to its slaves. There are several reasons why that is happening, and those are tracked on: T95501, but there are 2 things to avoid that- avoiding those transactions and switching to ROW-based replication. In general, though, lag to a separate datacenter is lower than 1 second 90% of the time, and lower than 10 seconds 99%. There is, however, frequently large spikes due to poorly written extensions and maintenance that have to either be fixed or banned.

If we go to the external storage store for the revision data we can see that lag never gets higher than 1 second between datacenters, due to its simpler data model (append only, key-value, all primary key updates).

Currently, lag on the secondary datacenter is *not monitored by mediawiki*, so it will not go to read only if on of the masters on the other datacenter is lagged. I think this should be the desirable modus operandi- report but not be blocked by a passive datacenter. However, mediawiki could be more aware of it, specially for maintenance operations.

b) Are any functional data dependencies assumed between these stores and other stores that matter (e.g. mariadb rows that have corresponding restbase/ES entries)

Yes. There is logical dependencies, but not functional. Let me explain. There is a lost record T26675 on the core databases pointing to a non-existent ES entry (this is logical corruption, the cluster it points to never existed). Mediawiki, however, continues working despite it being non-existent and other revisions are not affected. A similar thing could be said about restbase, etc.

In the event of an emergency failover, while being in readwrite mode, different shards could end up on different states. While I said before, there is no strong dependency between them, but there is some: user authentication relays on centralauth (s7) for the rest of the wikis. Some functionalities depend on tables on meta (centralnotice, oauth, etc.). Metadata for images usage is on each wiki separatelly, despite being mostly centralized on commons. Etc. While update should have a logical correlation, due to replication being async, you cannot guarantee that the metadata for a revision was stored, while the actual page didn't, and viceversa. Again, always in the context of the 2 separate datacenters with data loss.

It is true that those inconsistencies are usually only found moths after the fact by bots scanning all tables or labs users with direct table access.

c) How do we go read only and "wait for things to catch up" with scripts?

This is the easy part, if we consider first the option "controlled failover", setting the wikis on read-only mode, followed by setting mysqls in read-only will block all writes on the master and eventually (between 1 and 100 seconds), the secondary datacenter should have all data. I can help write it so that it makes the master-read-only, waits for lag until it reaches 0, then it sets the new master as read/write and starts pt-hearbeat there.

Things that can go badly here:

  • job queues are lost? We suffered this in the past. Tables like templatelinks, pagelinks, etc. are updated in the background. On a datacenter failover, ideally, we should wait for all background tasks to finish, otherwise those will not get executed. Again, they do not create user-created data loss, only some of the special pages "e.g. pages that link here" will become outdated, and should be fixable with manual cache purges.
  • Master fails at the same time than a datacenter failover is required. Pretty interesting, because in this case, we would go to read-only, so lag would not be an issue, but latests edits would be in 1) the transaction log of the failed master and 2) at least one of the primary slaves. We do not have right now a "fast" way to promote a slave. We should investigate it.
  • There is no currently active replication back from dallas -> eqiad. This is trivial to setup, I could even setup a circular replication active all the time. But right now I would like to avoid setting it up automatically because dallas is considered non-production by many people and that would make its master a target for "test work", despite being in read-only mode.

More issues:

parsercaches on codfw are right now non-existent (they point to eqiad pcs). They have to be setup. Probably, without replication (local only). However, I fear for its performance when non-warmed up (the same for memcached, at al levels).

How to avoid being crushed by our own cold caches? For regular MySQLs that is not an issue because replication makes sure all written rows are hot, but for other services like memcached, parsercache, etc., that could lead to mysqls being overloaded with connections.

Krinkle removed a subscriber: Krinkle.Oct 20 2015, 3:23 AM

Update: circular replication is tracked on T124698 and parsercaches on codfw are active and working (T121879).

Peter removed aaron as the assignee of this task.Sep 13 2017, 11:38 AM
Peter edited projects, added Availability (MediaWiki-MultiDC); removed Availability.
Peter moved this task from Backlog: Small & Maintenance to Radar on the Performance-Team board.
Peter edited projects, added Performance-Team (Radar); removed Performance-Team.