Manage cross-DC replication according to network topology
Closed, Resolved (Public)

Description

At keyspace (logical table) creation time we currently set up static replication for the default 'datacenter1'. For multi-DC support we need to dynamically update these keyspaces to replicate to other datacenters too.

Example CQL query (see docs):

ALTER KEYSPACE "org_wikipedia_en_T_pages_html" WITH REPLICATION =
  { 'class' : 'NetworkTopologyStrategy', 'eqiad' : 3, 'codfw': 3  };

It is possible to discover information about the current topology dynamically: the available datacenters and nodes are listed in system.peers, and the local node's properties are in system.local. However, dynamically updating replication settings (especially removing replicas) could be dangerous. Another issue is coordinating such updates across the cluster: we wouldn't want a thundering herd of workers racing to update replication settings on all tables at the same time.
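
As a rough sketch (not the actual RESTBase code; the contact point and local datacenter below are placeholders), the set of known datacenters could be discovered with the Node.js cassandra-driver along these lines:

'use strict';

const cassandra = require('cassandra-driver');

// Placeholder connection settings; adjust for the real cluster.
const client = new cassandra.Client({
    contactPoints: ['127.0.0.1'],
    localDataCenter: 'datacenter1'
});

// Collect the distinct datacenter names known to the local node:
// its own DC from system.local, plus those of its peers.
function discoverDatacenters() {
    return Promise.all([
        client.execute('SELECT data_center FROM system.local'),
        client.execute('SELECT data_center FROM system.peers')
    ]).then(([local, peers]) => {
        const dcs = new Set();
        local.rows.forEach(row => dcs.add(row.data_center));
        peers.rows.forEach(row => dcs.add(row.data_center));
        return Array.from(dcs);
    });
}

discoverDatacenters().then(dcs => console.log('Known datacenters:', dcs));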

Both can be avoided by leaving such tasks to a designated management node (or script) working from a static config file. While not very elegant, this should provide a safe starting point. We can then look for a safe & distributed solution in a second step, for example with some simple ID-based leader election & a static config.
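
If we go the distributed route later, the ID-based election could be as simple as letting the instance whose hostname sorts lowest in the static config act as the coordinator. A minimal sketch (the host list is hypothetical):

'use strict';

const os = require('os');

// Hypothetical: the static config lists all RESTBase hosts; the host whose
// name sorts first is the only one allowed to alter replication settings.
function isReplicationLeader(configuredHosts) {
    const sorted = configuredHosts.slice().sort();
    return sorted[0] === os.hostname();
}

// Example with a made-up host list:
// isReplicationLeader(['restbase1001', 'restbase1002', 'restbase1003'])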

Event Timeline

GWicke renamed this task from Manage replication factors according to network topology to Manage cross-DC replication according to network topology.
GWicke raised the priority of this task to Needs Triage.
GWicke updated the task description.
GWicke changed Security from none to None.
GWicke added a subscriber: GWicke.

The current algorithm for determining replication on CREATE KEYSPACE is to place 1 copy in localDc if durability is 'low', or 3 otherwise. I wonder if it wouldn't be enough for a first pass to instead iterate the unique DCs (as grokked from system.peers) and specify 1 or 3 replicas per data-center accordingly.

This would require us to manually issue an ALTER for existing keyspaces when adding a new data-center, but that seems OK if documented clearly; adding a data-center strikes me as a pretty exceptional event.
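
A minimal sketch of that idea (the helper name is made up; the datacenter list is assumed to come from system.peers as described in the task):

// Build a NetworkTopologyStrategy replication map, placing 1 replica per
// datacenter for 'low' durability and 3 per datacenter otherwise.
function replicationOptions(datacenters, durability) {
    const replicas = durability === 'low' ? 1 : 3;
    const options = { 'class': 'NetworkTopologyStrategy' };
    datacenters.forEach(dc => {
        options[dc] = replicas;
    });
    return options;
}

// Example: replicationOptions(['eqiad', 'codfw'], 'standard')
// -> { class: 'NetworkTopologyStrategy', eqiad: 3, codfw: 3 }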

> The current algorithm for determining replication on CREATE KEYSPACE is to place 1 copy in localDc if durability is 'low', or 3 otherwise. I wonder if it wouldn't be enough for a first pass to instead iterate the unique DCs (as grokked from system.peers) and specify 1 or 3 replicas per data-center accordingly.
>
> This would require us to manually issue an ALTER for existing keyspaces when adding a new data-center, but that seems OK if documented clearly; adding a data-center strikes me as a pretty exceptional event.

A PR to implement the above: https://github.com/wikimedia/restbase-mod-table-cassandra/pull/138

After some discussion it has been decided that datacenters will be explicitly configured in RESTBase.

When a new table is created, its corresponding keyspace will be created with replication according to the current algorithm (1 replica per datacenter if durability is low, 3 per datacenter otherwise), for every explicitly configured datacenter.

Upon startup, if cluster replication differs from that of the configuration, the underlying keyspace will be ALTERed accordingly.

Since ALTERing the keyspace here isn't sufficient to immediately change actual redundancy, the need for a repair should be verbosely documented everywhere applicable.
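
As a sketch of what the startup check might do for an explicitly configured datacenter set (the config shape, keyspace handling and CQL formatting here are illustrative only, and replicationOptions is the hypothetical helper sketched earlier):

// Hypothetical: the datacenters come from explicit RESTBase configuration,
// not from cluster discovery.
const configuredDcs = ['eqiad', 'codfw'];

function alterReplication(client, keyspace, durability) {
    const options = replicationOptions(configuredDcs, durability);
    // Quick-and-dirty CQL map formatting, for illustration only.
    const cql = 'ALTER KEYSPACE "' + keyspace + '" WITH REPLICATION = '
        + JSON.stringify(options).replace(/"/g, "'");
    // Altering replication only changes metadata: a `nodetool repair`
    // on the affected keyspace is still needed to actually populate
    // the new replicas.
    return client.execute(cql);
}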

> After some discussion it has been decided that datacenters will be explicitly configured in RESTBase.
>
> When a new table is created, its corresponding keyspace will be created with replication according to the current algorithm (1 replica per datacenter if durability is low, 3 per datacenter otherwise), for every explicitly configured datacenter.
>
> Upon startup, if cluster replication differs from that of the configuration, the underlying keyspace will be ALTERed accordingly.

I think any behavior that changes the underlying "schema" (including in this case replication) should be invoked manually by default. Schema changes at startup are difficult to orchestrate and AFAIK have caused problems in the past with uncoordinated workers restarting, configuration deployment and so on.

> I think any behavior that changes the underlying "schema" (including in this case replication) should be invoked manually by default.

Indeed, this is why we are carefully orchestrating deploys.

> I think any behavior that changes the underlying "schema" (including in this case replication) should be invoked manually by default.
>
> Indeed, this is why we are carefully orchestrating deploys.

I'm not sure careful orchestration is enough; once workers (or nodes) get the new configuration, they are basically a loaded gun that can restart/trigger at any time.

> once workers (or nodes) get the new configuration, they are basically a loaded gun that can restart/trigger at any time.

That's why we orchestrate config deploys just like code deploys.

> once workers (or nodes) get the new configuration, they are basically a loaded gun that can restart/trigger at any time.
>
> That's why we orchestrate config deploys just like code deploys.

Note my original point was about schema changes at startup. I still think that's surprising behavior, especially if it requires careful coordination and risks an outage if something goes wrong. Of course this applies to all RESTBase operators, not just WMF.

We could make schema changes a separate step, to be performed manually or by the deploy system before a full cluster deploy. However, this wouldn't fundamentally change the risk profile of the schema migration itself, and would add some complexity. Manual schema migration would also add an additional failure scenario where the schema upgrade is forgotten, and new code that depends on it then fails on deploy. This would abort the deploy once the failure threshold is reached (effectively one box in our config), but some clients might see errors from new code running on an old schema.

Note that all schema changes need to be backwards-compatible, so that old code temporarily running on a newer schema works okay. Schema changes also need to be really fast, as in seconds (at most). The most common schema change so far has been adding columns, or adjusting settings.

So far, our deploy process with mandatory staging tests and automated rolling deploys with failure detection & automated aborts has worked well for us. We haven't had any issues from schema changes in production.

> Of course this applies to all RESTBase operators, not just WMF.

I hope that the release engineering team will soon have a system that is easy to use for third-party users as well, so that they too can benefit from the same deployment automation we are using in production.

> We could make schema changes a separate step, to be performed manually or by the deploy system before a full cluster deploy. However, this wouldn't fundamentally change the risk profile of the schema migration itself, and would add some complexity. Manual schema migration would also add an additional failure scenario where the schema upgrade is forgotten, and new code that depends on it then fails on deploy. This would abort the deploy once the failure threshold is reached (effectively one box in our config), but some clients might see errors from new code running on an old schema.

The risk profile to my eye with the status quo is: change the service configuration + restart -> schema change happens -> all other hosts could be broken, with no obvious way to tell what caused it or how to roll back. In other words, a local change (to the configuration) results in global/cluster effects upon service restart.

> Note that all schema changes need to be backwards-compatible, so that old code temporarily running on a newer schema works okay. Schema changes also need to be really fast, as in seconds (at most). The most common schema change so far has been adding columns, or adjusting settings.

If fast schema changes can happen for each RESTBase backend, that's great; why seconds and not minutes?

> So far, our deploy process with mandatory staging tests and automated rolling deploys with failure detection & automated aborts has worked well for us. We haven't had any issues from schema changes in production.

For the record, I'm not saying it is a bad idea per se; it is surprising behaviour, though.

> Of course this applies to all RESTBase operators, not just WMF.
>
> I hope that the release engineering team will soon have a system that is easy to use for third-party users as well, so that they too can benefit from the same deployment automation we are using in production.

That would be great; less great if a particular deployment system is required for safe operation.

> The risk profile to my eye with the status quo is: change the service configuration + restart -> schema change happens -> all other hosts could be broken, with no obvious way to tell what caused it or how to roll back.

Random config changes do not normally trigger schema migrations (in the add-column sense). So far, all schema changes have been initiated by code changes. Config changes can trigger the creation of new tables, for example for a new set of domains, but won't normally touch existing data.

In storage systems, the consequences of mistakes are more severe than for stateless services. For RESTBase, we are addressing this by having thorough test and deploy procedures, for both code and configs. By providing high-level storage abstractions with services like RESTBase, we aim to keep most other services stateless, so that they don't need to deal with the challenges of storage, replication and consistency.

> why seconds and not minutes?

All the schema changes we support as part of the start-up sequence end up only updating metadata in Cassandra, which means that they are constant-time and fast. For now, expensive operations like index re-builds or cluster re-shaping (e.g. changing replication settings) will require manual intervention to make the system fully consistent. In the meantime, the metadata migration on deploy can keep the cluster running with up-to-date schema metadata.

Change 238738 had a related patch set uploaded (by Eevans):
configure datacenter set

https://gerrit.wikimedia.org/r/238738

Change 238738 merged by Filippo Giunchedi:
configure datacenter set

https://gerrit.wikimedia.org/r/238738

We just deployed the configuration change in production that exercised this code, and updated replication there, but not entirely without issue.

Once the first node was restarted, the process of altering the keyspaces to add codfw to replication kicked off, as expected. What was unexpected was that almost as quickly as they were being changed from {eqiad: 3} to {eqiad: 3, codfw: 3}, they were being changed back. It would seem that worker restarts re-evaluate replication (in this case, according to the outdated config). Once all of the nodes were restarted, replication converged on the desired state, but not before a great deal of back-and-forth.

I agree that schema changes are potentially dangerous, but that stems from the fact that RESTBase is stateless while Cassandra isn't. I second @GWicke's thoughts:

> All the schema changes we support as part of the start-up sequence end up only updating metadata in Cassandra, which means that they are constant-time and fast. For now, expensive operations like index re-builds or cluster re-shaping (e.g. changing replication settings) will require manual intervention to make the system fully consistent. In the meantime, the metadata migration on deploy can keep the cluster running with up-to-date schema metadata.

I'd add to this that we should document such manual changes that we undertake, for our sake, but also for the sake of third-party installations.

> What was unexpected was that almost as quickly as they were being changed from {eqiad: 3} to {eqiad: 3, codfw: 3}, they were being changed back.

We overlooked that module config changes aren't covered by schema or backend versioning, which means that they aren't idempotent.

The main option I see to fix this is to add a module config version, and refuse to perform any migrations unless the version was incremented. If the version is the same but the config was changed in ways that would require a migration, then we should probably fail loudly & complain about the broken config. When the local config has an older config version than the stored one, no migrations are attempted; startup should proceed successfully, but logging a warning might be useful to make sure this state doesn't persist for long.

For all of this to work, we'll have to persist the config version in the stored schema, just like with backend & schema versions.
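
A sketch of that gating logic (names are hypothetical, and configDiffers stands in for whatever comparison the module would actually perform):

// Hypothetical gate: decide whether config-driven migrations may run.
function checkConfigVersion(storedVersion, localVersion, configDiffers) {
    if (localVersion < storedVersion) {
        // Older local config: skip migrations, but warn so that this
        // state doesn't go unnoticed for long.
        console.warn('Local config version ' + localVersion
            + ' is older than stored version ' + storedVersion
            + '; skipping config migrations.');
        return false;
    }
    if (configDiffers && localVersion === storedVersion) {
        // Config changed without a version bump: fail loudly.
        throw new Error('Config changed, but the config version was not incremented');
    }
    // Migrate only when the version was actually incremented.
    return localVersion > storedVersion;
}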

Regarding the implementation of this:

This seems like something that would be better handled with its own attribute in meta (meta.config_version?), rather than adding one to the meta.value JSON blob. Technically speaking (like _backend_version), this isn't part of the table schema, and storing it there will mean scrubbing the version out of that value in order to make the comparisons needed for schema migration. All of which gets messier/more brittle the more things we add.

I feel like I might have raised something similar once before, back when _backend_version was first introduced, but I don't remember the rationale. Does anyone remember the reason for not using a separate attribute for _backend_version when that discussion was taking place?

> storing it there will mean scrubbing the version out of that value in order to make the comparisons needed for schema migration

We still need to keep versions up to date, so we need to look at those to see if anything needs to be updated at all (including the backend version).

I do agree that _backend_version differs from the public spec version by being private to the backend implementation. Whether this would make it a good candidate for a separate column is a bit less clear to me. The impact on implementation complexity seems to be fairly minor either way.

A related consideration we had in the past was whether the meta table is actually worth having; Cassandra also supports table comments. The advantage would potentially be less memory used for extra tables. The disadvantage would be the loss of versioning on metadata updates. We don't really leverage this, but it could potentially come in handy in disaster recovery.

> storing it there will mean scrubbing the version out of that value in order to make the comparisons needed for schema migration
>
> We still need to keep versions up to date, so we need to look at those to see if anything needs to be updated at all (including the backend version).
>
> I do agree that _backend_version differs from the public spec version by being private to the backend implementation. Whether this would make it a good candidate for a separate column is a bit less clear to me. The impact on implementation complexity seems to be fairly minor either way.

The primary purpose of the schema version (the one embedded in meta.value) is to determine the validity of a schema change; we rely upon a comparison of the schema hashes to determine if a change is necessary (which is why it's OK that the version is embedded in the hashed value). This is what makes it different from _backend_version (and it's why we have to strip _backend_version when calculating the hash). The same would be true for the config version.

I was thinking the schema could be updated to look like:

{
  table: 'meta',
  attributes: {
    key: 'string',
    value: 'json',
    tid: 'timeuuid',
    config_version: 'int',
    backend_version: 'int'
  },
  index: [
    { attribute: 'key', type: 'hash' },
    { attribute: 'tid', type: 'range', order: 'desc' }
  ],
  secondaryIndexes: {}
}

On startup, we would:

  1. Evaluate meta.config_version, and if different (higher), then 'migrate' the settings accordingly
  2. Evaluate meta.backend_version, and if different (higher), then 'migrate' accordingly
  3. Evaluate whether persisted hash differs from the currently computed one, and if different (and if version is higher), then migrate accordingly
  4. If any of 1, 2, or 3 above resulted in change, persist to meta
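
A rough sketch of that four-step check (property and function names are hypothetical; the real migrators live in restbase-mod-table-cassandra):

// `stored` is the persisted meta row, `proposed` the locally computed one.
// Returns the list of migrations that would need to run; if it is
// non-empty, the meta row must be re-persisted afterwards (step 4).
function neededMigrations(stored, proposed) {
    const actions = [];
    // 1. Settings covered by the config version (e.g. replication).
    if (proposed.config_version > stored.config_version) {
        actions.push('config');
    }
    // 2. Backend-private settings.
    if (proposed.backend_version > stored.backend_version) {
        actions.push('backend');
    }
    // 3. The table schema proper: compare hashes, guarded by the schema version.
    if (proposed.valueHash !== stored.valueHash
            && proposed.value.version > stored.value.version) {
        actions.push('schema');
    }
    return actions;
}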

Another alternative is to go in the other direction: embed this new config version and _backend_version in value (i.e. make them a part of the resulting hash), and use the comparison of hashes to determine if any of config, backend, or table schema need changing. If the hashes do not match, then each of these 3 categories would have their respective versions evaluated separately, and respective changes applied. I think this could work, but it will require some refactoring of schema migration, since its design assumes that we already know the table schemas are different and that a migration is required. And ultimately, this refactoring will take us somewhat full circle, in that it will need to make some targeted comparisons of schema content (relying on version alone seems unsafe).

TL;DR I think I'd prefer that we either make value encapsulate table schema only, and add Cassandra column family attributes for additional metadata (i.e. my previous comment), or that we suck it up and add another exception to hash calculation, this time exempting the config version. The former seems a bit more 'proper', but also more invasive. The latter is the path of least code (and I could probably be convinced that it's the most pragmatic).
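
For the "exception to hash calculation" variant, a small sketch (assuming a JSON-serialised value and Node's crypto module; the exempted attribute names are illustrative, and a real implementation would want a canonical serialisation):

'use strict';

const crypto = require('crypto');

// Fields that may change without counting as a table-schema change.
const EXEMPT = ['_backend_version', '_config_version'];

function schemaHash(value) {
    const copy = Object.assign({}, value);
    EXEMPT.forEach(key => delete copy[key]);
    return crypto.createHash('sha1')
        .update(JSON.stringify(copy))
        .digest('hex');
}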

> TL;DR I think I'd prefer that we either make value encapsulate table schema only, and add Cassandra column family attributes for additional metadata (i.e. my previous comment), or that we suck it up and add another exception to hash calculation, this time exempting the config version. The former seems a bit more 'proper', but also more invasive. The latter is the path of least code (and I could probably be convinced that it's the most pragmatic).

This WIP pull request implements the latter of these. Thoughts?

Conceptually, I think having the hash cover everything that could trigger a schema migration is the cleanest (and fastest) approach. This is how we started; we just ended up adding special-case checks for the versions.

> If the hashes do not match, then each of these 3 categories would have their respective versions evaluated separately, and respective changes applied. I think this could work, but it will require some refactoring of schema migration, since its design assumes that we already know the table schemas are different and that a migration is required.

From what I can see in the code, all of the migrators are already idempotent. For example, if no columns are added or removed, then the attribute migrator ends up doing nothing. They had better be, as each of them is called whenever any aspect of the schema changes. @Eevans, do you see any migrator that isn't idempotent yet?

Overall, I think the code structure wouldn't need to change much. The main changes seem to be:

  • include versions in hash
  • return early if the hash matches
  • else, apply each of the migrations, each with their own version / schema equality check.
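
A condensed sketch of that flow (the migrator interface and persist callback are stand-ins, not the module's actual API):

// `migrators` expose validate()/migrate(); `persist` writes the new schema.
function maybeMigrate(migrators, current, proposed, persist) {
    // Versions are included in the hash, so identical hashes mean a no-op.
    if (current.hash === proposed.hash) {
        return Promise.resolve();
    }
    // Validation pass first: each migrator checks its own version /
    // equality constraints and throws on an invalid change.
    migrators.forEach(m => m.validate(current, proposed));
    // Only then apply the (idempotent) migrations in order, and finally
    // write the updated schema.
    return migrators
        .reduce((p, m) => p.then(() => m.migrate(current, proposed)),
                Promise.resolve())
        .then(() => persist(proposed));
}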

While looking at the code, I noticed that we currently support dropping columns. This seems to be risky, as it can break older code. Should we comment out this capability?

Another side effect of special-casing the backend and config migrations is that we don't actually have an atomic validation stage any more, as originally envisioned. To remedy this, we should probably model the backend and config migrators on the version migrator, so that we get a clean validation pass *before* any migration is actually performed.

Another simplification is that we should be able to get rid of the schemaWriteNeeded flag. Basically, whenever the hash differs and validation / migration succeeds, the schema needs to be written.

IIRC, when we discussed this on IRC the last time, we were wondering how to best implement the dependency between the version check & the actual changes. I just realized that we could let each migrator take care of that in its validator stage. Basically:

Prop.prototype.validate = function(parentMigrator, current, proposed) {
  // Throws if current.prop differs from proposed.prop, 
  // and the version wasn't incremented.
  if (hasChanged(current.prop, proposed.prop, current.version, proposed.version)) {
    this.changes = ...;
  }
}

The main change is that we'd pass the full schema into each migrator, and define a small helper for the check.

Edit: This is now implemented in https://github.com/wikimedia/restbase-mod-table-cassandra/pull/166.