
Cassandra schema creation seems unreliable
Open, Medium, Public

Description

Cassandra schema creation in RESTBase has always been somewhat flaky, but it seems to have gotten worse in the new cluster. Presumably this would be the result of either changes in the newer version of Cassandra (currently 3.11.0), our new application code, or the differing conditions/load imposed by the new code on the cluster (or even some combination).

Queries to Cassandra upstream would seem to suggest that no one else is having this issue in the 3.11.x series, and while this doesn't preclude the possibility of a regression, it should be enough to warrant a thorough assessment of what can be done on our side before pursuing it further (we have had serious bugs in this code on more than one occasion).

One straightforward improvement would be to serialize schema changes on a single node. Currently, when RESTBase issues Cassandra schema changes, it uses a general purpose connection pool created at startup, one that is populated with all nodes in the cluster. This means it is possible that each schema altering statement could be executed on a different host, which is not recommended.

Additionally, RESTBase aggressively retries failures, some of which may be timeouts while waiting for schema agreement. If the "failure" is simply that a schema mutation took longer to make its way around the cluster and reach agreement, then following it up with a retry of the same statement is a form of concurrency in its own right, and only likely to make matters worse.

Some prototype code was put together that uses a single connection pool (via a WhiteListPolicy) to ensure that schema changes are serialized to one host. A script was created to use this to issue DDL statements as defined in a YAML-formatted file. Additionally, the script allows the max agreement wait time to be increased, as well as performing a separate validation of agreement before continuing with subsequent statements. This was tested by manually creating the tables necessary for some recent migrations. The results, while still not perfect, are very much improved over what we were seeing when RESTBase auto-generated schema.
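For illustration, the agreement validation amounts to comparing the schema_version reported by system.local with those reported by system.peers, and proceeding only once every node reports the same value. Below is a minimal sketch of such a check (not the prototype itself); it assumes a cassandra-driver Client created with promiseFactory: P.fromCallback, as in the scripts further down in this task:

"use strict";

const P = require('bluebird');

// Minimal sketch: poll until every peer reports the same schema_version as
// the local node, or until maxWaitMs elapses. Resolves to true on agreement,
// false on timeout. `client` is a cassandra-driver Client created with
// promiseFactory: P.fromCallback.
function waitForSchemaAgreement(client, maxWaitMs, intervalMs) {
    maxWaitMs = maxWaitMs || 30000;
    intervalMs = intervalMs || 1000;
    const deadline = Date.now() + maxWaitMs;
    const check = () => P.join(
        client.execute('SELECT schema_version FROM system.local'),
        client.execute('SELECT schema_version FROM system.peers'),
        (local, peers) => {
            const versions = new Set([ local.rows[0].schema_version.toString() ]);
            peers.rows.forEach((row) => versions.add(row.schema_version.toString()));
            return versions.size === 1;
        }
    ).then((agreed) => {
        if (agreed) { return true; }
        if (Date.now() >= deadline) { return false; }
        return P.delay(intervalMs).then(check);
    });
    return check();
}

module.exports = waitForSchemaAgreement;

Calling something like this between DDL statements (instead of retrying the statement when agreement is slow) avoids piling additional schema mutations onto a cluster that is merely slow to converge.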

At a minimum, we should update RESTBase's schema creation and migration code to incorporate these techniques. However, perhaps it would be worth going a bit further:

Update: 2018-07-03

Presumably this would be the result of either changes in the newer version of Cassandra (currently 3.11.0), our new application code, or the differing conditions/load imposed by the new code on the cluster (or even some combination).

A number of factors conspire to make most of the hosts in the new cluster IO-bound: the increased read/write amplification of the new strategy, the aberrant performance of the Samsung SSDs (devices purchased outside of approved channels), and the move to JBOD meant to contain the blast radius of a single failed device. It is these IO-bound hosts that limit the progression of schema modifications, and this would seem to be the sole cause of any new instability between the previous cluster and this one.

One straightforward improvement would be to serialize schema changes on a single node. Currently, when RESTBase issues Cassandra schema changes, it uses a general purpose connection pool created at startup, one that is populated with all nodes in the cluster. This means it is possible that each schema altering statement could be executed on a different host, which is not recommended.

Additionally, RESTBase aggressively retries failures, some of which may be timeouts while waiting for schema agreement. If the "failure" is simply that a schema mutation took longer to make its way around the cluster and reach agreement, then following it up with a retry of the same statement is a form of concurrency in its own right, and only likely to make matters worse.

Some prototype code was put together that uses a single connection pool (via a WhiteListPolicy) to ensure that schema changes are serialized to one host. A script was created to use this to issue DDL statements as defined in a YAML-formatted file. Additionally, the script allows the max agreement wait time to be increased, as well as performing a separate validation of agreement before continuing with subsequent statements. This was tested by manually creating the tables necessary for some recent migrations. The results, while still not perfect, are very much improved over what we were seeing when RESTBase auto-generated schema.

Serializing each migration, doing so from a single node, and robustly guaranteeing agreement before proceeding improves the situation immensely (it makes the process reliable, if slow), but the only correct solution is to eliminate the IO-bounding.


Proposal

Decouple schema creation and migration from RESTBase startup; utilize an (idempotent) schema upgrade script, issued prior to code and/or config deploy (similar to MediaWiki, for example).

Rationale

RESTBase's attempts at making schema creation and migration fully automatic have been a recurring source of issues, despite the non-trivial amount of resources spent to make it work. That it works this way at all is an attempt to optimize the installation and upgrade process so as to abstract Cassandra specifics from third-party users. However, it seems unlikely that third parties with a use for RESTBase (at the scale that justifies Cassandra) exist in sufficient numbers to warrant the effort, that they lack the sophistication for a more manual process, or that schema management alone is enough to usefully abstract Cassandra from users.

Decoupling the process allows for a simpler, easier to maintain code base. Database changes can occur in the background, without disruption to the running application, and failures can be caught and dealt with long before a deploy.
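As a rough sketch of what such a pre-deploy step could look like (the file layout, script and module names are illustrative, not an existing tool), the upgrade script would read a list of idempotent DDL statements from a YAML file, pin its connection pool to a single host, and apply the statements one at a time, blocking on schema agreement between each:

"use strict";

const cassandra = require('cassandra-driver');
const fs = require('fs');
const yaml = require('js-yaml');
const P = require('bluebird');

// Hypothetical agreement helper, as sketched earlier in this description.
const waitForSchemaAgreement = require('./wait-for-schema-agreement');

// The YAML file (path given as the first argument) is assumed to contain a
// list of idempotent DDL statements: CREATE KEYSPACE IF NOT EXISTS ...,
// CREATE TABLE IF NOT EXISTS ..., ALTER TABLE ..., etc.
const statements = yaml.safeLoad(fs.readFileSync(process.argv[2]));

// Pin the pool to a single contact point so that every statement is
// coordinated by the same node (placeholder host).
const balancing = cassandra.policies.loadBalancing;
const contact = 'localhost:9042';
const client = new cassandra.Client({
    contactPoints: [ contact ],
    promiseFactory: P.fromCallback,
    policies: {
        loadBalancing: new balancing.WhiteListPolicy(new balancing.RoundRobinPolicy(), [ contact ])
    }
});

client.connect()
    .then(() => P.each(statements, (cql) => {
        console.log(`-- applying: ${cql.split('\n')[0]}`);
        return client.execute(cql)
            .then(() => waitForSchemaAgreement(client))
            .then((agreed) => {
                if (!agreed) {
                    throw new Error('schema agreement not reached; aborting');
                }
            });
    }))
    .finally(() => client.shutdown());

Run as a deploy step, any failure surfaces before the new code is rolled out, and because every statement is idempotent, re-running the script is harmless.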

Event Timeline

Eevans created this task. Oct 26 2017, 4:30 PM
Restricted Application added a subscriber: Aklapper. Oct 26 2017, 4:30 PM
Eevans triaged this task as Medium priority. Oct 26 2017, 4:30 PM
Eevans moved this task from Backlog to Next on the User-Eevans board.

I put together a script to use in troubleshooting schema agreement, which also needs to pin queries to a specific node; lines 81-97 below are a working example of how to do this:

1// -*- mode: js2; -*-
2"use strict";
3
4
5const align = require('string-align');
6const cassandra = require('cassandra-driver');
7const colors = require('colors/safe');
8const fs = require('fs');
9const yaml = require('js-yaml');
10const P = require('bluebird');
11
12const balancing = cassandra.policies.loadBalancing;
13const green = colors.green;
14const red = colors.red;
15const yellow = colors.yellow;
16
17const argv = require('yargs')
18 .usage('Usage: $0 <keyspace> <table>')
19 .demandCommand(2)
20 .options('h', {alias: 'help'})
21 .options('H', {
22 alias: 'hostname',
23 default: 'localhost',
24 describe: 'Contact hostname',
25 type: 'string'
26 })
27 .options('P', {
28 alias: 'port',
29 default: 9042,
30 describe: 'Contact port number',
31 type: 'number'
32 })
33 .options('u', {
34 alias: 'username',
35 default: 'cassandra',
36 describe: 'Cassandra username',
37 type: 'string'
38 })
39 .options('p', {
40 alias: 'password',
41 default: 'cassandra',
42 describe: 'Cassandra password',
43 type: 'string'
44 })
45 .options('c', {
46 alias: 'config',
47 describe: 'RESTBase configuration file',
48 type: 'string'
49 })
50 .argv;
51
52if (argv.help) {
53 require('yargs').showHelp();
54 process.exit(0);
55}
56
57const host = argv.hostname;
58const port = argv.port;
59const keyspace = argv._[0];
60const table = argv._[1];
61const contact = `${host}:${port}`;
62
63var user, pass;
64
65function getConfig(fname) {
66 const config = yaml.safeLoad(fs.readFileSync(fname));
67 return config.default_project['x-modules'][0].options.table;
68}
69
70// Parse the user/pass from a RESTBase config, if supplied.
71if (argv.config) {
72 const conf = getConfig(argv.config);
73 user = conf.username;
74 pass = conf.password;
75}
76else {
77 user = argv.username;
78 pass = argv.password;
79}
80
81/** Creates a single entry whitelisting loadbalancing policy. */
82const whitelist = (contact) => {
83 return new balancing.WhiteListPolicy(new balancing.RoundRobinPolicy(), [ contact ]);
84};
85
86/** Creates a single connection pool. */
87function connect(contact, username, password) {
88 const client = new cassandra.Client({
89 contactPoints: [ contact ],
90 authProvider: new cassandra.auth.PlainTextAuthProvider(username, password),
91 sslOptions: { ca: '/dev/null' },
92 promiseFactory: P.fromCallback,
93 queryOptions: { consistency: cassandra.types.consistencies.one },
94 policies: { loadBalancing: whitelist(contact) }
95 });
96 return client.connect().then(() => client);
97}
98
99function getNodesList() {
100 return connect(contact, user, pass)
101 .then((client) => {
102 return P.resolve(client.hosts.keys()).finally(() => client.shutdown());
103 });
104}
105
106const cqlQuery = 'SELECT keyspace_name,table_name,id FROM system_schema.tables WHERE keyspace_name=? AND table_name=?';
107
108getNodesList()
109 .then((nodes) => {
110 return P.each(nodes, (node) => {
111 return connect(node, user, pass)
112 .then((client) => {
113 let lastId;
114 let colorFun = green;
115 return client.execute(cqlQuery, [keyspace, table])
116 .then((res) => {
117 if (res.rows.length < 1) {
118 console.log(`${yellow(align(node, 20, 'left'))}`, red('| NOT FOUND'));
119 }
120 res.rows.forEach((row) => {
121 if (lastId && (lastId !== row.id))
122 colorFun = colors.red;
123
124 console.log(yellow(align(node, 20, 'left')),
125 colorFun(`| ${row.keyspace_name} | ${row.table_name} | ${row.id}`));
126
127 lastId = row.id;
128 });
129 })
130 .finally(() => client.shutdown());
131 });
132 });
133 });
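A hypothetical invocation of the above (the script file name and config path are placeholders; the host is one of the production Cassandra instances, and the positional arguments are the keyspace and table to check):

node schema-agreement-check.js -H restbase1010-a.eqiad.wmnet -c /etc/restbase/config.yaml enwiki_T_parsoid_stash_htmlmXxc_uDhgnFQAdM8PPlH5 data

It prints one line per node with the cfId that node reports for the table, and NOT FOUND for any node that does not yet know about it.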

Based on @Eevans' script, I made another one to create the new Parsoid tables and ran it on the production cluster. There were still quite a number of schema disagreement errors in spite of just one node being in charge of all of the creations.

Thanks @mobrovac

That's disappointing, but not totally surprising I guess; we've always had this type of concurrency, and while it's always been janky, this last batch seemed to exhibit a new level of badness. But maybe it was at least an improvement over last time: I just audited the keyspaces you created, and at least they're all there (and match).

Looking at Kibana, I see just 5 ConfigurationExceptions, all for the same ID, and all within a 2 minute period on the same codfw node (I guess this is the same "mismatch event"):

Most of the exceptions from that period are actually UnknownColumnFamilyExceptions, logged as WARNING, and are somewhat less awful. The former means that two nodes disagree on what the current version of the schema should be, and the latter means that the node in question doesn't know about the table (because for example it has not received the schema mutation yet).

When did you kick off the schema changes, and (approximately) how long did they take? The ConfigurationExceptions happen at ~09:23, and the UnknownColumnFamilyExceptions from ~10:55 to ~11:20. I see a deployment logged at 10:27. If the schema was created at ~9:30 and writes didn't start arriving until an hour later, then those UnknownColumnFamilyExceptions would seem to indicate that the schema hadn't yet fully propagated in that time.


I had been hoping to manually create this batch of keyspaces, doing them one at a time, in a very controlled fashion. For example: creating the keyspace, determining how long it took to completely propagate (or if it failed to make progress), what if any exceptions were logged, then moving to each of the contained tables and doing the same (which incidentally is why I wrote the above script).

There are more keyspaces to create though, so maybe we can do this for the next set.

Here is a trimmed down variation of your script that just outputs the DDL statements:

"use strict";


const crypto = require('crypto');
const colors = require('colors/safe');
const P = require('bluebird');
const yargs = require('yargs');

const green = colors.green;
const red = colors.red;
const yellow = colors.yellow;


function hashKey(key) {
    return new crypto.Hash('sha1')
        .update(key)
        .digest()
        .toString('base64')
        // Replace [+/] from base64 with _ (illegal in Cassandra)
        .replace(/[+/]/g, '_')
        // Remove base64 padding, has no entropy
        .replace(/=+$/, '');
}


function getValidPrefix(key) {
    const prefixMatch = /^[a-zA-Z0-9_]+/.exec(key);
    if (prefixMatch) {
        return prefixMatch[0];
    } else {
        return '';
    }
}


function makeValidKey(key, length) {
    const origKey = key;
    key = key.replace(/_/g, '__')
        .replace(/\./g, '_');
    if (!/^[a-zA-Z0-9_]+$/.test(key)) {
        // Create a new 28 char prefix
        const validPrefix = getValidPrefix(key).substr(0, length * 2 / 3);
        return validPrefix + hashKey(origKey).substr(0, length - validPrefix.length);
    } else if (key.length > length) {
        return key.substr(0, length * 2 / 3) + hashKey(origKey).substr(0, length / 3);
    } else {
        return key;
    }
}


function keyspaceName(name, table) {
    const reversedName = name.toLowerCase().split('.').reverse().join('.');
    const prefix = makeValidKey(reversedName, Math.max(26, 48 - table.length - 3));
    // 6 chars _hash_ to prevent conflicts between domains & table names
    const res = `${prefix}_T_${makeValidKey(table, 48 - prefix.length - 3)}`;
    return res;
}


function cassID(name) {
    if (/^[a-zA-Z0-9_]+$/.test(name)) {
        return `"${name}"`;
    } else {
        return `"${name.replace(/"/g, '""')}"`;
    }
}


const tables = {
    'parsoid_ng.data-parsoid': 'text',
    'parsoid_ng.html': 'blob',
    'parsoid_ng-render-timeline': '',
    'parsoid_ng-revision-timeline': '',
    'parsoid_ng.section-offsets': 'text'
};

const storages = [
    'enwiki',
    'commons',
    'wikipedia',
    'others'
];

const qKs = `CREATE KEYSPACE IF NOT EXISTS <keyspace> WITH replication = {'class': 'NetworkTopologyStrategy', 'codfw': '3', 'eqiad': '3'} AND durable_writes = true;`;

const qMeta = `CREATE TABLE IF NOT EXISTS <keyspace>.meta (
    key text PRIMARY KEY,
    value text
) WITH bloom_filter_fp_chance = 0.1
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';`;

const qData = `CREATE TABLE IF NOT EXISTS <keyspace>.data (
    "_domain" text,
    key text,
    rev int,
    tid timeuuid,
    "content-location" text,
    "content-type" text,
    tags set<text>,
    value <type>,
    PRIMARY KEY (("_domain", key), rev, tid)
) WITH CLUSTERING ORDER BY (rev DESC, tid DESC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '256', 'class': 'org.apache.cassandra.io.compress.DeflateCompressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 86400
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';`;

const qDataRevision = `CREATE TABLE IF NOT EXISTS <keyspace>.data (
    "_domain" text,
    key text,
    ts timestamp,
    rev int,
    PRIMARY KEY (("_domain", key), ts)
) WITH CLUSTERING ORDER BY (ts DESC)
    AND bloom_filter_fp_chance = 0.1
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 864000
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';`;

const qDataRender = `CREATE TABLE IF NOT EXISTS <keyspace>.data (
    "_domain" text,
    key text,
    rev int,
    ts timestamp,
    tid timeuuid,
    PRIMARY KEY (("_domain", key), rev, ts)
) WITH CLUSTERING ORDER BY (rev DESC, ts DESC)
    AND bloom_filter_fp_chance = 0.1
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 864000
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';`;


const argv = yargs.usage('Usage: $0 [options]')
    .options('h', {alias: 'help'})
    .argv;

if (argv.help) {
    yargs.showHelp();
    process.exit(0);
}

return P.each(storages, (storage) => {
    return P.each(Object.keys(tables), (table) => {
        const keyspace = cassID(keyspaceName(storage, table));
        console.log(`\n-- Keyspace: ${keyspace}`);
        console.log(qKs.replace('<keyspace>', keyspace));
        console.log(`\n-- Meta: ${keyspace}.meta`);
        console.log(qMeta.replace('<keyspace>', keyspace));
        console.log(`\n-- Data: ${keyspace}.data`);
        let dataCQL = qData;
        if(/render/.test(table)) {
            dataCQL = qDataRender;
        } else if(/revision/.test(table)) {
            dataCQL = qDataRevision;
        }
        console.log(dataCQL.replace('<keyspace>', keyspace).replace('<type>', tables[table]));
        console.log('\n-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~');
    });
});
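One way to apply the generated statements serially against a single coordinator is to redirect them to a file and feed that to cqlsh (the script file name, host, and credentials here are placeholders):

node parsoid-ng-ddl.js > parsoid-ng.cql
cqlsh --request-timeout=120 -u cassandra -p cassandra restbase1010-a.eqiad.wmnet -f parsoid-ng.cql

Since every statement uses IF NOT EXISTS, re-running the file is safe if a run is interrupted.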

Eevans renamed this task from "Serialize schema changes on a single node" to "Cassandra 3.11.0 schema creation seems unreliable". Oct 30 2017, 4:48 PM
Eevans raised the priority of this task from Medium to High.
Eevans updated the task description.
Eevans moved this task from Next to In-Progress on the User-Eevans board.

Mentioned in SAL (#wikimedia-operations) [2017-10-30T20:08:03Z] <urandom> T179083: Creating keyspace enwiki_T_parsoid_stash_htmlmXxc_uDhgnFQAdM8PPlH5

Mentioned in SAL (#wikimedia-operations) [2017-10-30T20:14:43Z] <urandom> T179083: Creating table enwiki_T_parsoid_stash_htmlmXxc_uDhgnFQAdM8PPlH5.meta

Mentioned in SAL (#wikimedia-operations) [2017-10-30T20:20:41Z] <urandom> T179083: Creating table enwiki_T_parsoid_stash_htmlmXxc_uDhgnFQAdM8PPlH5.data

Mentioned in SAL (#wikimedia-operations) [2017-10-30T20:25:41Z] <urandom> T179083: Creating keyspace enwiki_T_parsoid_stash_wikitextf0PBY8UXqY8UuiDv1

Mentioned in SAL (#wikimedia-operations) [2017-10-30T20:27:00Z] <urandom> T179083: Creating table enwiki_T_parsoid_stash_wikitextf0PBY8UXqY8UuiDv1.meta

Mentioned in SAL (#wikimedia-operations) [2017-10-30T20:28:16Z] <urandom> T179083: Creating table enwiki_T_parsoid_stash_wikitextf0PBY8UXqY8UuiDv1.data

Mentioned in SAL (#wikimedia-operations) [2017-10-30T20:33:43Z] <urandom> T179083: Creating enwiki_T_parsoid_stash_dataWH8IDUS9SGI6LPpsJsLOQ

Mentioned in SAL (#wikimedia-operations) [2017-10-30T20:34:34Z] <urandom> T179083: Creating enwiki_T_parsoid_stash_dataWH8IDUS9SGI6LPpsJsLOQ.meta

Mentioned in SAL (#wikimedia-operations) [2017-10-30T20:36:30Z] <urandom> T179083: Creating enwiki_T_parsoid_stash_dataWH8IDUS9SGI6LPpsJsLOQ.data

Mentioned in SAL (#wikimedia-operations) [2017-10-30T20:46:53Z] <urandom> T179083: Creating enwiki_T_parsoid_stash_section2ACMDK1DRzacK9nUB3.{data,meta} (15sec delay)

Mentioned in SAL (#wikimedia-operations) [2017-10-30T20:56:06Z] <urandom> T179083: Creating commons_T_parsoid_stash_htmlmXxc_uDhgnFQAdM8PPlH.{data,meta} (25 sec delay)

Mentioned in SAL (#wikimedia-operations) [2017-10-30T21:02:45Z] <urandom> T179083: Creating commons_T_parsoid_stash_wikitextf0PBY8UXqY8UuiDv.{meta,data} (30 sec delay)

Eevans added a comment (edited). Oct 30 2017, 10:07 PM

There are more keyspaces to create though, so maybe we can do this for the next set.

I did manually create the keyspaces and tables for Parsoid stashing (P6225).

One thing that stood out is a mismatch which occurred early (when I was creating the tables more aggressively), and which continued to be logged long after the schema mutation that triggered it. The IDs in question do not seem to correspond with any existing schema.

There also seems to be a fair number of requests (~1/5) that return an OperationTimedOut (even though the action succeeded):

OperationTimedOut: errors={'10.64.32.187': 'Request timed out while waiting for schema agreement. See Session.execute[_async](timeout) and Cluster.max_schema_agreement_wait.'}, last_host=10.64.32.187

Given the relatively small size of this cluster, it seems curious that agreement would take so long.

One thing that stood out is a mismatch which occurred early (when I was creating the tables more aggressively), and which continued to be logged long after the schema mutation that triggered it. The IDs in question do not seem to correspond with any existing schema.
There also seems to be a fair number of requests (~1/5) that return an OperationTimedOut (even though the action succeeded):

This is the effect that I observed as well.

Given the relatively small size of this cluster, it seems curious that agreement would take so long.

Yes, it is. I wonder what will happen when we need to create new keyspaces when we have the full Cassandra 3 cluster up and running. This definitely looks like a regression compared to Cassandra 2.

Given the relatively small size of this cluster, it seems curious that agreement would take so long.

Yes, it is. I wonder what will happen when we need to create new keyspaces when we have the full Cassandra 3 cluster up and running. This definitely looks like a regression compared to Cassandra 2.

It is also interesting to note that we do not observe such problems in Beta, where we have 2 nodes. Compared to prod, where we have 3 nodes per DC, it makes me wonder whether there are problems/regressions in the inter-DC Cassandra communication pattern.

Given the relatively small size of this cluster, it seems curious that agreement would take so long.

Yes, it is. I wonder what will happen when we need to create new keyspaces when we have the full Cassandra 3 cluster up and running. This definitely looks like a regression compared to Cassandra 2.

It is also interesting to note that we do not observe such problems in Beta, where we have 2 nodes. Compared to prod, where we have 3 nodes per DC, it makes me wonder whether there are problems/regressions in the inter-DC Cassandra communication pattern.

I suspect it's more a function of just the number of nodes; we have 2 nodes in deployment-prep, 6 in dev (2 instances x 3 hosts), and 18 in production (3 instances x 6 hosts). And FWIW, we have had some issues with schema creation in dev (not to this degree, but more so than in deployment-prep).

Eevans added a comment. Nov 2 2017, 1:26 PM

Looking at Kibana, I see just 5 ConfigurationExceptions, all for the same ID, and all within a 2 minute period on the same codfw node (I guess this is the same "mismatch event"):

For posterity's sake, this is the global schema ID (the schema_version attribute of the system.{local,peers} tables). The previous script I posted (P6208) was looking at the cfId (from system_schema.tables), which is not the same. New code that evaluates agreement based on the global schema ID can be found on Github.

Most of the exceptions from that period are actually UnknownColumnFamilyExceptions, logged as WARNING, and are somewhat less awful. The former means that two nodes disagree on what the current version of the schema should be, and the latter means that the node in question doesn't know about the table (because for example it has not received the schema mutation yet).

On the other hand, this is a cfId from system_schema.tables, and these seem to be persisting well past any schema changes. For example, over the past hour:

Drilling down, these all seem to be for the same cfId (d868f1d0-9e2e-11e7-8750-73770f42474d), and the exception is triggered by a read.

Which, as near as I can tell, does not exist anywhere.

Eevans added a comment. Nov 3 2017, 2:39 PM

Ok, a bit more info:

All of the thrown UnconfiguredColumnFamilyExceptions in the last week are for cfIds d868f1d0-9e2e-11e7-8750-73770f42474d and df90c320-9e2e-11e7-a592-958801a49e5b. Neither of these is "known" by any of the nodes (in the sense that they are not referenced in any of the system_schema.tables tables). These cfIds are present however in the names of the data file directories (data file directories are formatted as {keyspace}/{table}-{cfId}), and from this we can infer that the table in question (in both cases) is commons_T_mobile__ngR6XB1sh6_FFo_mfX4oZA56vpD_w.meta.

This explains the source of the reads, as the meta tables are consulted whenever a RESTBase process starts up.

Given this was one of the first tables created, I believe this predates our more recent attempts at avoiding concurrency, and that it was the result of concurrent CREATEs for the same table.

Restarting Cassandra should clear the corrupt in-memory state. I had been operating under the recollection that we had since done restarts, but a quick check confirms that the last rolling restart does predate the creation of commons_T_mobile__ngR6XB1sh6_FFo_mfX4oZA56vpD_w.meta.

$ cdsh -c restbase-ng -- 'for i in `c-ls`; do echo -n "$i "; systemctl status cassandra-$i |grep Active; done'
restbase1010.eqiad.wmnet: a    Active: active (running) since Wed 2017-09-20 15:22:15 UTC; 1 months 13 days ago
restbase1010.eqiad.wmnet: b    Active: active (running) since Wed 2017-09-20 16:02:08 UTC; 1 months 13 days ago
restbase1010.eqiad.wmnet: c    Active: active (running) since Wed 2017-09-20 16:04:05 UTC; 1 months 13 days ago
restbase1008.eqiad.wmnet: a    Active: active (running) since Wed 2017-09-20 15:25:16 UTC; 1 months 13 days ago
restbase1008.eqiad.wmnet: b    Active: active (running) since Wed 2017-09-20 16:06:00 UTC; 1 months 13 days ago
restbase1008.eqiad.wmnet: c    Active: active (running) since Wed 2017-09-20 16:07:38 UTC; 1 months 13 days ago
restbase1009.eqiad.wmnet: a    Active: active (running) since Tue 2017-10-03 19:18:33 UTC; 1 months 0 days ago
restbase1009.eqiad.wmnet: b    Active: active (running) since Wed 2017-09-20 16:09:30 UTC; 1 months 13 days ago
restbase1009.eqiad.wmnet: c    Active: active (running) since Wed 2017-09-20 16:11:03 UTC; 1 months 13 days ago
restbase2003.codfw.wmnet: a    Active: active (running) since Thu 2017-10-26 18:54:22 UTC; 1 weeks 0 days ago
restbase2003.codfw.wmnet: b    Active: active (running) since Wed 2017-09-20 16:17:42 UTC; 1 months 13 days ago
restbase2003.codfw.wmnet: c    Active: active (running) since Wed 2017-09-20 16:20:20 UTC; 1 months 13 days ago
restbase2001.codfw.wmnet: a    Active: active (running) since Wed 2017-09-20 15:32:50 UTC; 1 months 13 days ago
restbase2001.codfw.wmnet: b    Active: active (running) since Thu 2017-10-26 16:17:42 UTC; 1 weeks 0 days ago
restbase2001.codfw.wmnet: c    Active: active (running) since Wed 2017-10-25 21:28:33 UTC; 1 weeks 1 days ago
restbase2005.codfw.wmnet: a    Active: active (running) since Wed 2017-09-20 15:34:24 UTC; 1 months 13 days ago
restbase2005.codfw.wmnet: b    Active: active (running) since Wed 2017-09-20 16:23:12 UTC; 1 months 13 days ago
restbase2005.codfw.wmnet: c    Active: active (running) since Wed 2017-09-20 16:25:32 UTC; 1 months 13 days ago

Some care should be taken when we do though, because writes that occurred on a coordinator with a corrupt schema state will be lost after the restart. I don't think that will be a problem for us: the meta tables are written exactly once after creation (unless there are RESTBase-initiated schema migrations), so all sides of the split brain should match, but doing a non-normative ALTER, or issuing a repair immediately after, would probably be a good idea.

Mentioned in SAL (#wikimedia-operations) [2017-11-06T17:48:31Z] <urandom> T179083: Restarting Cassandra instances on restbase2001.codfw.wmnet

Today we've had a semi-outage because of this.

The read latency for the "commons_T_mobile__ngR6XB1sh6_FFo_mfX4oZA56vpD_w" keyspace went through the roof on restbase1010-a, so worker restarts on restbase1013 and restbase1015 (where apparently that node was acting as coordinator) could never finish, because reads from the meta table for this keyspace were timing out. After some time all the workers got affected, so RESTBase became unresponsive on the 1013 and 1015 hosts.

In the meantime cassandra-a on restbase1010 was logging Handshaking version with restbase1009-c.eqiad.wmnet like crazy while restbase1009-c was logging the UnconfiguredColumnFamilyExceptions for df90c320-9e2e-11e7-a592-958801a49e5b.

Restarting just restbase1010-a didn't help, however restarting restbase1009-c fixed the issue.

We should do a proper rolling restart of all Cassandra 3 instances ASAP.

Eevans added a comment. Nov 8 2017, 3:16 PM

Today we've had a semi-outage because of this.

The read latency for the "commons_T_mobile__ngR6XB1sh6_FFo_mfX4oZA56vpD_w" keyspace went through the roof on restbase1010-a, so worker restarts on restbase1013 and restbase1015 (where apparently that node was acting as coordinator) could never finish, because reads from the meta table for this keyspace were timing out. After some time all the workers got affected, so RESTBase became unresponsive on the 1013 and 1015 hosts.

In the meantime cassandra-a on restbase1010 was logging Handshaking version with restbase1009-c.eqiad.wmnet like crazy while restbase1009-c was logging the UnconfiguredColumnFamilyExceptions for df90c320-9e2e-11e7-a592-958801a49e5b.

Restarting just restbase1010-a didn't help, however restarting restbase1009-c fixed the issue.

We should do a proper rolling restart of all Cassandra 3 instances ASAP.

Hrmm, in an earlier comment I wrote:

[ ... ]

Some care should be taken when we do though, because writes that occurred on a coordinator with a corrupt schema state will be lost after the restart. I don't think that will be a problem for us: the meta tables are written exactly once after creation (unless there are RESTBase-initiated schema migrations), so all sides of the split brain should match, but doing a non-normative ALTER, or issuing a repair immediately after, would probably be a good idea.

After writing this I think I became less and less convinced this would actually be a problem, and so dropped any notion that an ALTER or repair would be necessary. Perhaps I was correct the first time.

Eevans added a comment. Nov 8 2017, 3:30 PM

[ ... ]

We should do a proper rolling restart of all Cassandra 3 instances ASAP.

I propose to do this:

  1. For each rack, restart the instances in that rack (there is presently only one node per rack)
  2. Issue a minor ALTER of "commons_T_mobile__ngR6XB1sh6_FFo_mfX4oZA56vpD_w".meta to force schema migrations out to all nodes (e.g. ALTER...WITH comment = 'foo')
  3. Wait for agreement
  4. Goto 1

We could alternatively just verify the existence of "commons_T_mobile__ngR6XB1sh6_FFo_mfX4oZA56vpD_w".meta afterward, and be prepared to perform the above steps.

Eevans added a comment. Nov 8 2017, 6:16 PM

All instances have been restarted

Eevans added a comment. Nov 8 2017, 8:21 PM

Schemas for restrictions (T179421) and mathoid (T179419) have been created using the mkschema script, and no exceptions were logged.

Eevans lowered the priority of this task from High to Medium. Nov 8 2017, 8:22 PM
Eevans moved this task from In-Progress to Next on the User-Eevans board. Nov 17 2017, 4:18 PM
Eevans moved this task from Next to In-Progress on the User-Eevans board. Nov 21 2017, 5:01 PM
Eevans updated the task description. Nov 21 2017, 7:25 PM
Eevans renamed this task from "Cassandra 3.11.0 schema creation seems unreliable" to "Cassandra schema creation seems unreliable". Nov 21 2017, 7:59 PM
Eevans updated the task description.
Eevans updated the task description. Nov 21 2017, 8:26 PM
Eevans updated the task description. Nov 21 2017, 8:47 PM
Eevans updated the task description. Nov 21 2017, 8:58 PM
Eevans moved this task from In-Progress to Blocked on the User-Eevans board. Nov 30 2017, 4:11 PM
Eevans updated the task description. Jul 3 2018, 9:18 PM
Eevans updated the task description.