# Schema v. Configuration
Cassandra's DDL covers what is traditionally considered //schema//, but also information that is more configuration in nature. For example, consider keyspace creation:
```lang=sql,name="keyspace"
CREATE KEYSPACE "globaldomain_T_mathoid__ng_mml" WITH replication = {'class': 'NetworkTopologyStrategy', 'codfw': '3', 'eqiad': '3'} AND durable_writes = true;
-- \___________________________________________/ \________________________________________________________________________________________________________/
-- | |
-- schema configuration
```
A keyspace in Cassandra is a namespace to associate tables with (similar to a database in MySQL terminology). Here, //globaldomain_T_mathoid__ng_mml// is the keyspace, and everything that follows the `WITH` is configuration pertaining to associated tables (replication, or whether or not to make use of the commitlog).
It is similar with tables:
```lang=sql,name=table
-- Schema ~~~~~~~~~~~~~~~
CREATE TABLE "globaldomain_T_mathoid__ng_mml".data (
"_domain" text,
key text,
headers text,
tid timeuuid,
value text,
PRIMARY KEY (("_domain", key))
-- Configuration ~~~~~~~~
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '32', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 86400
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
```
Likewise, the DDL describes both schema, and table-specific configuration. In the example above, //"globaldomain_T_mathoid__ng_mml".data// is the table name, followed within the parenthesis by the names and types of the attributes. This is schema, as it models the data to be stored there. Everything that follows the `WITH` however, is configuration.
This is an important distinction (schema v configuration), because schema is determined by the application; No change in schema makes sense without a corresponding change to the application. Configuration however is site-specific, and operational in nature; Parameters can be unique to a use-case, and updated frequently outside of any change to the application. !!Schema is determined by application developers, configuration by users/operators.!!
(WARNING) Unfortunately, it is colloquial to refer to the entire DDL for a keyspace and/or table as //schema//. However, every effort is made throughput this ticket to use the term //schema// to mean only that which determines the data model, and //configuration// to refer to the corresponding operational settings.
# Proposal
## Schema Management
Since schema is tightly coupled to the application, it makes sense that it be kept versioned with the application code, where it can be changed in lock-step.
Ideally, we'd omit all configuration data, and rely on post-creation `ALTER`s to update defaults, but `CREATE KEYPACE` requires us to supply replication parameters. In this case all we can do is provide the minimum information (and document for third-parties the expectation that they followup).
```lang=sql,name=schema.cql
-- Use ALTER after keyspace creation to update replication according to your site requirements.
CREATE KEYSPACE data
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
-- NOTE: Use ALTER after table creation to update default table properties, if necessary.
CREATE TABLE data.values (
username text PRIMARY KEY,
given text,
surname text
);
```
A complete schema is only useful for initial setup; It may also be necessary to ship the DDL needed to upgrade schema between releases.
```lang=sql,name=1.0-to-1.1.cql
ALTER TABLE data.values ADD email text;
```
## Configuration Management
A YAML-formatted file will be used to define the configuration properties for a cluster. The file will be maintained in version control (Puppet?), and synchronized to each of the corresponding cluster nodes.
```lang=YAML,name=table_properties.yaml
keyspaces:
- name: keyspace1
replication:
class: NetworkTopologyStrategy
codfw: 2
eqiad: 2
durable_writes: true
tables:
- name: data
compaction:
class: SizeTieredCompactionStrategy
min_threshold: 4
max_threshold: 32
speculative_retry: 99PERCENTILE
gc_grace_seconds: 864000
- name: data1
compaction:
class: SizeTieredCompactionStrategy
min_threshold: 4
max_threshold: 32
speculative_retry: 99PERCENTILE
gc_grace_seconds: 864000
```
A utility installed on the Cassandra nodes will accept the configuration file as an argument, connect to the corresponding cluster, introspect the current configuration, and output (stdout) the necessary `ALTER` statements to match the cluster to the configuration.
A database operator will be able to log into any node, review the output, and when ready, apply it using `cqlsh` (i.e. `table_properties /etc/cassandra-a/table_properties.yaml | cqlsh`).
Some additional requirements of this utility:
* Should be packaged for Debian, and installed as part of the Cassandra role (Puppet)
* Must have a mode that outputs `ALTER` statements for the entire configuration, regardless of current cluster state
* Must output at most one `ALTER` per table, even if multiple properties are to be set (ala `AND`)
* Design should not preclude being expanded to manage schema as well as configuration at a later date
### Proof-of-concept
A working prototype for the above can be [[ https://github.com/eevans/cassandra-table-properties/tree/breakfixes | found on Github ]].
To get started:
```lang=shell-session
$ pip install -r requirements.txt
$ ./table-properties --help
usage: table-properties [-h] [-c <ip>] [-C <filename>] [-d <filename>] [-f] [-k <filename>] [-l <filename>]
[-p <port #>] [-o <protocol version>] [-P <password>] [-s] [-r <filename>] [-t] [-u <user name>]
[-v]
[<filename>]
Compare actual Cassandra keyspace and table properties to desired properties defined in a YAML file and create ALTER KEYSPACE and ALTER TABLE statements for properties that differ.
positional arguments:
<filename> Desired configuration YAML file
optional arguments:
-h, --help show this help message and exit
-c <ip>, --contactpoint <ip>
Host IP address or name.Default: localhost
-C <filename>, --clientcert <filename>
Client cert file name.
-d <filename>, --dump <filename>
Dump current configuration settings to file
-f, --force Overwrite dump file if it exists.
-k <filename>, --clientkey <filename>
Client key file name.
-l <filename>, --log <filename>
Log file name. Default: tp_YYYYMMDD-HHMMSS.log
-p <port #>, --port <port #>
Port number. Default: 9042
-o <protocol version>, --protocolversion <protocol version>
Cassandra driver protocol version (1-5).Default: 2
-P <password>, --password <password>
Password for plain text authentication.
-s, --skiprc Ignore existing cqlshrc file. Default: true
-r <filename>, --rcfile <filename>
cqlrc file name. Default: ~/.cassandra/cqlshrc
-t, --tls Use TLS encryption for client server communication.
-u <user name>, --username <user name>
User name for plain text authentication.
-v, --version show program's version number and exit
```
The current configuration can be dumped to create the initial config. In production, this would be checked into version control as part of operations/puppet, with each unique config corresponding to a Cassandra cluster (role-based?).
```lang=shell-session
$ table-properties -d tables.yaml
```
Updates committed to configs in operations/puppet would be copied into the instance configuration directories of the Cassandra nodes. An operator would then log into one node, and generate the corresponding `ALTER` statement output.
```lang=shell-session
$ table-properties /etc/cassandra-a/sessionstore_table_properties.yaml
USE "kask";
ALTER TABLE values
WITH caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'};
```
Assuming that output is valid, it could be applied using `cqlshrc`.
```lang=shell-session
$ table-properties /etc/cassandra-a/sessionstore_table_properties.yaml | cqlsh
```