The current process for importing AQS data [[https://github.com/wikimedia/analytics-refinery-source/tree/master/refinery-cassandra/src/main/java/org/wikimedia/analytics/refinery/cassandra|inserts records individually using CQL prepared statements]]. Judging from the [[https://grafana.wikimedia.org/dashboard/db/aqs-cassandra-system?from=1454889569205&to=1454940331795&var-node=All|dashboards]], this process takes hours, increases cluster utilization, and impacts latency. It might be worth considering SSTable bulk loading instead.
In a nutshell, you would:
1. Locally generate SSTable files
2. Stream the data to cluster nodes
(1) is actually quite straightforward to do using the [[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/CQLSSTableWriter.java#L48|`CQLSSTableWriter`]] class:
```lang=java
import org.apache.cassandra.io.sstable.CQLSSTableWriter;

// Define a matching schema. The table name must be keyspace-qualified;
// "ks" is a placeholder keyspace name.
String schema = "CREATE TABLE ks.foo (id int PRIMARY KEY, name text)";
// Define a matching insert statement.
String insert = "INSERT INTO ks.foo (id, name) VALUES (?, ?)";
// Construct a writer using the schema and insert statement.
// The output directory must already exist.
CQLSSTableWriter writer = CQLSSTableWriter.builder()
    .inDirectory("/path/to/output/directory")
    .forTable(schema)
    .using(insert)
    .build();
// Write your records.
writer.addRow(0, "test1");
writer.addRow(1, "test2");
writer.addRow(2, "test3");
// ... (remaining rows)
// Finish the file.
writer.close();
```
(2) is performed using the `sstableloader` utility, which ships with Cassandra:
```
$ sstableloader -d aqs1001.eqiad.wmnet /path/to/output/directory
```
The `-d` argument specifies a contact node, which is used to suss out the cluster topology. `sstableloader` then sends the data using Cassandra's own streaming mechanism, which is quite efficient; this should be an enormous win over individual inserts. Note that `sstableloader` infers the keyspace and table names from the last two components of the directory path, so the output directory should end in `<keyspace>/<table>`.
SSTable-to-node affinity is not required; `sstableloader` streams only the relevant parts of each file to the respective nodes. If needed, this can be parallelized by simply running more `sstableloader` processes.
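To make the parallel case concrete, here is a rough sketch (not something that exists in refinery today) of launching one `sstableloader` process per output directory from Java and waiting for all of them. The class and method names are made up for illustration, it assumes `sstableloader` is on the `PATH`, and it assumes the writer output has already been split across several directories; only the `-d` flag shown above is used.

```lang=java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; the names here are illustrative only.
final class ParallelSSTableLoad {

    // Build the argument list for one run: sstableloader -d <contactNode> <dir>
    static List<String> buildCommand(String contactNode, String sstableDir) {
        return Arrays.asList("sstableloader", "-d", contactNode, sstableDir);
    }

    // Start one sstableloader process per directory, then wait for all of them.
    // Returns the number of processes that exited with a non-zero status.
    static int loadAll(String contactNode, List<String> sstableDirs)
            throws IOException, InterruptedException {
        List<Process> running = new ArrayList<>();
        for (String dir : sstableDirs) {
            running.add(new ProcessBuilder(buildCommand(contactNode, dir))
                    .inheritIO() // pass loader progress output through to our console
                    .start());
        }
        int failures = 0;
        for (Process p : running) {
            if (p.waitFor() != 0) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: ParallelSSTableLoad <contact-node> <dir> [<dir> ...]");
            System.exit(2);
        }
        List<String> dirs = Arrays.asList(args).subList(1, args.length);
        int failures = loadAll(args[0], dirs);
        System.exit(failures == 0 ? 0 : 1);
    }
}
```

This starts every loader at once, which is fine for a handful of directories; for many directories you would probably want to cap the number of concurrent processes so the streaming itself does not swamp the cluster.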
I suspect this would also reduce the post-import compaction load.
Let me know if you're interested in doing this, and how I can be of help!