The current process for importing AQS data inserts records individually, using CQL prepared statements. Judging from the dashboards, this process takes hours, increases cluster utilization, and impacts latency. It might be worth considering SSTable bulk loading instead.
In a nutshell, you would:
1. Locally generate SSTable files
2. Stream the data to cluster nodes
Step (1) is actually quite straightforward using the [[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/CQLSSTableWriter.java#L48|CQLSSTableWriter]] class:
```
// Define a matching schema.
String schema = "CREATE TABLE foo (id int PRIMARY KEY, name text)";

// Define a matching insert statement.
String insert = "INSERT INTO foo (id, name) VALUES (?, ?)";

// Construct a writer using the schema and insert statement.
CQLSSTableWriter writer = CQLSSTableWriter.builder()
        .inDirectory("/path/to/output/directory")
        .forTable(schema)
        .using(insert)
        .build();

// Write your records.
writer.addRow(0, "test1");
writer.addRow(1, "test2");
writer.addRow(2, "test3");
...

// Finish the file.
writer.close();
```
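To make the wiring concrete, here is a minimal, self-contained sketch of a generator that reads from a CSV dump. The specifics are assumptions for illustration: the `demo` keyspace, the `FooSSTableGenerator` class name, the CSV input format, and the output path. One real detail worth knowing is that sstableloader derives the target keyspace and table from the last two components of the SSTable directory path, so it pays to generate into a matching .../keyspace/table directory.

```
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.cassandra.io.sstable.CQLSSTableWriter;

public class FooSSTableGenerator
{
    public static void main(String[] args) throws IOException
    {
        // Keyspace-qualified schema and insert; "demo" is a placeholder keyspace.
        String schema = "CREATE TABLE demo.foo (id int PRIMARY KEY, name text)";
        String insert = "INSERT INTO demo.foo (id, name) VALUES (?, ?)";

        // sstableloader expects an .../keyspace/table directory layout, hence
        // the demo/foo suffix; the writer does not create the directory itself.
        Path outDir = Paths.get("/path/to/output/demo/foo");
        Files.createDirectories(outDir);

        try (CQLSSTableWriter writer = CQLSSTableWriter.builder()
                 .inDirectory(outDir.toFile())
                 .forTable(schema)
                 .using(insert)
                 .build();
             // args[0]: a CSV of "id,name" records (a stand-in for the real dump format).
             BufferedReader in = Files.newBufferedReader(Paths.get(args[0])))
        {
            String line;
            while ((line = in.readLine()) != null)
            {
                String[] cols = line.split(",", 2);
                writer.addRow(Integer.parseInt(cols[0]), cols[1]);
            }
        }
    }
}
```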
Step (2) is performed using the sstableloader utility, which ships with Cassandra.
```
$ sstableloader -d aqs1001.eqiad.wmnet /path/to/output/directory
```
The -d argument specifies a contact node, which is used to suss out the cluster topology. The data is then sent using Cassandra's own streaming mechanism, which is quite efficient; this should be an enormous win over individual inserts.
SSTable-to-node affinity is not required; sstableloader streams only the relevant portions of each file to the respective nodes. If needed, the load can be parallelized by simply running more sstableloader processes, as sketched below.
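For example, assuming the SSTables were generated into separate directories (the output-N paths here are hypothetical), something along these lines should work:

```
$ sstableloader -d aqs1001.eqiad.wmnet /path/to/output-0 &
$ sstableloader -d aqs1001.eqiad.wmnet /path/to/output-1 &
$ wait
```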
I suspect this would also reduce the post-import compaction load.
Let me know if you're interested in doing this, and how I can be of help!