Page MenuHomePhabricator

Consider SSTable bulk loading for AQS imports
Closed, ResolvedPublic8 Story Points

Description

The current process for importing AQS data individually inserts records using CQL prepared statements. Judging from the dashboards, this process takes hours, increases cluster utilization, and impacts latency. It might be worth considering the use of SSTable bulk loading, instead.

In a nutshell, you would:

  1. Locally generate SSTable files
  2. Stream the data to cluster nodes

(1) is actually quite straightforward to do using the CQLSSTableWriter class:

// Define a matching schema.
String schema = "CREATE TABLE foo (id int PRIMARY KEY, name text)";
// Define a matching insert statement.
String insert = "INSERT INTO foo (id, name) VALUES (?, ?)";

// Construct a writer using the schema and insert statement
CQLSSTableWriter writer = CQLSSTableWriter.builder()
                                          .inDirectory("/path/to/output/directory")
                                          .forTable(schema)
                                          .using(insert).build();

// Write your records.
writer.addRow(0, "test1");
writer.addRow(1, "test2");
writer.addRow(2, "test3");
...

// Finish the file.
writer.close();

(2) is performed using the sstableloader utility (ships with Cassandra).

$ sstableloader -d aqs1001.eqiad.wmnet /path/to/output/directory

The -d argument specifies a contact node, which is used to suss out the cluster topology. It will send the data using Cassandra's own streaming mechanism, which is quite efficient; This should be an enormous win over individual inserts.

SSTable-to-node affinity is not required, sstableloader will only stream the relevant parts of the files to the respective nodes. If needed, this can be parallelized by simply running more sstableloader processes.

I suspect this would also reduce the post-import compaction load as well.

Let me know if you're interested in doing this, and how I can be of help!

Event Timeline

Eevans created this task.Feb 8 2016, 5:47 PM
Eevans raised the priority of this task from to Needs Triage.
Eevans updated the task description. (Show Details)
Eevans added a project: Analytics.
Eevans added a subscriber: Eevans.
Restricted Application added subscribers: StudiesWorld, Aklapper. · View Herald TranscriptFeb 8 2016, 5:47 PM
Eevans set Security to None.Feb 8 2016, 6:11 PM
Eevans added a subscriber: JAllemandou.
Johsthao closed this task as a duplicate of T126250: <spam>.Feb 8 2016, 6:24 PM
matmarex reopened this task as Open.Feb 8 2016, 6:32 PM
Milimetric triaged this task as Normal priority.Feb 11 2016, 6:12 PM
Milimetric moved this task from Incoming to Analytics Query Service on the Analytics board.
Milimetric moved this task from Backlog (Later) to Dashiki on the Analytics board.Jun 2 2016, 5:05 PM
elukey added a subscriber: elukey.Jun 20 2016, 3:18 PM
JAllemandou edited projects, added Analytics-Kanban; removed Analytics.
JAllemandou set the point value for this task to 8.
JAllemandou moved this task from Next Up to In Progress on the Analytics-Kanban board.
JAllemandou moved this task from Paused to Done on the Analytics-Kanban board.Jul 18 2016, 4:35 PM

A last try has been done with sorting data to build almost sorted SSTables, but with no success.
Cassandra sorting strategy is very difficult to replicate, and I didn't manage to have it working.

Nuria closed this task as Resolved.Jul 19 2016, 7:20 PM