
Use airflow to load cassandra
Closed, Duplicate · Public

Description

Open question (probably for @mforns): should we use an Airflow Operator (probably subclassing Spark or SparkSQL), or a DAG factory?

There are many jobs (16) to implement; some parameters can be defaulted:

  • HQL year, month, day, hour
  • cassandra host, port, user, password

others need to be defined:

  • cassandra keyspace and table
  • other HQL parameters
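Whichever shape the abstraction takes, the split above suggests merging per-job required settings over shared defaults. As a rough sketch (all names, defaults, and the helper itself are hypothetical, not an existing Airflow API), a plain Python helper could build the configuration that an operator or factory would pass along:

```python
# Sketch only: names and default values are illustrative placeholders.

DEFAULTS = {
    # Defaulted for all 16 jobs.
    "cassandra_host": "cassandra.example.internal",
    "cassandra_port": 9042,
    "cassandra_user": "aqsloader",
    "cassandra_password_file": "/etc/secrets/cassandra.pw",
}

def build_load_conf(keyspace, table, hql_params, **overrides):
    """Merge shared Cassandra defaults with the per-job required settings.

    keyspace/table and the HQL parameters must be given per job; anything
    in DEFAULTS can be overridden via keyword arguments.
    """
    conf = dict(DEFAULTS)
    conf.update(overrides)
    conf["keyspace"] = keyspace
    conf["table"] = table
    conf["hql_params"] = dict(hql_params)  # e.g. year/month/day/hour
    return conf
```

A job would then only state what differs, e.g. `build_load_conf("aqs", "data", {"year": 2021, "month": 6}, cassandra_port=9043)`.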

Event Timeline

@JAllemandou
Thanks for working on this!

I think an Operator will be enough, no?
Or is there more than one action to do?
IIUC, the only thing we need to do is run the Spark-Scala code that reads a query and loads to Cassandra, right?
If so, an Operator will be the best fit.

If there are more actions to do *always* (like writing a success flag or doing some data quality checks, etc.), then we can make a factory,
but it would be a TaskGroup factory instead of a DAG factory.
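The TaskGroup-factory idea can be sketched with Airflow stripped out so the shape is visible (the function, signature, and step names are all hypothetical): the factory returns the ordered steps every job would share, which is only worth building if there is more than one step.

```python
# Airflow-free sketch of a TaskGroup factory: the real version would create
# Airflow tasks inside a TaskGroup; here each step is just a named callable.

def cassandra_load_group(job_name, run_load,
                         quality_check=None, write_success_flag=None):
    """Return the ordered steps shared by every Cassandra-loading job.

    run_load is mandatory; the optional steps model the "more actions to
    do *always*" case that would justify a factory over a single Operator.
    """
    steps = [(f"{job_name}.load", run_load)]
    if quality_check is not None:
        steps.append((f"{job_name}.quality_check", quality_check))
    if write_success_flag is not None:
        steps.append((f"{job_name}.success_flag", write_success_flag))
    return steps
```

If every job ends up as the single `load` step, the factory collapses to a one-element group, which is the argument for a plain Operator instead.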

Hey @mforns - Thank you for your comment :)
For most jobs it's a single action - my question was more about the need for a dedicated operator versus reusing the existing Spark one, as it'll only be a matter of a few parameters' difference.

After a great talk with @Antoine_Quhen, a wider discussion needs to happen: Spark3 offers the possibility to write to Cassandra through SQL-like queries (see https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md).

So now the question is whether we prefer to go the spark3 road or the one we had in mind using spark2.4:

  • Spark3: we need to test the loading and wait for spark3 productionization (work to be done for reading password from file)
  • Spark2.4: we can move now (work in progress about using password from file)

One of the benefits of going the Spark3 road is the simplicity of the loading jobs: only an HQL query is needed (with the correct configuration), instead of a Spark job in addition to the HQL query.

I don't mind waiting for Spark3, but let's see what the team thinks!
On the other hand, if we write a CassandraLoadOperator() that uses the existing Cassandra loading Spark job now,
and migrate the jobs using it, we can always change to use the Spark3 SQL loader by just changing the inner workings of the CassandraLoadOperator, no?
There might be some adjustments needed, but I imagine the operators interface will be very similar, and we'll barely have to change the migrated DAGs.
Some re-testing will be needed though... that's the part where we'll lose some time.
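The "change only the inner workings" argument can be sketched in plain Python (this is not the real Airflow operator; the class, its arguments, and the backend strings are hypothetical): the constructor signature the DAGs depend on stays fixed while execute() delegates to whichever backend is current.

```python
# Sketch of the "stable interface, swappable internals" idea.

class CassandraLoadOperator:
    def __init__(self, keyspace, table, hql, backend="spark2"):
        self.keyspace = keyspace
        self.table = table
        self.hql = hql
        self.backend = backend

    def execute(self):
        # DAGs only ever see the constructor signature above; switching
        # the backend later would not require touching the migrated DAGs.
        if self.backend == "spark2":
            # Would spark-submit the existing Spark 2.4 Cassandra loader.
            return f"spark2-load {self.keyspace}.{self.table}"
        # Would run the Spark3 SQL INSERT through the Cassandra connector.
        return f"spark3-sql-load {self.keyspace}.{self.table}"
```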

Actually there would be some difference, as using Spark3 would put the related HQL queries in the form:

INSERT INTO aqs.local_group_default_T_pageviews_per_project_v2.data SELECT ...

instead of just the SELECT part. The SELECT part itself would stay the same, though...
The concern with waiting (I forgot to mention above) is that it prevents us from decommissioning the old AQS nodes.
If we decide to wait for spark3, we should update the oozie jobs.
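Since the rewrite from the Spark 2.4 form to the Spark3 form is mechanical, the existing SELECT queries look reusable. A sketch (the catalog-qualified name follows the example above; the helper itself is hypothetical):

```python
def to_spark3_insert(catalog, keyspace, table, select_sql):
    """Wrap an existing SELECT query in the INSERT INTO form that the
    Spark3 Cassandra connector catalog expects (catalog.keyspace.table)."""
    return f"INSERT INTO {catalog}.{keyspace}.{table} {select_sql}"
```

So migrating a query would amount to prefixing it, e.g. `to_spark3_insert("aqs", "local_group_default_T_pageviews_per_project_v2", "data", select_sql)`.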

I'd also come down on the side of moving forward to spark3, despite the extra time until we can decommission the aqs nodes.

Having spoken to some other members of SRE, they're keen to decommission aqs100* because those hosts are running Stretch. At the same time, as long as we have a plan and have mitigated any security issues (both of which we have), we're free to make the pragmatic choice until we're ready.

We have a duplicate of this task currently used to track the migration. @BTullis do you mind if I merge this one onto the other one?

I wouldn't mind even a little bit. Thanks.

Specifically, security support for the Debian OS version running on the old aqs* hosts ended last Friday. Since other roles are also not yet done migrating away, SRE IF will continue to backport security fixes for this quarter (or until all Stretch hosts are gone, if that takes less than three months), but that's (a) quite time-consuming and (b) time we should rather spend on shiny new things, so there's a hard limit of three months. If the Spark3 work isn't done by then, we'd need to upgrade the old AQS cluster to Buster by the end of September.

Thank you for the heads-up @MoritzMuehlenhoff. The migration should be done in a matter of weeks, so I'm confident we'll be done by September.

Checking in on this, given that there's one month left until Stretch servers need to be gone, what's the status here?

Thanks for checking @MoritzMuehlenhoff.
We are well along in the process. We were planning on having this done by the end of this month, but we are running late; I now expect this to be done in two weeks.

Thanks for the update.