Page MenuHomePhabricator

Create Airflow job to populate a new Cassandra table with latest MediaWiki history snapshot configuration
Open, MediumPublic8 Estimated Story Points

Description

As described in T355536, we want to automatize the update of the MediaWiki history snapshot in Druid.

For that, we need Airflow and AQS to communicate.
The medium of choice is a Cassandra table, since we already have tools in place to load to Cassandra from Airflow, and to read Cassandra from AQS.
Druid was also a similar option, however the key-value nature of Cassandra fits the purpose better than the cube-aggregation nature of Druid.

This task is about defining a good schema for the Cassandra table together with Data Persistence team,
and creating an Airflow DAG to load the latest MediaWiki reduced snapshot to the mentioned table.

NOTE: We've chosen to temporarily use Cassandra as a config store for AQS as a convenient interim solution until the Dataset Config System T354557 is available.

Event Timeline

Change 989558 had a related patch set uploaded (by Mforns; author: Mforns):

[analytics/refinery@master] Add query to load MediaWiki snapshot to Cassandra AQS config table

https://gerrit.wikimedia.org/r/989558

mforns renamed this task from AQS 2.0 Automate config change to druid mediawiki history reduced snapshot to Create Airflow job to populate a new Cassandra table with latest MediaWiki history snapshot configuration.Jan 22 2024, 3:23 PM
mforns updated the task description. (Show Details)
mforns updated the task description. (Show Details)

From r989558:

Note to @Eevans: I assumed the Cassandra table is going to be named aqs.local_group_default_T_aqs_config.data, but I just did that because other Data Engineering tables were formatted like that. I saw some image_suggestions tables had another approach, but didn't know what to chose.

The local_group_{id}_T_{table_name}.{data,meta} convention came from RESTBase (which the AQS v1 services are based on), which attempted to create its own table storage interface atop Cassandra. We're stuck with this for the extant datasets, but if we were starting from scratch today, I'd have suggested that all of the aqs tables be grouped under a single keyspace (aqs seeming like the most obvious choice), with each of the aqs datasets stored as tables beneath (i.e. aqs.editors_bycountry, aqs.unique_devices, etc).

So, if there are no objections, how about aqs.config?

CREATE KEYSPACE aqs WITH replication = {'class': 'NetworkTopologyStrategy', 'eqiad': 3, 'codfw': 3}
CREATE TABLE    aqs.config (param text PRIMARY KEY, value text);
NOTE: I've chosen param/value for attribute names here as well

So, if there are no objections, how about aqs.config?

NOTE: I've chosen param/value for attribute names here as well

@Eevans, sounds great to me! Will update the Gerrit change accordingly.

For posterity sake: This wasn't anyone's first choice for storage of configuration state (as least as implemented on this cluster), but we are moving forward with it as a stop-gap until completion of T354557: Dataset Config Store. I've opened T355911: Replace use of the aqs.config table with Dataset Config Store to track replacing this with the final implementation, when it becomes ready.

@mforns the keyspace and table have been created.

The aqs and aqsloader users have permissions to read/write to everything¹, so the former will work for any AQS v1 (legacy) services, and the latter should work for Airflow jobs. Before closing this ticket though, I need the list of AQS service (or services) that will read from this. The AQS v2 services each have their own users, and so we'll need to grant SELECT to them that require it.

¹: Which is wrong, but I will follow up on that in another ticket 😀
Eevans triaged this task as Medium priority.Jan 25 2024, 11:45 PM

@mforns Before closing this ticket, can I get a list of the AQS service (or services) that will read from this so that I can add the apropos permissions to the AQS v2 users.

Change 994225 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] cassandra: cassandra roles for druid-based aqs endpoints

https://gerrit.wikimedia.org/r/994225

@mforns Before closing this ticket, can I get a list of the AQS service (or services) that will read from this so that I can add the apropos permissions to the AQS v2 users.

This would edit-analytics, and editor-analytics...

Change 989558 merged by Mforns:

[analytics/refinery@master] Add query to load MediaWiki snapshot to Cassandra AQS config table

https://gerrit.wikimedia.org/r/989558

mforns set the point value for this task to 8.Jan 31 2024, 5:07 PM

Change 994225 merged by Eevans:

[operations/puppet@production] cassandra: cassandra roles for druid-based aqs endpoints

https://gerrit.wikimedia.org/r/994225

Change 994814 had a related patch set uploaded (by Eevans; author: Eevans):

[labs/private@master] add (faux) creds for {edit,editor}_analytics roles

https://gerrit.wikimedia.org/r/994814

Change 994814 merged by Eevans:

[labs/private@master] add (faux) creds for {edit,editor}_analytics roles

https://gerrit.wikimedia.org/r/994814

Change 994815 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] aqs: apply {edit,editor}_analytics roles (users)

https://gerrit.wikimedia.org/r/994815

Change 994815 merged by Eevans:

[operations/puppet@production] aqs: apply {edit,editor}_analytics roles (users)

https://gerrit.wikimedia.org/r/994815

I've added two new roles (users) for the edit-analytics and editor-analytics services (edit_analytics and editor_analytics respectively). Credentials for these roles have been set in the private repository (where they can be templated to configuration during deploy).

cassandra@cqlsh> LIST ALL PERMISSIONS OF edit_analytics;

 role           | username       | resource           | permission
----------------+----------------+--------------------+------------
 edit_analytics | edit_analytics | <table aqs.config> |     SELECT

(1 rows)
cassandra@cqlsh> LIST ALL PERMISSIONS OF editor_analytics;

 role             | username         | resource           | permission
------------------+------------------+--------------------+------------
 editor_analytics | editor_analytics | <table aqs.config> |     SELECT

(1 rows)
cassandra@cqlsh>