JAllemandou (joal)
Data Engineer

Projects (13)

User Details

User Since
Feb 11 2015, 6:02 PM (386 w, 15 h)
Availability
Available
IRC Nick
joal
LDAP User
Unknown
MediaWiki User
JAllemandou (WMF) [ Global Accounts ]

Recent Activity

Tue, Jul 5

JAllemandou created T312151: Review iceberg settings and document choices.
Tue, Jul 5, 6:34 PM · Data-Engineering
JAllemandou added a comment to T306962: Use airflow to load cassandra.

Thank you for the heads-up @MoritzMuehlenhoff. The migration should be done in a matter of weeks so I'm confident we'll be done by September.

Tue, Jul 5, 5:59 PM · Data-Engineering, Airflow
JAllemandou merged T306962: Use airflow to load cassandra into T309995: Migrate all Cassandra Jobs.
Tue, Jul 5, 4:19 PM · Epic, Airflow
JAllemandou merged task T306962: Use airflow to load cassandra into T309995: Migrate all Cassandra Jobs.
Tue, Jul 5, 4:19 PM · Data-Engineering, Airflow
JAllemandou added a comment to T306962: Use airflow to load cassandra.

We have a duplicate of this task currently used to track the migration. @BTullis do you mind if I merge this one onto the other one?

Tue, Jul 5, 4:15 PM · Data-Engineering, Airflow

Mon, Jul 4

JAllemandou placed T311976: Investigate why airflow sensor tasks fail without sending errors up for grabs.
Mon, Jul 4, 3:11 PM · Airflow, Data Engineering Planning (Sprint 01), Data-Engineering-Kanban
JAllemandou created T311976: Investigate why airflow sensor tasks fail without sending errors.
Mon, Jul 4, 6:50 AM · Airflow, Data Engineering Planning (Sprint 01), Data-Engineering-Kanban

Thu, Jun 30

JAllemandou added a comment to T311525: Upgrade to latest PrestoDB and enable iceberg support.

No, superset staging doesn't use presto-test - there is almost no data nor computation power under that one. I don't have a better solution than deploying and testing, then possibly rolling back if too many problems show up :S

Thu, Jun 30, 4:33 PM · Data Engineering Planning (Sprint 01), Patch-For-Review, Data-Engineering-Kanban
JAllemandou added a comment to T311190: Establish testing procedure for Druid-based endpoints.

I forgot to add on this:

Thu, Jun 30, 1:23 PM · Data-Engineering, API Platform
JAllemandou added a comment to T311190: Establish testing procedure for Druid-based endpoints.
  1. I think I need a "spec" and matching data to ingest

Yes! We have examples of spec as well as data for you.
The hadoop-ingestion spec (templated) of the mediawiki_history_reduced dataset is here: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/mediawiki/history/reduced/load_mediawiki_history_reduced.json.template

Thu, Jun 30, 9:42 AM · Data-Engineering, API Platform
JAllemandou moved T306955: Spark3 migration - Currently existing airflow jobs from Ready to Deploy to Done on the Data-Engineering-Kanban board.
Thu, Jun 30, 8:14 AM · Airflow, Data-Engineering-Kanban, Data-Engineering
JAllemandou added a comment to T311525: Upgrade to latest PrestoDB and enable iceberg support.

I confirm it works for me! Let's maybe give it a try on the prod cluster and ask our end-users to check their queries/dashboards?

Thu, Jun 30, 7:54 AM · Data Engineering Planning (Sprint 01), Patch-For-Review, Data-Engineering-Kanban

Wed, Jun 29

JAllemandou moved T307935: [Airflow] Proof of concept of Cassandra loading from Next Up to Done on the Data-Engineering-Kanban board.
Wed, Jun 29, 4:15 PM · Data-Engineering-Kanban, Data-Engineering, Airflow
JAllemandou added a comment to T307935: [Airflow] Proof of concept of Cassandra loading.

This has been tested successfully - we'll need simple HQL jobs to load cassandra from now on :)

Wed, Jun 29, 4:14 PM · Data-Engineering-Kanban, Data-Engineering, Airflow
JAllemandou updated the task description for T307935: [Airflow] Proof of concept of Cassandra loading.
Wed, Jun 29, 4:14 PM · Data-Engineering-Kanban, Data-Engineering, Airflow

Tue, Jun 28

JAllemandou added a comment to T305600: Properly add aqsloader user (w/ secrets).

Now we (Data-Engineering) need to adapt when the aqsloader user comes with a password:

@JAllemandou Do you know if Data Engineering is working on this? Thanks : )

Tue, Jun 28, 4:34 PM · Data Engineering Planning, Data-Engineering-Kanban, Cassandra, User-Eevans
JAllemandou added a comment to T311525: Upgrade to latest PrestoDB and enable iceberg support.

Open question: Presto latest is 0.273.3 - Would we bump more than the minimal one for Iceberg?

Tue, Jun 28, 4:12 PM · Data Engineering Planning (Sprint 01), Patch-For-Review, Data-Engineering-Kanban
JAllemandou created T311508: Create Airflow jobs using HQL files to load cassandra.
Tue, Jun 28, 1:14 PM · Data Engineering Planning, Airflow
JAllemandou removed a project from T311507: Create cassandra loading HQL files from their oozie definition: Epic.
Tue, Jun 28, 1:11 PM · Data Engineering Planning (Sprint 01), Data-Engineering-Kanban, Airflow
JAllemandou placed T311507: Create cassandra loading HQL files from their oozie definition up for grabs.
Tue, Jun 28, 1:08 PM · Data Engineering Planning (Sprint 01), Data-Engineering-Kanban, Airflow
JAllemandou created T311507: Create cassandra loading HQL files from their oozie definition.
Tue, Jun 28, 1:08 PM · Data Engineering Planning (Sprint 01), Data-Engineering-Kanban, Airflow
JAllemandou added a comment to T311190: Establish testing procedure for Druid-based endpoints.

Hi @BPirkle - I'll gladly spend some time with you (and anyone interested) to explain more about Druid if needed :)

Tue, Jun 28, 6:56 AM · Data-Engineering, API Platform

Fri, Jun 24

JAllemandou added a comment to T311263: Investigate Gobblin dataloss during namenode failure.

Thanks a lot @Ottomata for the backfill.

Fri, Jun 24, 4:50 PM · Data-Engineering-Kanban

Thu, Jun 23

JAllemandou added a comment to T310542: [Airflow] Refactor HDFSArchiveOperator to run in Skein.

@Snwachukwu, following our talk about optimization, you may try:

  • to use the un-shaded job jar, which is lighter than the shaded one,

While this is feasible, it does not feel that simple: there is more in the shaded jar than just the hadoop dependencies, notably scala libs. Using the shaded jar should be a lot simpler. I don't know how it is done with the skein operator, but skein allows getting files from HDFS (see the code example and its comments in https://jcristharif.com/skein/distributing-files.html#specifying-files-for-a-service)

Thu, Jun 23, 9:27 AM · Patch-For-Review, Data Engineering Planning (Sprint 01), Data-Engineering-Kanban, Airflow
JAllemandou moved T310576: Update webrequest error thresholds from In Code Review to Done on the Data-Engineering-Kanban board.
Thu, Jun 23, 9:20 AM · Data-Engineering-Kanban, Data-Engineering
JAllemandou added a comment to T281483: mediawiki/page/properties-change schema should use map type for added and removed page properties.

Adding perspective on this: Using map instead of structured and defined schema allows for more flexibility and versatility, but prevents deeper validation. The decision of using one versus the other should not be made lightly :)
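
A minimal sketch of the trade-off described above, assuming a hypothetical set of known page properties (`displaytitle` and `page_image` are illustrative names, not taken from the actual schema). With a defined structure, unknown keys and wrong types can be rejected; with a map, only the value type can be checked.

```python
# Hypothetical property names and types, for illustration only.
KNOWN_PROPERTIES = {"displaytitle": str, "page_image": str}


def validate_struct(props):
    """Validate against a defined structure: keys and value types are checked."""
    errors = []
    for key, value in props.items():
        if key not in KNOWN_PROPERTIES:
            errors.append(f"unknown property: {key}")
        elif not isinstance(value, KNOWN_PROPERTIES[key]):
            errors.append(f"wrong type for property: {key}")
    return errors


def validate_map(props):
    """Validate as a map of string to string: any key is accepted."""
    return [
        f"non-string value for property: {key}"
        for key, value in props.items()
        if not isinstance(value, str)
    ]


sample = {"displaytitle": "Foo", "made_up_property": "x"}
struct_errors = validate_struct(sample)
map_errors = validate_map(sample)
print(struct_errors)
print(map_errors)
```

The map variant accepts the made-up property without complaint, which is the flexibility gained and the validation lost.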

Thu, Jun 23, 7:37 AM · Data-Engineering, Event-Platform, Analytics

Tue, Jun 21

JAllemandou added a comment to T309046: Airflow: pin dependency versions to prevent long installs.

I faced the same issue and the problem was due to a failed install of a previous package due to a missing dependency on the host (see https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow/Developer_guide#Setting_up_the_environment). I'd be interested to know if this issue was the same.

Tue, Jun 21, 5:31 PM · Data Engineering Planning, Airflow

Thu, Jun 16

JAllemandou added a project to T309229: Make Cassandra client encryption non-optional (AQS cluster): Data-Engineering-Radar.
Thu, Jun 16, 4:53 PM · Data-Engineering-Radar, Cassandra
JAllemandou added a comment to T309229: Make Cassandra client encryption non-optional (AQS cluster).

Question about AQS: should we wait for AQS-2.0 to do this instead of changing the old node code?

Thu, Jun 16, 4:51 PM · Data-Engineering-Radar, Cassandra
JAllemandou created T310820: Encrypt Spark-Cassandra connection.
Thu, Jun 16, 4:50 PM · Data Engineering Planning, Cassandra
JAllemandou moved T310297: Airflow DagProcessor not refreshing all dags from Incoming to Transform on the Data-Engineering board.
Thu, Jun 16, 4:27 PM · Data-Engineering, Data-Engineering-Kanban, Airflow
JAllemandou added a project to T310593: Experiencing pipeline failure due to disk-space issues: Data-Engineering.
Thu, Jun 16, 4:22 PM · Data-Engineering, GitLab
JAllemandou moved T300054: [Airflow] Add DAG subfolder name to error email's subject from In Review to Done on the Airflow board.
Thu, Jun 16, 4:21 PM · Airflow, Data-Engineering
JAllemandou added a project to T310576: Update webrequest error thresholds: Data-Engineering-Kanban.
Thu, Jun 16, 4:18 PM · Data-Engineering-Kanban, Data-Engineering
JAllemandou moved T310542: [Airflow] Refactor HDFSArchiveOperator to run in Skein from Backlog to Next Up on the Airflow board.
Thu, Jun 16, 4:16 PM · Patch-For-Review, Data Engineering Planning (Sprint 01), Data-Engineering-Kanban, Airflow
JAllemandou claimed T310576: Update webrequest error thresholds.
Thu, Jun 16, 4:14 PM · Data-Engineering-Kanban, Data-Engineering

Wed, Jun 15

JAllemandou awarded T310686: Re-enable CAS-SSO for hue.wikimedia.org a 100 token.
Wed, Jun 15, 11:35 AM · Infrastructure-Foundations, Data-Engineering-Kanban, Data-Engineering

Tue, Jun 14

JAllemandou added a comment to T309649: Analytics Data Lake - Hadoop Namenode failure - standby namenode backups filled up namenode data partition.

The gobblin problem is a known issue - we have set up alerts (that worked!) that cover us for this.

Tue, Jun 14, 4:46 PM · Data-Engineering-Kanban, Data-Engineering
JAllemandou moved T309993: Spark 3 Migration from Next Up to Q2 Epics on the Data-Engineering-Kanban board.
Tue, Jun 14, 4:17 PM · Data-Engineering-Kanban, Data-Engineering, Epic, Airflow
JAllemandou updated the task description for T310593: Experiencing pipeline failure due to disk-space issues.
Tue, Jun 14, 10:35 AM · Data-Engineering, GitLab
JAllemandou created T310593: Experiencing pipeline failure due to disk-space issues.
Tue, Jun 14, 10:32 AM · Data-Engineering, GitLab
JAllemandou created T310578: Build and install spark3 assembly.
Tue, Jun 14, 8:30 AM · Patch-For-Review, Data Engineering Planning (Sprint 01), Data-Engineering-Kanban
JAllemandou created T310576: Update webrequest error thresholds.
Tue, Jun 14, 8:05 AM · Data-Engineering-Kanban, Data-Engineering
JAllemandou moved T306955: Spark3 migration - Currently existing airflow jobs from In Code Review to Ready to Deploy on the Data-Engineering-Kanban board.
Tue, Jun 14, 7:38 AM · Airflow, Data-Engineering-Kanban, Data-Engineering
JAllemandou moved T306955: Spark3 migration - Currently existing airflow jobs from Done to In Code Review on the Data-Engineering-Kanban board.
Tue, Jun 14, 7:38 AM · Airflow, Data-Engineering-Kanban, Data-Engineering
JAllemandou added projects to T309993: Spark 3 Migration : Data-Engineering, Data-Engineering-Kanban.
Tue, Jun 14, 7:38 AM · Data-Engineering-Kanban, Data-Engineering, Epic, Airflow
JAllemandou added projects to T307935: [Airflow] Proof of concept of Cassandra loading: Data-Engineering, Data-Engineering-Kanban.
Tue, Jun 14, 7:37 AM · Data-Engineering-Kanban, Data-Engineering, Airflow
JAllemandou renamed T306955: Spark3 migration - Currently existing airflow jobs from Plan spark3 migration - possibly incrementally to Spark3 migration - Currently existing airflow jobs.
Tue, Jun 14, 7:37 AM · Airflow, Data-Engineering-Kanban, Data-Engineering
JAllemandou moved T306955: Spark3 migration - Currently existing airflow jobs from Done to In Review on the Airflow board.
Tue, Jun 14, 7:37 AM · Airflow, Data-Engineering-Kanban, Data-Engineering
JAllemandou moved T309993: Spark 3 Migration from In Review to Epics on the Airflow board.
Tue, Jun 14, 7:37 AM · Data-Engineering-Kanban, Data-Engineering, Epic, Airflow
JAllemandou moved T299559: Wikistats reports no mobile unique devices for Wikidata and MediaWiki.org from In Progress to Paused on the Data-Engineering-Kanban board.
Tue, Jun 14, 7:36 AM · Data-Engineering-Kanban, Analytics-Wikistats, Data-Engineering, Product-Analytics
JAllemandou moved T306955: Spark3 migration - Currently existing airflow jobs from In Progress to Done on the Data-Engineering-Kanban board.
Tue, Jun 14, 7:36 AM · Airflow, Data-Engineering-Kanban, Data-Engineering

Mon, Jun 13

JAllemandou moved T308766: Fix airflow interlanguage job from In Progress to Done on the Data-Engineering-Kanban board.
Mon, Jun 13, 3:12 PM · Airflow, Data-Engineering-Kanban, Data-Engineering
JAllemandou updated subscribers of T288301: AQS 2.0: Implement wikistats 2 endpoints.
Mon, Jun 13, 2:48 PM · Data-Engineering, User-Eevans, Platform Engineering Roadmap
JAllemandou added a comment to T288301: AQS 2.0: Implement wikistats 2 endpoints.

[edited with correct link - thanks @Milimetric ] Hi @BPirkle - I can't help with the entangled roots unfortunately - the poor warrior I am would not deal with any magic by any means :)

Mon, Jun 13, 7:54 AM · Data-Engineering, User-Eevans, Platform Engineering Roadmap

Thu, Jun 9

JAllemandou edited projects for T232795: We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. x-forwarded-for is null and client-ip is the same as IP on Webrequest data , added: Data-Engineering; removed Analytics.

Should we decline this ticket?

Thu, Jun 9, 1:21 PM · Data-Engineering-Icebox, Traffic-Icebox, SRE
JAllemandou closed T265516: Add cache to MaxMindDB setup as Resolved.

I think we can close this indeed. Thanks @BTullis .

Thu, Jun 9, 1:02 PM · Analytics
JAllemandou closed T303993: Add the commons-entity dataset to the refinery-drop-mediawiki-snapshots script as Resolved.
Thu, Jun 9, 6:52 AM · Data-Engineering-Kanban, Data-Engineering
JAllemandou closed T303988: Refactor refinery-drop-mediawiki-snapshots so that it no longer uses a _SUCCESS file, a subtask of T305591: Error with refinery-drop-mediawiki-snapshots: table specs not matching partitions for wmf/wikidata/entity and wmf/wikidata/item_page_link, as Resolved.
Thu, Jun 9, 6:52 AM · Data-Engineering
JAllemandou closed T303988: Refactor refinery-drop-mediawiki-snapshots so that it no longer uses a _SUCCESS file, a subtask of T306611: Review refinery scripts so that they no longer depend on _SUCCESS files, as Resolved.
Thu, Jun 9, 6:52 AM · Data-Engineering
JAllemandou closed T303988: Refactor refinery-drop-mediawiki-snapshots so that it no longer uses a _SUCCESS file as Resolved.
Thu, Jun 9, 6:52 AM · Data-Engineering-Kanban, Data-Engineering

Wed, Jun 8

JAllemandou awarded T309563: [Airflow] URLSensor might be preventing alerts to fire correctly a Hungry Hippo token.
Wed, Jun 8, 6:50 PM · Data Engineering Planning, Airflow

Jun 6 2022

JAllemandou awarded T309738: Move Mediawiki QueryPages computation to Hadoop a Party Time token.
Jun 6 2022, 3:14 PM · Data-Persistence (Consultation), Data-Engineering

Jun 2 2022

JAllemandou awarded T304373: Also intake Network Error Logging events into the Analytics Data Lake a Hungry Hippo token.
Jun 2 2022, 7:06 AM · Data-Engineering, SRE

May 25 2022

JAllemandou closed T304632: Add projects to sqoop list when synced in clouddb, a subtask of T302798: Prepare and check storage layer for shnwikivoyage, as Resolved.
May 25 2022, 8:57 PM · Data-Engineering, cloud-services-team (Kanban), Data-Services, DBA
JAllemandou closed T304632: Add projects to sqoop list when synced in clouddb, a subtask of T303761: Prepare and check storage layer for guwwiki, as Resolved.
May 25 2022, 8:57 PM · cloud-services-team (Kanban), Data-Engineering, Data-Services, DBA
JAllemandou closed T304632: Add projects to sqoop list when synced in clouddb as Resolved.

It is! Closing the task

May 25 2022, 8:57 PM · Data-Engineering

May 24 2022

JAllemandou moved T304478: Move wikireplicas dbproxy haproxy config to etcd from In Progress to Paused on the Data-Engineering-Kanban board.
May 24 2022, 6:41 PM · Patch-For-Review, Data-Engineering, Data-Services
JAllemandou moved T308168: [POC] Use airflow-installed Spark3 for an Airflow job from In Progress to Done on the Data-Engineering-Kanban board.
May 24 2022, 4:16 PM · Airflow, Data-Engineering-Kanban, Data-Engineering
JAllemandou moved T306895: Update HiveToCassandra job to read cassandra password from file from In Progress to Paused on the Data-Engineering-Kanban board.
May 24 2022, 4:13 PM · Data Engineering Planning (Sprint 01), Patch-For-Review, Cassandra
JAllemandou awarded T309097: We should have a top level maven project template based on wikimedia-discovery-discovery-parent-pom, a Yellow Medal token.
May 24 2022, 3:18 PM · Discovery-Search (Current work), Generated Data Platform

May 23 2022

JAllemandou added a comment to T306955: Spark3 migration - Currently existing airflow jobs.

Decisions for Spark3:

  • We're gonna merge and release the refinery-source patch bumping Spark and Scala as is, changing refinery-source version to 0.2.0 (not all jobs have been tested, the list is documented in the commit message)
  • We're gonna use this new refinery-source release to migrate existing Airflow jobs to Spark3, using the SparkNoCLIDriver in cluster mode instead of the skein in client mode deploy strategy. Some airflow hacking might be needed here.
  • The merge of the refinery-source code doesn't impact already running jobs as we reference jars by version. However it means that any new change to scala code needs to be done in Scala 2.12, and the related jobs need to be migrated to Spark3 (and therefore airflow). This shall push us to migrate to airflow faster :)
May 23 2022, 4:06 PM · Airflow, Data-Engineering-Kanban, Data-Engineering
JAllemandou reassigned T308168: [POC] Use airflow-installed Spark3 for an Airflow job from JAllemandou to Antoine_Quhen.
May 23 2022, 3:56 PM · Airflow, Data-Engineering-Kanban, Data-Engineering
JAllemandou moved T307447: Adapt maxExecutors value by Dag from In Progress to Done on the Airflow board.
May 23 2022, 3:42 PM · Airflow, Data-Engineering-Kanban, Data-Engineering
JAllemandou moved T304979: Update HDFS links tables as Mediawiki changes from In Progress to Paused on the Data-Engineering-Kanban board.
May 23 2022, 3:10 PM · Data-Engineering-Kanban, Research, Product-Analytics, Data-Engineering
JAllemandou created T308998: Investigate CPU usage on an-launcher1002.
May 23 2022, 7:05 AM · Data-Engineering

May 20 2022

JAllemandou added a comment to T308356: [Shared Event Platform] Ability to use Event Platform streams in Flink without boilerplate.

There is work in hive on that front but AFAICS it's for version 3+: https://github.com/apache/hive/blob/master/kafka-handler/README.md

May 20 2022, 7:17 AM · Patch-For-Review, Data-Engineering-Kanban, Data-Engineering, Event-Platform, Generated Data Platform

May 19 2022

JAllemandou moved T307447: Adapt maxExecutors value by Dag from Next Up to Done on the Data-Engineering-Kanban board.
May 19 2022, 5:10 PM · Airflow, Data-Engineering-Kanban, Data-Engineering
JAllemandou created T308767: Fix api_daily job.
May 19 2022, 3:38 PM · Patch-For-Review, Airflow, Data-Engineering-Kanban, Data-Engineering
JAllemandou created T308766: Fix airflow interlanguage job.
May 19 2022, 3:36 PM · Airflow, Data-Engineering-Kanban, Data-Engineering

May 13 2022

JAllemandou moved T307447: Adapt maxExecutors value by Dag from Backlog to In Progress on the Airflow board.
May 13 2022, 1:05 PM · Airflow, Data-Engineering-Kanban, Data-Engineering
JAllemandou claimed T307447: Adapt maxExecutors value by Dag.
May 13 2022, 1:05 PM · Airflow, Data-Engineering-Kanban, Data-Engineering
JAllemandou updated the task description for T308168: [POC] Use airflow-installed Spark3 for an Airflow job.
May 13 2022, 10:13 AM · Airflow, Data-Engineering-Kanban, Data-Engineering

May 12 2022

JAllemandou added a comment to T307799: Ensure AQS Cassandra client connections are multi-datacenter.

From what I have seen we can specify neither a load-balancing policy nor the local datacenter. BUT, from the docs: Connections are never made to data centers other than the data center of spark.cassandra.connection.host. https://github.com/datastax/spark-cassandra-connector/blob/master/doc/1_connecting.md

May 12 2022, 5:28 PM · Data-Engineering, Cassandra
JAllemandou added a subtask for T291464: Upgrade analytics-hadoop to Spark 3 + scala 2.12: T308168: [POC] Use airflow-installed Spark3 for an Airflow job.
May 12 2022, 9:54 AM · Epic, Data-Engineering
JAllemandou added a parent task for T308168: [POC] Use airflow-installed Spark3 for an Airflow job: T291464: Upgrade analytics-hadoop to Spark 3 + scala 2.12.
May 12 2022, 9:54 AM · Airflow, Data-Engineering-Kanban, Data-Engineering
JAllemandou closed T307779: Throttle big monthly jobs - network saturation as Resolved.
May 12 2022, 7:13 AM · Data-Engineering-Kanban, Data-Engineering
JAllemandou added a comment to T307799: Ensure AQS Cassandra client connections are multi-datacenter.

The spark-cassandra-connector indeed supports setting the consistency. It defaults to LOCAL_QUORUM for writes and LOCAL_ONE for reads.
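
A small sketch of how these settings could be passed to a Spark job. The two property names are, to my knowledge, the connector's consistency-level settings, and the default values are the ones stated above; the helper functions `consistency_conf` and `as_spark_submit_args` are hypothetical and only build command-line arguments.

```python
# Defaults as stated in the comment above (writes and reads respectively).
DEFAULT_CONSISTENCY = {
    "spark.cassandra.output.consistency.level": "LOCAL_QUORUM",  # writes
    "spark.cassandra.input.consistency.level": "LOCAL_ONE",      # reads
}


def consistency_conf(overrides=None):
    """Return the consistency settings, with optional explicit overrides."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULT_CONSISTENCY)
    if unknown:
        raise KeyError(f"unsupported settings: {sorted(unknown)}")
    conf = dict(DEFAULT_CONSISTENCY)
    conf.update(overrides)
    return conf


def as_spark_submit_args(conf):
    """Render settings as a flat list of '--conf key=value' arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


default_args = as_spark_submit_args(consistency_conf())
print(default_args)
```

Setting the values explicitly, even when they match the defaults, makes the intended consistency visible in the job definition.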

May 12 2022, 7:10 AM · Data-Engineering, Cassandra

May 11 2022

JAllemandou moved T307779: Throttle big monthly jobs - network saturation from Ready to Deploy to Done on the Data-Engineering-Kanban board.
May 11 2022, 6:19 PM · Data-Engineering-Kanban, Data-Engineering
JAllemandou moved T308168: [POC] Use airflow-installed Spark3 for an Airflow job from Next Up to In Progress on the Data-Engineering-Kanban board.
May 11 2022, 6:19 PM · Airflow, Data-Engineering-Kanban, Data-Engineering
JAllemandou created T308168: [POC] Use airflow-installed Spark3 for an Airflow job.
May 11 2022, 5:15 PM · Airflow, Data-Engineering-Kanban, Data-Engineering

May 10 2022

JAllemandou moved T307779: Throttle big monthly jobs - network saturation from In Code Review to Ready to Deploy on the Data-Engineering-Kanban board.
May 10 2022, 5:09 PM · Data-Engineering-Kanban, Data-Engineering
JAllemandou added a comment to T306962: Use airflow to load cassandra.

Actually there would be some difference, as using Spark3 would make the related HQL queries in the form:

INSERT INTO aqs.local_group_default_T_pageviews_per_project_v2.data SELECT ...

instead of just the select part. But the select part would be the same though...
The concern with waiting (I forgot to mention above) is that it prevents us from decommissioning the old AQS nodes.
If we decide to wait for spark3, we should update the oozie jobs.
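
As a rough illustration of the difference described above, the sketch below wraps an existing select into the INSERT INTO form. The helper name `to_insert_statement` and the sample select are hypothetical; only the target table name comes from the comment. It assumes the query starts with SELECT, so queries starting with a WITH clause would need extra handling.

```python
def to_insert_statement(target_table, select_query):
    """Prefix a SELECT query with INSERT INTO <target_table>."""
    select_query = select_query.strip().rstrip(";").rstrip()
    if not select_query.upper().startswith("SELECT"):
        raise ValueError("expected a query starting with SELECT")
    return f"INSERT INTO {target_table}\n{select_query}"


# Table name taken from the comment above; the SELECT body is a placeholder.
TARGET = "aqs.local_group_default_T_pageviews_per_project_v2.data"
statement = to_insert_statement(
    TARGET, "SELECT project, views FROM some_source_table;"
)
print(statement)
```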

May 10 2022, 7:49 AM · Data-Engineering, Airflow
JAllemandou closed T305591: Error with refinery-drop-mediawiki-snapshots: table specs not matching partitions for wmf/wikidata/entity and wmf/wikidata/item_page_link as Resolved.
May 10 2022, 7:32 AM · Data-Engineering
JAllemandou updated subscribers of T306962: Use airflow to load cassandra.

After a great talk with @Antoine_Quhen a wider discussion needs to happen: Spark3 offers the possibility to write to cassandra through SQL-like queries (see https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md).

May 10 2022, 7:12 AM · Data-Engineering, Airflow
JAllemandou added a comment to T307799: Ensure AQS Cassandra client connections are multi-datacenter.

@Eevans : The AQS-loader is not datacenter-aware. It takes base hosts as a parameter and gets the cassandra cluster topology by asking the known host(s). However I think it'd be good to restrict data loading from hadoop to just eqiad, to prevent pushing too much data at once across DCs. In my mind we would restrict sending data to just eqiad hosts, and let cassandra replicate the data (possibly with throughput limitation). Is that even feasible? Is it a good idea? Thoughts welcome :)

May 10 2022, 6:58 AM · Data-Engineering, Cassandra

May 9 2022

JAllemandou added a subtask for T306962: Use airflow to load cassandra: T307935: [Airflow] Proof of concept of Cassandra loading.
May 9 2022, 3:55 PM · Data-Engineering, Airflow
JAllemandou added a parent task for T307935: [Airflow] Proof of concept of Cassandra loading: T306962: Use airflow to load cassandra.
May 9 2022, 3:55 PM · Data-Engineering-Kanban, Data-Engineering, Airflow
JAllemandou renamed T307935: [Airflow] Proof of concept of Cassandra loading from Migrate Cassandra Jobs to Migrate Cassandra pageview-per-project-hourly Job.
May 9 2022, 3:54 PM · Data-Engineering-Kanban, Data-Engineering, Airflow
JAllemandou moved T300028: Low Risk Oozie Migration: APIs from Ready to Deploy to Done on the Data-Engineering-Kanban board.
May 9 2022, 3:05 PM · Data-Engineering-Kanban, Data-Engineering, Airflow