Antoine_Quhen (aqu)
User

User Details

User Since
Jan 4 2022, 1:16 PM (118 w, 6 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
AQuhen (WMF) [ Global Accounts ]

Recent Activity

Thu, Apr 4

Antoine_Quhen renamed T356762: [Refine refactoring] Extract refine schema management into a dedicated tool from [NEEDS GROOMING][SPIKE] Extract refine schema management into a dedicated tool to Extract refine schema management into a dedicated tool.
Thu, Apr 4, 3:57 PM · Data-Engineering (Q4 2024 April 1st - June 30th), Patch-For-Review

Wed, Mar 27

Antoine_Quhen added a comment to T360967: [Developer Experience] Implement CI hql Linting.

https://docs.sqlfluff.com/en/stable/dialects.html#hive

Wed, Mar 27, 11:08 AM · Data-Engineering (Q4 2024 April 1st - June 30th)

Tue, Mar 26

Antoine_Quhen closed T356192: [Refine refactoring] Refactor and migrate navigationtiming to Airflow as Resolved.
Tue, Mar 26, 4:18 PM · Data-Engineering (Sprint 9), Data Pipelines
Antoine_Quhen closed T356192: [Refine refactoring] Refactor and migrate navigationtiming to Airflow, a subtask of T307505: Refine jobs should be scheduled by Airflow, as Resolved.
Tue, Mar 26, 4:17 PM · Data-Engineering, Data Pipelines
Antoine_Quhen moved T356360: [Refine Refactoring] Orchestrate Airflow execution of navigationtiming from config store from In progress to Done on the Data-Engineering (Sprint 9) board.

Five datasets are being refined as a POC on the prod cluster, and two on the test cluster.

Tue, Mar 26, 4:17 PM · Data-Engineering (Sprint 9)
Antoine_Quhen added a comment to T357430: Airflow mapped tasks UI & metrics.

The Airflow PR has been merged and should be released in Airflow 2.9 in April.

Tue, Mar 26, 4:15 PM · Data-Engineering (Q4 2024 April 1st - June 30th)

Feb 22 2024

Antoine_Quhen closed T311111: Improve speed of Gitlab CI as Resolved.

Done. The last version was done with Blubber:

Feb 22 2024, 3:14 PM · Data-Engineering, GitLab (CI & Job Runners), Performance Issue

Feb 21 2024

Antoine_Quhen added a comment to T357873: Mediawiki_wikitext_history job often has long gaps between stages.

Some research:

  • Each XML dumps snapshot may represent ~5.5TB (including ~1.8TB for wikidata and 1.4TB for enwiki)
  • The Airflow sensor may take ~19 days to turn green. It waits until the last dump has been processed (_IMPORTED flag). Most dumps are generated in a matter of days (~4 on average, maybe). Enwiki may take 7 days. And they all wait for the wikidata dump (~19 days).
  • When the sensor turns green, a heavy Spark job is launched to convert all the compressed XML to parquet. ~5.5TB (compressed) is taking ~4.5 days to process.
  • The perceived gaps are due to the non-parallelism of the DAG combined with very long jobs: one heavy job prevents the others from running, due to the retries (thanks for the pointer, @JAllemandou). I think this is the same problem with other symptoms: https://phabricator.wikimedia.org/T342911
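
A rough back-of-the-envelope check of the figures above can be sketched as follows. It assumes decimal terabytes, a constant processing rate, and that the conversion starts right after the ~19-day wait; the variable names are illustrative only.

```python
# Figures quoted in the comment above (all approximate).
snapshot_tb = 5.5          # compressed XML per snapshot
wikidata_tb = 1.8          # wikidata share of the snapshot
enwiki_tb = 1.4            # enwiki share of the snapshot
sensor_wait_days = 19.0    # time until the last dump is flagged _IMPORTED
conversion_days = 4.5      # Spark XML-to-parquet conversion time

# Average conversion throughput, assuming a constant rate (an assumption).
seconds = conversion_days * 24 * 60 * 60
throughput_mb_per_s = snapshot_tb * 1e12 / seconds / 1e6
throughput_tb_per_day = snapshot_tb / conversion_days

# Share of the snapshot taken by the two largest wikis.
wikidata_share = wikidata_tb / snapshot_tb
enwiki_share = enwiki_tb / snapshot_tb

# Delay between the start of the wait and the end of the conversion.
end_to_end_days = sensor_wait_days + conversion_days

print(f"throughput: {throughput_mb_per_s:.1f} MB/s ({throughput_tb_per_day:.2f} TB/day)")
print(f"wikidata share: {wikidata_share:.0%}, enwiki share: {enwiki_share:.0%}")
print(f"end-to-end: {end_to_end_days} days")
```

Under these assumptions the conversion averages about 14 MB/s of compressed input, and roughly 23.5 days pass between the start of the wait and the end of the conversion.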
Feb 21 2024, 4:53 PM · Data Products, Data-Engineering, Movement-Insights

Feb 13 2024

Antoine_Quhen added a subtask for T356360: [Refine Refactoring] Orchestrate Airflow execution of navigationtiming from config store: T357430: Airflow mapped tasks UI & metrics.
Feb 13 2024, 3:39 PM · Data-Engineering (Sprint 9)
Antoine_Quhen added a parent task for T357430: Airflow mapped tasks UI & metrics: T356360: [Refine Refactoring] Orchestrate Airflow execution of navigationtiming from config store.
Feb 13 2024, 3:39 PM · Data-Engineering (Q4 2024 April 1st - June 30th)
Antoine_Quhen created T357430: Airflow mapped tasks UI & metrics.
Feb 13 2024, 3:39 PM · Data-Engineering (Q4 2024 April 1st - June 30th)

Feb 8 2024

Antoine_Quhen moved T352672: [Iceberg Migration] Migrate session length tables to Iceberg from In Review to Done on the Data-Engineering (Sprint 8) board.
Feb 8 2024, 5:10 PM · Data-Engineering (Sprint 8)

Feb 6 2024

Antoine_Quhen closed T356364: [Maintenance] Migrate Gitlab CI to blubber as Resolved.
Feb 6 2024, 1:24 PM · Data-Engineering (Sprint 8)
Antoine_Quhen closed T351792: Unblock Dockerfile syntax to build images with Gitlab trusted runner as Resolved.
Feb 6 2024, 1:24 PM · Patch-For-Review, collaboration-services, Release-Engineering-Team, GitLab (CI & Job Runners), Data-Engineering
Antoine_Quhen closed T351792: Unblock Dockerfile syntax to build images with Gitlab trusted runner, a subtask of T349532: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs, as Resolved.
Feb 6 2024, 1:24 PM · Data-Engineering (Sprint 6), Patch-For-Review

Feb 2 2024

Antoine_Quhen moved T356362: [Refine Refactoring] [Spike] Define a concept and provide a PoC for dynamic DAG execution in Airflow from Next Up to In progress on the Data-Engineering (Sprint 8) board.
Feb 2 2024, 10:29 AM · Data-Engineering (Q4 2024 April 1st - June 30th)
Antoine_Quhen changed the status of T351792: Unblock Dockerfile syntax to build images with Gitlab trusted runner from Open to In Progress.
Feb 2 2024, 10:08 AM · Patch-For-Review, collaboration-services, Release-Engineering-Team, GitLab (CI & Job Runners), Data-Engineering
Antoine_Quhen changed the status of T351792: Unblock Dockerfile syntax to build images with Gitlab trusted runner, a subtask of T349532: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs, from Open to In Progress.
Feb 2 2024, 10:08 AM · Data-Engineering (Sprint 6), Patch-For-Review
Antoine_Quhen added a comment to T356364: [Maintenance] Migrate Gitlab CI to blubber.

Linked with: https://phabricator.wikimedia.org/T351792

Feb 2 2024, 10:08 AM · Data-Engineering (Sprint 8)

Jan 31 2024

Antoine_Quhen renamed T356192: [Refine refactoring] Refactor and migrate navigationtiming to Airflow from Refactor and migrate navigationtiming to Airflow to [Refine refactoring] Refactor and migrate navigationtiming to Airflow.
Jan 31 2024, 3:52 PM · Data-Engineering (Sprint 9), Data Pipelines
Antoine_Quhen added a comment to T356030: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2024-01-15.

wmf.wikidata_item_page_link/snapshot=2024-01-15 looked fine from a row count perspective, but the 2024-01-22 snapshot was unexpectedly available and had zero rows:

select count(*) from wmf.wikidata_item_page_link where snapshot='2024-01-22';
...
+------+
| _c0  |
+------+
| 0    |
+------+
Jan 31 2024, 12:59 PM · Discovery-Search (Current work), Data-Engineering (Sprint 8), Image-Suggestions
Antoine_Quhen moved T356030: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2024-01-15 from Next Up to Done on the Data-Engineering (Sprint 8) board.
Jan 31 2024, 12:56 PM · Discovery-Search (Current work), Data-Engineering (Sprint 8), Image-Suggestions
Antoine_Quhen changed the status of T352672: [Iceberg Migration] Migrate session length tables to Iceberg from Open to In Progress.
Jan 31 2024, 12:54 PM · Data-Engineering (Sprint 8)
Antoine_Quhen changed the status of T352672: [Iceberg Migration] Migrate session length tables to Iceberg, a subtask of T333013: [Iceberg Migration] Apache Iceberg Migration, from Open to In Progress.
Jan 31 2024, 12:53 PM · Data-Engineering, Epic

Jan 30 2024

Antoine_Quhen updated Other Assignee for T356030: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2024-01-15, added: Antoine_Quhen.
Jan 30 2024, 11:12 AM · Discovery-Search (Current work), Data-Engineering (Sprint 8), Image-Suggestions
Antoine_Quhen updated the task description for T356030: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2024-01-15.
Jan 30 2024, 11:12 AM · Discovery-Search (Current work), Data-Engineering (Sprint 8), Image-Suggestions
Antoine_Quhen added a comment to T356030: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2024-01-15.

I've added the missing partitions in the source of wikidata_item_page_link and its missing snapshot is now generated.

Jan 30 2024, 11:11 AM · Discovery-Search (Current work), Data-Engineering (Sprint 8), Image-Suggestions
Antoine_Quhen edited projects for T356030: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2024-01-15, added: Data-Engineering (Sprint 7); removed Data-Engineering.
Jan 30 2024, 9:49 AM · Discovery-Search (Current work), Data-Engineering (Sprint 8), Image-Suggestions

Jan 29 2024

Antoine_Quhen moved T352672: [Iceberg Migration] Migrate session length tables to Iceberg from Next Up to In progress on the Data-Engineering (Sprint 7) board.
Jan 29 2024, 2:46 PM · Data-Engineering (Sprint 8)
Antoine_Quhen claimed T352672: [Iceberg Migration] Migrate session length tables to Iceberg.
Jan 29 2024, 2:46 PM · Data-Engineering (Sprint 8)
Antoine_Quhen moved T343232: Configure Airflow to send metrics to Prometheus from In Review to Done on the Data-Engineering (Sprint 7) board.
Jan 29 2024, 2:45 PM · Data-Engineering (Sprint 7), Data-Platform-SRE (2024.01.01 - 2024.01.21), Patch-For-Review, Observability-Metrics
Antoine_Quhen moved T354695: [Iceberg Migration] Define sensor concept and implementation plan from Radar (External Teams) to In Review on the Data-Engineering (Sprint 7) board.

https://docs.google.com/document/d/1upAje5lMawu4X6seRxI8Lx7YN-oHEzcEcm2fO6E5OH0/edit

Jan 29 2024, 1:55 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11), Data-Engineering (Sprint 7)
Antoine_Quhen added a comment to T338065: [Iceberg Migration] Implement mechanism for automatic Iceberg data deletion and optimization.

I would like to add rewrite_manifests to the list of maintenance actions:

Jan 29 2024, 10:25 AM · Data-Engineering

Jan 25 2024

Antoine_Quhen closed T347879: [Airflow Migration] Migrate Airflow Druid Jobs to Unique Devices Iceberg tables as Resolved.
Jan 25 2024, 2:58 PM · Data-Engineering (Sprint 7)
Antoine_Quhen closed T347879: [Airflow Migration] Migrate Airflow Druid Jobs to Unique Devices Iceberg tables, a subtask of T333013: [Iceberg Migration] Apache Iceberg Migration, as Resolved.
Jan 25 2024, 2:58 PM · Data-Engineering, Epic
Antoine_Quhen closed T355391: Fix refinery-source.refinery-core.Utilities::getValueForKey as Resolved.
Jan 25 2024, 2:57 PM · Data-Engineering (Sprint 7)
Antoine_Quhen moved T355391: Fix refinery-source.refinery-core.Utilities::getValueForKey from In Review to Done on the Data-Engineering (Sprint 7) board.
Jan 25 2024, 2:57 PM · Data-Engineering (Sprint 7)

Jan 22 2024

Antoine_Quhen updated subscribers of T354695: [Iceberg Migration] Define sensor concept and implementation plan.

@BTullis , @brouberol , @Stevemunene I would like your feedback on this subject:

Jan 22 2024, 10:02 AM · Data-Platform-SRE (2024.01.22 - 2024.02.11), Data-Engineering (Sprint 7)

Jan 19 2024

Antoine_Quhen claimed T355391: Fix refinery-source.refinery-core.Utilities::getValueForKey.
Jan 19 2024, 2:53 PM · Data-Engineering (Sprint 7)

Jan 17 2024

Antoine_Quhen moved T354695: [Iceberg Migration] Define sensor concept and implementation plan from Next Up to Radar (External Teams) on the Data-Engineering (Sprint 7) board.
Jan 17 2024, 11:24 AM · Data-Platform-SRE (2024.01.22 - 2024.02.11), Data-Engineering (Sprint 7)

Jan 16 2024

Antoine_Quhen moved T347879: [Airflow Migration] Migrate Airflow Druid Jobs to Unique Devices Iceberg tables from In progress to In Review on the Data-Engineering (Sprint 7) board.
Jan 16 2024, 2:20 PM · Data-Engineering (Sprint 7)

Jan 15 2024

Antoine_Quhen added a comment to T343232: Configure Airflow to send metrics to Prometheus.

Following the Grafana dashboard review, I've made some changes to it:

  • I distributed the graphs into 4 sections: Failures, Durations, Counts, Scheduling
  • I added missing parameterization of variables (e.g. the list of operators changes when the instance is selected)
  • I updated the TTLs in statsd-exporter so that Prometheus reflects the rate of the metrics sent by Airflow (e.g. when Airflow emits a metric saying that 1 task has failed, it generates 1 call, and we would like to get this single point in Prometheus)
  • I isolated task failures into their own graph
Jan 15 2024, 1:37 PM · Data-Engineering (Sprint 7), Data-Platform-SRE (2024.01.01 - 2024.01.21), Patch-For-Review, Observability-Metrics

Jan 9 2024

Antoine_Quhen created T354703: analytics/refinery scap deploy on test cluster fails with permission error.
Jan 9 2024, 9:17 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11), Data-Engineering
Antoine_Quhen added a comment to T351792: Unblock Dockerfile syntax to build images with Gitlab trusted runner.

1/ Splitting the CI

Jan 9 2024, 4:51 PM · Patch-For-Review, collaboration-services, Release-Engineering-Team, GitLab (CI & Job Runners), Data-Engineering

Dec 21 2023

Antoine_Quhen added a comment to T347879: [Airflow Migration] Migrate Airflow Druid Jobs to Unique Devices Iceberg tables.

In this ticket, I needed to find a way to detect when an Iceberg table has some data in it. This would replace the Hive partition sensor when migrating a table to Iceberg.
We chose to launch a Spark application running an SQL count. It's now implemented here:

A drawback of this solution is that it generates a FAILED Spark application each time the sensor does not find any data in the interval. When monitoring our Spark applications, we want to avoid artificially growing the FAILED counts.
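
The count-based check described above could look roughly like the sketch below. This is not the code referenced in the ticket: the function name, the `run_count` callable (standing in for whatever executes the SQL through Spark), and the table and column names are all hypothetical.

```python
from typing import Callable


def iceberg_interval_has_data(
    run_count: Callable[[str], int],  # hypothetical: runs a SQL count and returns the number
    table: str,
    time_column: str,
    start: str,
    end: str,
) -> bool:
    """Return True when the table holds at least one row in [start, end)."""
    query = (
        f"SELECT COUNT(1) FROM {table} "
        f"WHERE {time_column} >= '{start}' AND {time_column} < '{end}'"
    )
    return run_count(query) > 0


# Usage with stand-in executors (no Spark involved in this sketch).
empty = iceberg_interval_has_data(
    lambda sql: 0, "db.some_table", "day", "2023-12-01", "2023-12-02"
)
filled = iceberg_interval_has_data(
    lambda sql: 42, "db.some_table", "day", "2023-12-01", "2023-12-02"
)
print(empty, filled)
```

Returning a boolean keeps the "no data yet" case separate from a real error, which is the distinction that gets lost when an empty interval shows up as a FAILED application.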

Dec 21 2023, 3:39 PM · Data-Engineering (Sprint 7)

Dec 20 2023

Antoine_Quhen moved T349532: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs from In Review to Done on the Data-Engineering (Sprint 6) board.
Dec 20 2023, 3:19 PM · Data-Engineering (Sprint 6), Patch-For-Review
Antoine_Quhen added a comment to T353806: Airflow scheduler monitoring is broken since the most recent deploy.

https://github.com/wikimedia/operations-puppet/blob/f7c3eb56a9417571792b7636367f3c13e850bc83/modules/profile/manifests/airflow.pp#L199

Dec 20 2023, 2:37 PM · Data-Platform-SRE (2023.12.01 - 2023.12.31), Data-Engineering
Antoine_Quhen added a comment to T353806: Airflow scheduler monitoring is broken since the most recent deploy.

I think it's because the airflow-analytics jobs check should be run like the other commands: with a custom PYTHONPATH=/path/to/root/of/airflow-dags/ as an env variable.

Dec 20 2023, 2:35 PM · Data-Platform-SRE (2023.12.01 - 2023.12.31), Data-Engineering

Dec 18 2023

tchin awarded T336739: Post Oozie -> Airflow migration refactorings a Barnstar token.
Dec 18 2023, 3:12 PM · Patch-For-Review, Data-Engineering, Epic, Data Pipelines
Antoine_Quhen moved T347879: [Airflow Migration] Migrate Airflow Druid Jobs to Unique Devices Iceberg tables from In progress to In Review on the Data-Engineering (Sprint 6) board.

I have the first version of the code in review.

Dec 18 2023, 2:46 PM · Data-Engineering (Sprint 7)

Dec 5 2023

Antoine_Quhen claimed T347879: [Airflow Migration] Migrate Airflow Druid Jobs to Unique Devices Iceberg tables.
Dec 5 2023, 3:31 PM · Data-Engineering (Sprint 7)
Antoine_Quhen moved T347879: [Airflow Migration] Migrate Airflow Druid Jobs to Unique Devices Iceberg tables from Next Up to In progress on the Data-Engineering (Sprint 6) board.
Dec 5 2023, 3:31 PM · Data-Engineering (Sprint 7)

Dec 4 2023

Antoine_Quhen added a comment to T349532: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs.

In this puppet patch, we are adding configuration to send more Airflow metrics to Prometheus, and to customize them.

Dec 4 2023, 8:55 AM · Data-Engineering (Sprint 6), Patch-For-Review

Nov 30 2023

Antoine_Quhen moved T343232: Configure Airflow to send metrics to Prometheus from In progress to In Review on the Data-Engineering (Sprint 5) board.
Nov 30 2023, 5:04 PM · Data-Engineering (Sprint 7), Data-Platform-SRE (2024.01.01 - 2024.01.21), Patch-For-Review, Observability-Metrics
Antoine_Quhen moved T349532: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs from In progress to In Review on the Data-Engineering (Sprint 5) board.
Nov 30 2023, 5:04 PM · Data-Engineering (Sprint 6), Patch-For-Review

Nov 22 2023

Antoine_Quhen added a comment to T349532: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs.

The workaround for our GitLab CI problem is here: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/537

Nov 22 2023, 3:04 PM · Data-Engineering (Sprint 6), Patch-For-Review
Antoine_Quhen added a subtask for T349532: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs: T351792: Unblock Dockerfile syntax to build images with Gitlab trusted runner.
Nov 22 2023, 9:37 AM · Data-Engineering (Sprint 6), Patch-For-Review
Antoine_Quhen added a parent task for T351792: Unblock Dockerfile syntax to build images with Gitlab trusted runner: T349532: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs.
Nov 22 2023, 9:37 AM · Patch-For-Review, collaboration-services, Release-Engineering-Team, GitLab (CI & Job Runners), Data-Engineering
Antoine_Quhen added a comment to T349532: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs.

The puppet configuration is now merged, and statsd_exporter is running on an-test-client1002. Analytics Prometheus is scraping from it, as it should.

Nov 22 2023, 9:35 AM · Data-Engineering (Sprint 6), Patch-For-Review
Antoine_Quhen created T351792: Unblock Dockerfile syntax to build images with Gitlab trusted runner.
Nov 22 2023, 9:16 AM · Patch-For-Review, collaboration-services, Release-Engineering-Team, GitLab (CI & Job Runners), Data-Engineering

Nov 16 2023

Antoine_Quhen moved T349532: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs from Blocked/Paused to In progress on the Data-Engineering (Sprint 5) board.
Nov 16 2023, 10:15 AM · Data-Engineering (Sprint 6), Patch-For-Review
Antoine_Quhen moved T343232: Configure Airflow to send metrics to Prometheus from Blocked/Paused to In progress on the Data-Engineering (Sprint 5) board.
Nov 16 2023, 10:15 AM · Data-Engineering (Sprint 7), Data-Platform-SRE (2024.01.01 - 2024.01.21), Patch-For-Review, Observability-Metrics

Nov 13 2023

Antoine_Quhen claimed T343232: Configure Airflow to send metrics to Prometheus.
Nov 13 2023, 4:20 PM · Data-Engineering (Sprint 7), Data-Platform-SRE (2024.01.01 - 2024.01.21), Patch-For-Review, Observability-Metrics
Antoine_Quhen added a comment to T349763: [Data Quality] Develop Airflow post processing instrumentation to collect and log configurable data metrics.

We have multiple needs concerning the scheduling of Airflow dags & tasks.

Nov 13 2023, 3:06 PM · Data-Engineering (Sprint 8), Patch-For-Review
Antoine_Quhen moved T343232: Configure Airflow to send metrics to Prometheus from In progress to Blocked/Paused on the Data-Engineering (Sprint 5) board.
Nov 13 2023, 12:59 PM · Data-Engineering (Sprint 7), Data-Platform-SRE (2024.01.01 - 2024.01.21), Patch-For-Review, Observability-Metrics
Antoine_Quhen moved T349532: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs from In progress to Blocked/Paused on the Data-Engineering (Sprint 5) board.
Nov 13 2023, 12:53 PM · Data-Engineering (Sprint 6), Patch-For-Review

Nov 8 2023

Antoine_Quhen updated subscribers of T330176: [Data Platform] Deploy Spark History Service.
Nov 8 2023, 3:05 PM · Data-Engineering (Sprint 7), Patch-For-Review, Data-Platform-SRE

Nov 7 2023

Antoine_Quhen updated the task description for T350587: Remove Java 8 images from integration/config.
Nov 7 2023, 9:58 AM · Continuous-Integration-Infrastructure, Release-Engineering-Team

Oct 27 2023

Antoine_Quhen added a comment to T343232: Configure Airflow to send metrics to Prometheus.

Running locally with Docker, I managed to fetch a sample:

airflow_schedulerjob_start{instance="airflow-statsd-exporter:9102", job="statsd-exporter"} 3

Where:

  • The airflow_ prefix was added in Airflow.
  • airflow-statsd-exporter:9102 is an independent container
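
To make the structure of that sample explicit, here is a small sketch that splits it into metric name, labels, and value. The `parse_sample` helper is illustrative only and handles just this simple form of the Prometheus text format (one metric, quoted label values without escapes).

```python
import re

SAMPLE = (
    'airflow_schedulerjob_start{instance="airflow-statsd-exporter:9102", '
    'job="statsd-exporter"} 3'
)


def parse_sample(line: str) -> tuple[str, dict[str, str], float]:
    """Split 'name{label="value", ...} number' into its three parts."""
    match = re.fullmatch(r'(\w+)\{(.*)\}\s+(\S+)', line.strip())
    if match is None:
        raise ValueError(f"unsupported sample line: {line!r}")
    name, raw_labels, value = match.groups()
    # Each label is written as key="value"; collect them into a dict.
    labels = dict(re.findall(r'(\w+)="([^"]*)"', raw_labels))
    return name, labels, float(value)


name, labels, value = parse_sample(SAMPLE)
print(name, labels, value)
```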
Oct 27 2023, 1:42 PM · Data-Engineering (Sprint 7), Data-Platform-SRE (2024.01.01 - 2024.01.21), Patch-For-Review, Observability-Metrics
Antoine_Quhen added a comment to T349764: [Data Quality] Log selected Spark metrics and visualize on dashboard.

After all the discussion on this subject, I also think that publishing Spark metrics to Kafka (then exporting them to HDFS) is the most obvious first step.

Oct 27 2023, 7:57 AM · Data-Engineering

Oct 26 2023

Antoine_Quhen added a comment to T346280: [Data Quality] [SPIKE] Document Current Logging, Monitoring and Data Quality Checks for webrequests.

Wikitech page updated: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest

Oct 26 2023, 3:53 PM · Data Engineering and Event Platform Team (Sprint 4)

Oct 24 2023

Antoine_Quhen updated subscribers of T349532: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs.

Hello @fgiunchedi ! This ticket, when done, will add some Airflow metrics to Prometheus:

  • a very reasonable amount (less than a hundred thousand)
  • prefixed with the name of the airflow instance
  • passing through a statsd_exporter

Just to let you know 😄

Oct 24 2023, 4:44 PM · Data-Engineering (Sprint 6), Patch-For-Review
Antoine_Quhen moved T343232: Configure Airflow to send metrics to Prometheus from Next Up to In progress on the Data Engineering and Event Platform Team (Sprint 4) board.
Oct 24 2023, 3:54 PM · Data-Engineering (Sprint 7), Data-Platform-SRE (2024.01.01 - 2024.01.21), Patch-For-Review, Observability-Metrics
Antoine_Quhen added a project to T343232: Configure Airflow to send metrics to Prometheus: Data Engineering and Event Platform Team (Sprint 4).
Oct 24 2023, 3:54 PM · Data-Engineering (Sprint 7), Data-Platform-SRE (2024.01.01 - 2024.01.21), Patch-For-Review, Observability-Metrics
Antoine_Quhen updated Other Assignee for T343232: Configure Airflow to send metrics to Prometheus, added: Antoine_Quhen.
Oct 24 2023, 3:52 PM · Data-Engineering (Sprint 7), Data-Platform-SRE (2024.01.01 - 2024.01.21), Patch-For-Review, Observability-Metrics
Antoine_Quhen added a subtask for T343232: Configure Airflow to send metrics to Prometheus: T349532: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs.
Oct 24 2023, 8:31 AM · Data-Engineering (Sprint 7), Data-Platform-SRE (2024.01.01 - 2024.01.21), Patch-For-Review, Observability-Metrics
Antoine_Quhen added a parent task for T349532: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs: T343232: Configure Airflow to send metrics to Prometheus.
Oct 24 2023, 8:31 AM · Data-Engineering (Sprint 6), Patch-For-Review
JAllemandou awarded T349532: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs a Yellow Medal token.
Oct 24 2023, 8:28 AM · Data-Engineering (Sprint 6), Patch-For-Review
Antoine_Quhen claimed T349532: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs.
Oct 24 2023, 8:18 AM · Data-Engineering (Sprint 6), Patch-For-Review

Oct 23 2023

Antoine_Quhen added a comment to T297231: [Data Quality] Sending Apache Spark metrics to PushGateway.

Yes, Druid is an option. There is a Grafana collector, and Druid is designed for time-series.

Oct 23 2023, 12:06 PM · Data-Engineering, Observability-Metrics

Oct 20 2023

Antoine_Quhen added a comment to T297231: [Data Quality] Sending Apache Spark metrics to PushGateway.

@Ottomata , yes, collecting to HDFS is completely possible, e.g. with a Spark metrics KafkaSink (I've seen one) + Gobblin.

Oct 20 2023, 3:36 PM · Data-Engineering, Observability-Metrics
Antoine_Quhen updated subscribers of T304615: Airflow scheduler and webserver logs should be readable by airflow instance admins.
Oct 20 2023, 2:18 PM · Data-Engineering, Data-Platform-SRE
Antoine_Quhen updated subscribers of T304615: Airflow scheduler and webserver logs should be readable by airflow instance admins.
Oct 20 2023, 2:17 PM · Data-Engineering, Data-Platform-SRE

Oct 18 2023

Antoine_Quhen moved T297231: [Data Quality] Sending Apache Spark metrics to PushGateway from In progress to In Review on the Data Engineering and Event Platform Team (Sprint 3) board.

As a closing word, two serious options remain:

  1. Prometheus with a low resolution (1 point per minute for any metric). This is OK to monitor evolution between runs but usually not for debugging a Spark job.
  2. Send metrics data with high time resolution to a new ad-hoc backend accessible from Grafana (InfluxDB or PG are good candidates)
Oct 18 2023, 2:35 PM · Data-Engineering, Observability-Metrics

Oct 17 2023

Antoine_Quhen added a comment to T297231: [Data Quality] Sending Apache Spark metrics to PushGateway.

It is good to know that the estimated metrics cardinality looks OK for the current Prometheus setup.

Oct 17 2023, 12:35 PM · Data-Engineering, Observability-Metrics
Antoine_Quhen closed T306193: Cleanup analytics/refinery/source pom.files as Declined.
Oct 17 2023, 8:46 AM · Data-Engineering

Oct 16 2023

Antoine_Quhen added a comment to T343232: Configure Airflow to send metrics to Prometheus.

@Ahoelzl , as a complement to T297231 , we could get from those Airflow metrics:

  • monitoring/alerting of the global number of failures (usually hidden by the retries)
  • monitoring/alerting of the duration of specific tasks (like the Spark ones)
Oct 16 2023, 8:56 AM · Data-Engineering (Sprint 7), Data-Platform-SRE (2024.01.01 - 2024.01.21), Patch-For-Review, Observability-Metrics
Antoine_Quhen added a comment to T321512: Install jupyterhub separately from conda-analytics.

+1 for the simplicity induced in the conda-analytics package.

Oct 16 2023, 8:42 AM · Data-Engineering, Data-Platform-SRE, Data Pipelines
Antoine_Quhen closed T347491: Scap deployment on Hadoop test cluster broken as Resolved.

Thanks @BTullis ! I've tested it. It's now resolved!

Oct 16 2023, 8:32 AM · Data-Platform-SRE, Data Engineering and Event Platform Team

Oct 13 2023

Antoine_Quhen moved T348504: [Data Platform] Update referer job to use global country deny list instead of a hard-coded one from Blocked/Paused to In Review on the Data Engineering and Event Platform Team (Sprint 3) board.
Oct 13 2023, 3:57 PM · Data Engineering and Event Platform Team (Sprint 3)

Oct 12 2023

Antoine_Quhen added a comment to T347586: [Maintenance] Delete sanitized events removed from sanitization list.

Here is the list of events concerned:

  • Edit
  • Diacritics*
  • Some prefixed with MobileWeb:
    • MobileWebBrowse
    • MobileWebClickTracking
    • MobileWebCta
    • MobileWebDiffClickTracking
    • MobileWebEditing
    • MobileWebInfobox
    • MobileWebLanguageSwitcher
    • MobileWebMainMenuClickTracking
    • MobileWebSectionUsage
    • MobileWebShareButton
    • MobileWebUIClickTracking
    • MobileWebUploads
    • MobileWebWatching*
    • MobileWebWikiGrok*
  • TaskRecommendation*
  • TestSearchSatisfaction
  • VET135171
  • WikipediaZeroUsage
Oct 12 2023, 1:42 PM · Data-Engineering (Sprint 8)
Antoine_Quhen moved T348504: [Data Platform] Update referer job to use global country deny list instead of a hard-coded one from In progress to Blocked/Paused on the Data Engineering and Event Platform Team (Sprint 3) board.
Oct 12 2023, 12:58 PM · Data Engineering and Event Platform Team (Sprint 3)
Antoine_Quhen added a comment to T297231: [Data Quality] Sending Apache Spark metrics to PushGateway.

Let's summarize our current problem.

Oct 12 2023, 10:14 AM · Data-Engineering, Observability-Metrics

Oct 11 2023

Antoine_Quhen moved T348504: [Data Platform] Update referer job to use global country deny list instead of a hard-coded one from Next Up to In progress on the Data Engineering and Event Platform Team (Sprint 3) board.
Oct 11 2023, 8:12 AM · Data Engineering and Event Platform Team (Sprint 3)
Antoine_Quhen claimed T348504: [Data Platform] Update referer job to use global country deny list instead of a hard-coded one.
Oct 11 2023, 8:11 AM · Data Engineering and Event Platform Team (Sprint 3)

Oct 10 2023

Antoine_Quhen added a comment to T347296: Add Antoine_Quhen to the deployment group .

@Jelto done. wikitech username & email address checked. Thanks!

Oct 10 2023, 8:38 AM · SRE, SRE-Access-Requests, Data-Engineering, Data-Platform-SRE, Event-Platform, Data Engineering and Event Platform Team
Antoine_Quhen updated the task description for T347296: Add Antoine_Quhen to the deployment group .
Oct 10 2023, 8:37 AM · SRE, SRE-Access-Requests, Data-Engineering, Data-Platform-SRE, Event-Platform, Data Engineering and Event Platform Team

Sep 29 2023

Antoine_Quhen updated subscribers of T345912: [Data Quality] SDS3.3 - Logging, Monitoring and Alerting Improvements for Data Quality Incidents.
Sep 29 2023, 2:42 PM · Data-Engineering, Epic
Antoine_Quhen updated subscribers of T347706: [Data Quality] [SPIKE] Document Current Logging, Monitoring and Data Quality Checks for Unique Devices.
Sep 29 2023, 2:31 PM · Data Engineering and Event Platform Team (Sprint 4)

Sep 28 2023

Antoine_Quhen moved T346280: [Data Quality] [SPIKE] Document Current Logging, Monitoring and Data Quality Checks for webrequests from In progress to In Review on the Data Engineering and Event Platform Team (Sprint 2) board.

Here is the first version of it. Please review in the gdoc: https://docs.google.com/document/d/1clSe6bnIxJUdd2LaFtQ-_MXGIRFBOzKNVZ2130qfGqI/edit

Sep 28 2023, 8:37 AM · Data Engineering and Event Platform Team (Sprint 4)