
Antoine_Quhen (aqu)
User

User Details

User Since
Jan 4 2022, 1:16 PM (213 w, 5 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
AQuhen (WMF) [ Global Accounts ]

Recent Activity

Wed, Feb 4

Antoine_Quhen added a comment to T411999: Migrate cleanup jobs for snapshot datasets from systemd timers to Airflow.

On this ticket, we have consulted both SREs and our team. We have agreed on the following details:

  • extract a single repo, refinery-python, from analytics/refinery to GitLab
  • multiple CI outputs for this repo, each built by its own pipeline depending on the need:
    • docker image (seems like the best solution for Airflow triggering)
    • conda package
  • the repo could also be pip-compatible, so that conda-analytics or airflow-dags can eventually require it
Wed, Feb 4, 9:58 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Tue, Feb 3

Antoine_Quhen created T416315: Fix db_cleaner for small instances.
Tue, Feb 3, 10:36 AM · Data-Engineering

Mon, Feb 2

Antoine_Quhen created T416248: GitLab Private Repository Request for: Data Engineering Bot Detection Pipeline.
Mon, Feb 2, 9:22 PM · User-brennen, Release-Engineering-Team, GitLab

Thu, Jan 29

Antoine_Quhen placed T392668: refine_to_hive dag optimizations up for grabs.
Thu, Jan 29, 1:47 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen placed T381072: [Refine Simplification] Remove Schema Merging in Refine Process by Enforcing Backward Compatibility up for grabs.
Thu, Jan 29, 1:46 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen placed T400632: Missing/inconsistent page_redirect_target field for redirects in Mediawiki content current v1 dumps up for grabs.
Thu, Jan 29, 12:55 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), DPE-Mediawiki-Content
Antoine_Quhen moved T415874: Extract bot classification into new repo from Next Up to In progress on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Thu, Jan 29, 8:36 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen created T415874: Extract bot classification into new repo.
Thu, Jan 29, 8:36 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Wed, Jan 28

Antoine_Quhen closed T411992: Reduce main Airflow DB size and consider splitting heavy workloads into separate instances, a subtask of T411988: Airflow main performance instance optimization, as Resolved.
Wed, Jan 28, 7:01 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen closed T411992: Reduce main Airflow DB size and consider splitting heavy workloads into separate instances as Resolved.

Closing. The next optimization could be to split the file_exporters job out of the main instance. That should be done in another ticket.

Wed, Jan 28, 7:01 PM · Patch-For-Review, Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen updated subscribers of T411999: Migrate cleanup jobs for snapshot datasets from systemd timers to Airflow.

Following a discussion with @Ahoelzl, we can postpone this to Q4.

Wed, Jan 28, 6:59 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen placed T411999: Migrate cleanup jobs for snapshot datasets from systemd timers to Airflow up for grabs.
Wed, Jan 28, 6:57 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen added a comment to T415357: Upgrade DataHub CLI virtualenv used by metadata_ingest_daily to restore Druid ingestion.

I've marked all failed DAG runs as success to clear the UI.

Wed, Jan 28, 6:38 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen moved T411989: Optimize canary event generation resources consumption on Airflow from In Review to Done on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.

K8s execution is deployed, but we are not observing the overall performance gain we expected. We later tweaked these two Airflow configs:

  • worker_pods_creation_batch_size
  • worker_pods_queued_check_interval

At least each task now consumes fewer k8s resources.
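
As an illustration of how the two settings above can be overridden, here is a minimal sketch that builds the corresponding Airflow environment variable names using the AIRFLOW__{SECTION}__{KEY} convention. The section name (kubernetes_executor, called kubernetes in older Airflow releases) and the numeric values are assumptions, not the values actually deployed.

```python
# Sketch only: express the two KubernetesExecutor settings as Airflow
# environment-variable overrides. Section name and values are assumptions.

def airflow_env_var(section: str, key: str) -> str:
    """Build the AIRFLOW__{SECTION}__{KEY} name Airflow reads overrides from."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


# Assumption: recent Airflow releases use "kubernetes_executor";
# older releases used "kubernetes" for the same options.
SECTION = "kubernetes_executor"

# Hypothetical values, chosen only to make the example concrete.
HYPOTHETICAL_SETTINGS = {
    "worker_pods_creation_batch_size": 8,     # pods created per scheduler loop
    "worker_pods_queued_check_interval": 30,  # seconds between queued-pod checks
}


def build_overrides(settings: dict) -> dict:
    """Map each setting to its environment-variable name and string value."""
    return {airflow_env_var(SECTION, key): str(value)
            for key, value in settings.items()}


overrides = build_overrides(HYPOTHETICAL_SETTINGS)
for name, value in sorted(overrides.items()):
    print(f"{name}={value}")
```

The printed lines could be placed in a deployment's environment section. The right values depend on how many pods the cluster can schedule at once.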

Wed, Jan 28, 3:43 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Tue, Jan 27

Antoine_Quhen closed T415357: Upgrade DataHub CLI virtualenv used by metadata_ingest_daily to restore Druid ingestion as Resolved.
Tue, Jan 27, 3:59 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen added a comment to T415357: Upgrade DataHub CLI virtualenv used by metadata_ingest_daily to restore Druid ingestion.

Build creation has been moved here: https://gitlab.wikimedia.org/repos/data-engineering/datahub-cli

Tue, Jan 27, 3:41 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen added a comment to T411999: Migrate cleanup jobs for snapshot datasets from systemd timers to Airflow.

With T415357 I’ve already started extracting the Python conda environment build for analytics/refinery into GitLab CI.

Tue, Jan 27, 1:13 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen added a comment to T414953: Refine generates very large XCOM values.
  • Option 1:
    • downstreaming xcoms (keeping only the necessary fields)
    • computing parameters for each task in a pre_execution function (1 for each task)
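
A rough sketch of what Option 1 could look like, in plain Python and without any Airflow dependency: the payload is trimmed to the fields downstream tasks need, and each task derives its own parameters from the trimmed value. The field names, the payload shape, and both function names are hypothetical, since the actual Refine XCom content is not shown in this comment.

```python
from typing import Any, Dict, Tuple

# Hypothetical field names: the real Refine XCom payload is not shown in the ticket.
NEEDED_FIELDS: Tuple[str, ...] = ("table", "partition")


def trim_for_xcom(payload: Dict[str, Any],
                  needed: Tuple[str, ...] = NEEDED_FIELDS) -> Dict[str, Any]:
    """Keep only the fields downstream tasks need, so the stored XCom stays small."""
    return {key: payload[key] for key in needed if key in payload}


def pre_execution_params(trimmed: Dict[str, Any]) -> Dict[str, str]:
    """Hypothetical per-task hook: derive task parameters from the trimmed XCom."""
    return {"target": f"{trimmed['table']}/{trimmed['partition']}"}


full_payload = {
    "table": "event.example_table",            # hypothetical value
    "partition": "year=2026/month=1",          # hypothetical value
    "raw_schema": {"fields": ["..."] * 1000},  # large field not needed downstream
}

trimmed = trim_for_xcom(full_payload)
params = pre_execution_params(trimmed)
print(sorted(trimmed))
print(params["target"])
```

Only the trimmed dictionary would be pushed as an XCom; the large field never reaches the metadata database.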
Tue, Jan 27, 11:09 AM · Data-Platform-SRE (2026.01.23 - 2026.02.13), Essential-Work, Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen added a comment to T411989: Optimize canary event generation resources consumption on Airflow.

The first deploy crashed and was reverted.
I was blocked by a missing connection from k8s to the eventgates.
This was fixed by SREs: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1229524
Now testing on dev-env before retrying the deploy.

Tue, Jan 27, 10:47 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Mon, Jan 26

Antoine_Quhen moved T415357: Upgrade DataHub CLI virtualenv used by metadata_ingest_daily to restore Druid ingestion from In progress to In Review on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Mon, Jan 26, 1:05 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Fri, Jan 23

Antoine_Quhen claimed T415357: Upgrade DataHub CLI virtualenv used by metadata_ingest_daily to restore Druid ingestion.
Fri, Jan 23, 1:55 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen created T415357: Upgrade DataHub CLI virtualenv used by metadata_ingest_daily to restore Druid ingestion.
Fri, Jan 23, 12:44 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Thu, Jan 22

Antoine_Quhen moved T414953: Refine generates very large XCOM values from Next Up to In progress on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Thu, Jan 22, 6:27 PM · Data-Platform-SRE (2026.01.23 - 2026.02.13), Essential-Work, Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen claimed T414953: Refine generates very large XCOM values.
Thu, Jan 22, 6:26 PM · Data-Platform-SRE (2026.01.23 - 2026.02.13), Essential-Work, Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Tue, Jan 20

Antoine_Quhen closed T407103: SDS 1.3.6 SPUR bot detection analysis, a subtask of T408656: SDS 1.3.6 Improved bot detection using Spur data set, as Resolved.
Tue, Jan 20, 4:43 PM · Data-Engineering-Roadmap, Epic, OKR-Work
Antoine_Quhen closed T407103: SDS 1.3.6 SPUR bot detection analysis as Resolved.
Tue, Jan 20, 4:43 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen moved T411999: Migrate cleanup jobs for snapshot datasets from systemd timers to Airflow from Next Up to In progress on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Tue, Jan 20, 3:58 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen removed a project from T414815: Migrate refinery-drop-older-than from refinery to Airflow: Data-Engineering (Q3 FY25/26 January 1st - March 31th).
Tue, Jan 20, 2:16 PM
Antoine_Quhen closed T414815: Migrate refinery-drop-older-than from refinery to Airflow as Declined.

Duplicate of T411999

Tue, Jan 20, 2:16 PM

Mon, Jan 19

Antoine_Quhen moved T411992: Reduce main Airflow DB size and consider splitting heavy workloads into separate instances from In Review to Done on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.

Deployment of db_cleaner dag on Airflow instances went mostly well.

Mon, Jan 19, 10:31 AM · Patch-For-Review, Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen moved T414810: Switch Refine DAG priority weights to absolute from In Review to Done on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Mon, Jan 19, 10:20 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Sat, Jan 17

Antoine_Quhen claimed T411999: Migrate cleanup jobs for snapshot datasets from systemd timers to Airflow.
Sat, Jan 17, 1:43 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen moved T414810: Switch Refine DAG priority weights to absolute from In progress to In Review on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Sat, Jan 17, 1:27 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen moved T414810: Switch Refine DAG priority weights to absolute from Next Up to In progress on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Sat, Jan 17, 1:03 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Fri, Jan 16

Antoine_Quhen created T414815: Migrate refinery-drop-older-than from refinery to Airflow.
Fri, Jan 16, 3:35 PM
Antoine_Quhen created T414810: Switch Refine DAG priority weights to absolute.
Fri, Jan 16, 3:09 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Tue, Jan 13

Antoine_Quhen moved T414363: SDS 1.3.5 Basic Client Side Signal Analysis from In progress to Done on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.

Last notebook is here:
https://gitlab.wikimedia.org/hghani/movement-insights-requests/-/blob/main/SDS%201.3/client-side/simple-client_analysis_summary.ipynb?ref_type=heads
I reviewed it.

Tue, Jan 13, 3:26 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Mon, Jan 12

Antoine_Quhen edited projects for T381072: [Refine Simplification] Remove Schema Merging in Refine Process by Enforcing Backward Compatibility, added: Data-Engineering (Q3 FY25/26 January 1st - March 31th); removed Data-Engineering (Q2 FY25/26 October 1st - December 31th).
Mon, Jan 12, 11:05 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen closed T375064: Move more of refine_hive_hourly dag logic into RefineConfiguration, a subtask of T356762: [Refine refactoring] Refine jobs should be scheduled by Airflow: implementation, as Declined.
Mon, Jan 12, 11:04 PM · Data-Engineering (Q2 2024 October 1st - December 31th), Patch-For-Review
Antoine_Quhen closed T375064: Move more of refine_hive_hourly dag logic into RefineConfiguration, a subtask of T369845: [Refine Refactoring] Refine jobs should be scheduled by Airflow: deployment, as Declined.
Mon, Jan 12, 11:04 PM · Data-Engineering (Q1 FY25/26 July 1st - September 30th), Patch-For-Review
Antoine_Quhen closed T375064: Move more of refine_hive_hourly dag logic into RefineConfiguration, a subtask of T392668: refine_to_hive dag optimizations, as Declined.
Mon, Jan 12, 11:04 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen closed T375064: Move more of refine_hive_hourly dag logic into RefineConfiguration, a subtask of T411988: Airflow main performance instance optimization, as Declined.
Mon, Jan 12, 11:04 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen closed T375064: Move more of refine_hive_hourly dag logic into RefineConfiguration as Declined.

As we are discussing the limits of the current system, which is putting strain on Airflow, this now-old refactoring no longer seems to be a priority.

Mon, Jan 12, 11:04 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Antoine_Quhen edited projects for T392668: refine_to_hive dag optimizations, added: Data-Engineering (Q3 FY25/26 January 1st - March 31th); removed Data-Engineering (Q2 FY25/26 October 1st - December 31th).
Mon, Jan 12, 10:53 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen edited projects for T400632: Missing/inconsistent page_redirect_target field for redirects in Mediawiki content current v1 dumps, added: Data-Engineering (Q3 FY25/26 January 1st - March 31th); removed Data-Engineering (Q2 FY25/26 October 1st - December 31th).
Mon, Jan 12, 10:51 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), DPE-Mediawiki-Content
Antoine_Quhen moved T411992: Reduce main Airflow DB size and consider splitting heavy workloads into separate instances from Next Up to In Review on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Mon, Jan 12, 10:26 PM · Patch-For-Review, Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen claimed T411992: Reduce main Airflow DB size and consider splitting heavy workloads into separate instances.
Mon, Jan 12, 10:21 PM · Patch-For-Review, Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen moved T414363: SDS 1.3.5 Basic Client Side Signal Analysis from Next Up to In progress on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Mon, Jan 12, 4:57 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen changed the status of T414363: SDS 1.3.5 Basic Client Side Signal Analysis from Open to In Progress.
Mon, Jan 12, 4:57 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen created T414363: SDS 1.3.5 Basic Client Side Signal Analysis.
Mon, Jan 12, 4:56 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Dec 23 2025

Antoine_Quhen moved T392668: refine_to_hive dag optimizations from In progress to Next Up on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Dec 23 2025, 5:05 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen claimed T411989: Optimize canary event generation resources consumption on Airflow.
Dec 23 2025, 2:48 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Dec 10 2025

Antoine_Quhen added a comment to T411990: Analyze and optimize Airflow Postgres backend performance.

Patch adding pg_stat to analytics_test https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1217138

Dec 10 2025, 11:06 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Dec 8 2025

Antoine_Quhen closed T406526: GobblinLastSuccessfulRunTooLongAgo alerts as Resolved.
Dec 8 2025, 4:31 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Antoine_Quhen moved T400632: Missing/inconsistent page_redirect_target field for redirects in Mediawiki content current v1 dumps from In progress to Next Up on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Dec 8 2025, 4:30 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), DPE-Mediawiki-Content
Antoine_Quhen created T411999: Migrate cleanup jobs for snapshot datasets from systemd timers to Airflow.
Dec 8 2025, 10:28 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen added a parent task for T375064: Move more of refine_hive_hourly dag logic into RefineConfiguration: T411988: Airflow main performance instance optimization.
Dec 8 2025, 10:02 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Antoine_Quhen added a parent task for T392668: refine_to_hive dag optimizations: T411988: Airflow main performance instance optimization.
Dec 8 2025, 10:02 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen added a parent task for T411989: Optimize canary event generation resources consumption on Airflow: T411988: Airflow main performance instance optimization.
Dec 8 2025, 10:02 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen added a parent task for T411990: Analyze and optimize Airflow Postgres backend performance: T411988: Airflow main performance instance optimization.
Dec 8 2025, 10:02 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen added a parent task for T411992: Reduce main Airflow DB size and consider splitting heavy workloads into separate instances: T411988: Airflow main performance instance optimization.
Dec 8 2025, 10:02 AM · Patch-For-Review, Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen added subtasks for T411988: Airflow main performance instance optimization: T392668: refine_to_hive dag optimizations, T375064: Move more of refine_hive_hourly dag logic into RefineConfiguration, T411989: Optimize canary event generation resources consumption on Airflow, T411990: Analyze and optimize Airflow Postgres backend performance, T411992: Reduce main Airflow DB size and consider splitting heavy workloads into separate instances.
Dec 8 2025, 10:02 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen created T411992: Reduce main Airflow DB size and consider splitting heavy workloads into separate instances.
Dec 8 2025, 10:02 AM · Patch-For-Review, Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen created T411990: Analyze and optimize Airflow Postgres backend performance.
Dec 8 2025, 10:00 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen created T411989: Optimize canary event generation resources consumption on Airflow.
Dec 8 2025, 9:58 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen updated the task description for T392668: refine_to_hive dag optimizations.
Dec 8 2025, 9:55 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen created T411988: Airflow main performance instance optimization.
Dec 8 2025, 9:52 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen closed T380856: Reduce `refine_to_hive_hourly` airflow task number as Resolved.

We have already merged 2 features to improve on that:

  • all the preparation tasks are gone
  • all evolve+refine are now a single task
Dec 8 2025, 9:26 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Antoine_Quhen added a subtask for T392668: refine_to_hive dag optimizations: T375064: Move more of refine_hive_hourly dag logic into RefineConfiguration.
Dec 8 2025, 9:23 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen added a parent task for T375064: Move more of refine_hive_hourly dag logic into RefineConfiguration: T392668: refine_to_hive dag optimizations.
Dec 8 2025, 9:23 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Antoine_Quhen renamed T392668: refine_to_hive dag optimizations from Refine to Hive with Airflow – Kubernetes Resource Optimization to refine_to_hive dag optimizations.
Dec 8 2025, 9:22 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen moved T392668: refine_to_hive dag optimizations from Blocked/Paused to In progress on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Dec 8 2025, 9:16 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Dec 4 2025

Antoine_Quhen moved T410285: SDS 1.3.6 SPUR bot detection - Productionize SPUR datasets import from In Review to Done on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Dec 4 2025, 5:40 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Antoine_Quhen moved T408210: SDS 1.3.6 Prepare dataset of IPs crossing Spur, legacy bot detection and HAProxy signals from In Review to Done on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Dec 4 2025, 5:40 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Nov 24 2025

Antoine_Quhen moved T410289: Develop HQL Scripts for Creating Global Editor Metrics Cassandra Tables in Hive. from In progress to Done on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Nov 24 2025, 5:16 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Nov 17 2025

Antoine_Quhen moved T410285: SDS 1.3.6 SPUR bot detection - Productionize SPUR datasets import from Next Up to In progress on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.

Let's create DAGs to import datasets from spur.us:

  • anonymous+residential
  • dc
  • geoips
Nov 17 2025, 4:35 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Antoine_Quhen created T410285: SDS 1.3.6 SPUR bot detection - Productionize SPUR datasets import.
Nov 17 2025, 4:33 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Nov 14 2025

Antoine_Quhen moved T407103: SDS 1.3.6 SPUR bot detection analysis from In Review to Blocked/Paused on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.

Waiting for client-side signals before further Spur.us dataset evaluations.

Nov 14 2025, 9:04 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen updated Other Assignee for T407103: SDS 1.3.6 SPUR bot detection analysis, added: Hghani.
Nov 14 2025, 9:01 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Nov 3 2025

Antoine_Quhen updated Other Assignee for T307040: Propagate field descriptions from event schemas to Hive event tables and into DataHub, added: Antoine_Quhen.
Nov 3 2025, 4:13 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review, Product-Analytics

Oct 27 2025

Antoine_Quhen moved T400632: Missing/inconsistent page_redirect_target field for redirects in Mediawiki content current v1 dumps from Next Up to In progress on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Oct 27 2025, 3:09 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), DPE-Mediawiki-Content
Antoine_Quhen added a project to T400632: Missing/inconsistent page_redirect_target field for redirects in Mediawiki content current v1 dumps: Data-Engineering (Q2 FY25/26 October 1st - December 31th).
Oct 27 2025, 3:09 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), DPE-Mediawiki-Content
Antoine_Quhen claimed T400632: Missing/inconsistent page_redirect_target field for redirects in Mediawiki content current v1 dumps.
Oct 27 2025, 3:08 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), DPE-Mediawiki-Content

Oct 24 2025

Antoine_Quhen added a comment to T407104: SDS 1.3.6 Import 2 new datasets of Spurs in a parquet table.

All of September is imported:

desc wmf_traffic.spur_feed ;
col_name        data_type       comment
ip      string  NULL
organization    string  NULL
as      struct<number:bigint,organization:string>       NULL
client  struct<behaviors:array<string>,concentration:struct<city:string,country:string,density:double,geohash:string,skew:bigint,state:string>,count:bigint,countries:bigint,proxies:array<string>,spread:bigint,types:array<string>>  NULL
tunnels array<struct<anonymous:boolean,entries:array<string>,exits:array<string>,operator:string,type:string>>  NULL
services        array<string>   NULL
location        struct<city:string,state:string,country:string> NULL
risks   array<string>   NULL
snapshot        string  NULL
# Partition Information
# col_name      data_type       comment
snapshot        string  NULL
Time taken: 0.239 seconds, Fetched 12 row(s)
Oct 24 2025, 2:42 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Antoine_Quhen added a comment to T408210: SDS 1.3.6 Prepare dataset of IPs crossing Spur, legacy bot detection and HAProxy signals.

All of September is imported.

desc aqu.20251023_bot_ips_study ;
col_name        data_type       comment
ip      string  NULL
pageviews_count bigint  NULL
legacy_reasons  array<string>   NULL
legacy_automated_pageviews_proportion   double  NULL
hap_flagged_request_proportion  double  NULL
spur_risks      array<string>   NULL
spur_proxies    array<string>   NULL
year    int     NULL
month   int     NULL
day     int     NULL
# Partition Information
# col_name      data_type       comment
year    int     NULL
month   int     NULL
day     int     NULL
Time taken: 0.147 seconds, Fetched 15 row(s)
Oct 24 2025, 2:40 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Antoine_Quhen renamed T407103: SDS 1.3.6 SPUR bot detection analysis from SDS 1.3.6 First analysis review to SDS 1.3.6 SPUR bot detection analysis.
Oct 24 2025, 2:12 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen added a subtask for T407103: SDS 1.3.6 SPUR bot detection analysis: T408210: SDS 1.3.6 Prepare dataset of IPs crossing Spur, legacy bot detection and HAProxy signals.
Oct 24 2025, 2:11 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen added a parent task for T408210: SDS 1.3.6 Prepare dataset of IPs crossing Spur, legacy bot detection and HAProxy signals: T407103: SDS 1.3.6 SPUR bot detection analysis.
Oct 24 2025, 2:11 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Antoine_Quhen changed the status of T408210: SDS 1.3.6 Prepare dataset of IPs crossing Spur, legacy bot detection and HAProxy signals from Open to In Progress.
Oct 24 2025, 2:10 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Antoine_Quhen created T408210: SDS 1.3.6 Prepare dataset of IPs crossing Spur, legacy bot detection and HAProxy signals.
Oct 24 2025, 2:10 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Antoine_Quhen closed T407104: SDS 1.3.6 Import 2 new datasets of Spurs in a parquet table, a subtask of T407103: SDS 1.3.6 SPUR bot detection analysis, as Resolved.
Oct 24 2025, 1:01 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen closed T407104: SDS 1.3.6 Import 2 new datasets of Spurs in a parquet table as Resolved.
Oct 24 2025, 1:01 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Antoine_Quhen added a parent task for T407104: SDS 1.3.6 Import 2 new datasets of Spurs in a parquet table: T407103: SDS 1.3.6 SPUR bot detection analysis.
Oct 24 2025, 1:00 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Antoine_Quhen added a subtask for T407103: SDS 1.3.6 SPUR bot detection analysis: T407104: SDS 1.3.6 Import 2 new datasets of Spurs in a parquet table.
Oct 24 2025, 1:00 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Oct 14 2025

Antoine_Quhen closed T407103: SDS 1.3.6 SPUR bot detection analysis as Resolved.

Awesome study. Thanks for adding the split by countries.

Oct 14 2025, 11:21 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Antoine_Quhen changed the status of T407104: SDS 1.3.6 Import 2 new datasets of Spurs in a parquet table from Open to In Progress.
Oct 14 2025, 10:16 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Antoine_Quhen added a comment to T407104: SDS 1.3.6 Import 2 new datasets of Spurs in a parquet table.

Here is the new table: wmf_traffic.spur_feed, partitioned by snapshot, with each partition split into 8 parquet files. The partitions are views of the Spur anonymous-residential feed (all bad actors):
https://docs.spur.us/feeds/types?id=custom-feeds&utm_source=chatgpt.com#anonymous-residential-feed

snapshot        count(1)
20250430        56142741
20250530        57991550
20250918        60518449
Oct 14 2025, 10:14 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Oct 13 2025

Antoine_Quhen added a comment to T405944: Rewrite wmf_content.mediawiki_content_*_v1 tables with a new column for origin_rev_id.

Done. That took ~3.5h and ~1h respectively.

Oct 13 2025, 4:19 PM · Patch-For-Review, Data-Engineering (Q2 FY25/26 October 1st - December 31th), DPE-Mediawiki-Content
Antoine_Quhen edited projects for T407104: SDS 1.3.6 Import 2 new datasets of Spurs in a parquet table, added: Data-Engineering (Q2 FY25/26 October 1st - December 31th); removed Data-Engineering.
Oct 13 2025, 3:16 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Antoine_Quhen moved T407103: SDS 1.3.6 SPUR bot detection analysis from Next Up to In progress on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Oct 13 2025, 10:01 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)