[Analytics] [Request] DAG that calculates usage of Wikidata changes preference
Open, Stalled, Needs Triage, Public

Description

Wikidata Analytics Request

This task was generated using the Wikidata Analytics request form. Please use the task template linked on our project page to create tasks for the team. Thank you!

Purpose

Please provide as much context as possible as well as what the produced insights or services will be used for.

NOTE: This task is blocked by T360296, where the MariaDB to data lake process will be developed.
NOTE: This task resolves T391813.

In T390888 ([Analytics] Tracking How Many People Turn On The Preference for Wikidata Changes in their Wikipedias over time?), we derived the baseline usage of the Wikidata changes preference across Wikipedias. It would be great to have an automated process that produces these numbers on a regular basis.

Specific Results

Please detail the specific results that the task should deliver.

An Airflow DAG that derives the distinct and per-Wikipedia stats of the usage of this preference on a monthly basis.
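
To make the two requested statistics concrete, here is a minimal Python sketch of the aggregation, not the production query. It assumes the preference rows have already been flattened into (wiki, user, preference key, value) tuples. The row shape, the preference keys `rcshowwikidata` and `wlshowwikibase`, the enabled value `"1"`, and the use of the user name to match accounts across wikis are all assumptions made for illustration; none of them are confirmed by this task.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, NamedTuple


class PreferenceRow(NamedTuple):
    """One flattened preference row (hypothetical shape, for illustration only)."""

    wiki: str       # wiki identifier, e.g. "dewiki" (assumed)
    user_name: str  # assumed to identify the same account across wikis
    prop: str       # preference key
    value: str      # stored preference value


# Assumed preference keys; the task itself does not name them.
WIKIDATA_PREFS = frozenset({"rcshowwikidata", "wlshowwikibase"})


def summarize(rows: Iterable[PreferenceRow]) -> tuple[dict[str, int], int]:
    """Return (users per wiki, distinct users overall) with the preference enabled."""
    users_by_wiki: dict[str, set[str]] = defaultdict(set)
    for row in rows:
        # Assumption: an enabled preference is stored as the string "1".
        if row.prop in WIKIDATA_PREFS and row.value == "1":
            users_by_wiki[row.wiki].add(row.user_name)

    # Per-Wikipedia stat: each user counts once per wiki, even with both keys set.
    per_wiki = {wiki: len(users) for wiki, users in sorted(users_by_wiki.items())}
    # Distinct stat: each user counts once, however many wikis they enabled it on.
    distinct_total = len(set().union(*users_by_wiki.values()))
    return per_wiki, distinct_total
```

In the actual DAG this logic would most likely live in a SQL query run once per month, with the result written to the data lake table; the sketch only shows how the per-Wikipedia and distinct counts differ.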

Desired Outputs

Please list the desired outputs of this task.

  • Queries to calculate the statistics and create a table in the data lake
  • A process to move the data from MariaDB to the data lake
  • A DAG to run the above jobs at the given interval

Deadline

Please make the time sensitivity of this request clear with a date that it should be completed by. If there is no specific date, then the task will be triaged based on its priority.

DD.MM.YYYY


Information below this point is filled out by the task assignee.

Assignee Planning

Sub Tasks

A full breakdown of the steps to complete this task.

  • Use MariaDB process developed in T360296
  • Write the Iceberg CREATE TABLE script
  • Create an Iceberg table in Hive/HDFS within the wmde namespace
  • Convert the monthly stats query into a production Airflow query
  • Write scripts to generate test tables and test queries
  • Write DAG to run job query
  • Write DAG tests
  • Run tests on the process as far as possible within time limitations
  • Deploy DAG

Estimation

Estimate:
Actual:

Data

The tables that will be referenced in this task.

  • user_properties within MariaDB

Notes

Things that came up during the completion of this task, questions to be answered and follow up tasks.

  • Note

Event Timeline

AndrewTavis_WMDE changed the task status from Open to Stalled. Apr 9 2025, 2:52 PM

Stalling this task as we first need to figure out how to get MariaDB data into the data lake, and this will likely be done in T360296 :)

AndrewTavis_WMDE changed the task status from Stalled to Open. Sep 17 2025, 1:51 PM
AndrewTavis_WMDE changed the task status from Open to Stalled. Nov 18 2025, 12:31 PM

Stalling this as we're investigating in T323456 whether the property can be brought into the Data Lake.