
[Analytics] Add the published datasets directories as a target for the REST API Airflow jobs
Closed, Resolved · Public

Description

In T341330: [Analytics] Airflow implementation of unique IPs accessing Wikidata's REST API metrics, WMDE Analytics created its first Airflow DAG and the jobs it needs. As a requirement for T360298: [Analytics] Public dashboard pilot, a further step is needed to get the data onto a publicly available dashboard: we need to add the published datasets directories as a target of the jobs, so that the data is saved to HDFS in TSV format in a place where it can be ingested by dashboarding software like Turnilo.
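The TSV export step could be sketched roughly as follows. This is a hypothetical, self-contained helper (the function name, fields, and sample values are illustrative, and the actual job would run inside the Airflow DAG and write to HDFS rather than returning a string):

```python
import csv
import io


def rows_to_tsv(rows, fieldnames):
    """Serialize query result rows (a list of dicts) into a TSV string,
    i.e. the tab-separated format that tools like Turnilo can ingest."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buf.getvalue()


# Hypothetical sample rows for a daily unique-IPs metric.
rows = [
    {"date": "2024-06-01", "unique_ips": 1234},
    {"date": "2024-06-02", "unique_ips": 1301},
]
tsv = rows_to_tsv(rows, ["date", "unique_ips"])
```

In the real DAG the resulting rows would be written to a file under the published datasets directory instead of being kept in memory.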

Note on this: if we do publish the data, we need to check the Data Publication Guidelines and make sure that it's Tier 3 (Low Risk), or Tier 2 and sanitized. It also needs to be logged via this form.

Event Timeline

Note that this task depends on whether a standardized system is created that would make the published datasets unnecessary. Such a system is discussed in T361214: Data Platform - Public dashboard support.

Moved this to In Progress, as I'm adding the job that exports everything to the published datasets folder to the DAG while working on the same for T362849.

Note that MR#700 has been opened with the work for this :)

AndrewTavis_WMDE changed the task status from Stalled to Open.Jun 6 2024, 12:31 PM

Unstalled as the plan for the data export has been approved in T365699 :)