
[Analytics] Add the published datasets directories as a target for the REST API Airflow jobs
Open, Needs Triage · Public

Description

In T341330: [Analytics] Airflow implementation of unique ips accessing Wikidata's REST API metrics, WMDE Analytics created its first Airflow DAG and the jobs it needs. As a requirement for T360298: [Analytics] Public dashboard pilot, another step is needed to get the data onto a publicly available dashboard: we need to add the published datasets directories as a target of the jobs, so that the data is saved to HDFS in TSV format in a place where it can be ingested by dashboarding software like Turnilo.
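For illustration, below is a minimal sketch of what such an export task could look like in a standalone Airflow DAG. It is an assumption-laden example rather than the actual WMDE implementation: the DAG id, source table, target HDFS directory, schedule, and the use of the spark3-sql and hdfs CLIs on the worker are all placeholders.

```python
"""Hypothetical sketch: export a metrics table as TSV into a published
datasets directory on HDFS so a dashboarding tool can ingest it.

All names below (DAG id, source table, target directory) are
illustrative placeholders, not the values used by WMDE Analytics.
Assumes Airflow 2.4+ and that the spark3-sql and hdfs CLIs are
available on the worker, as on WMF analytics clients.
"""
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

SOURCE_TABLE = "wmde_analytics.wikidata_rest_api_unique_ips"      # placeholder
PUBLISHED_HDFS_DIR = "/tmp/published/datasets/wikidata_rest_api"  # placeholder

with DAG(
    dag_id="wikidata_rest_api_metrics_publish",
    start_date=datetime(2024, 4, 1),
    schedule="@monthly",
    catchup=False,
) as dag:
    # spark-sql's -e output is tab-separated, so piping it straight
    # into `hdfs dfs -put` yields a TSV file in the target directory.
    export_to_published_datasets = BashOperator(
        task_id="export_to_published_datasets",
        bash_command=(
            f'spark3-sql --master yarn -e "SELECT * FROM {SOURCE_TABLE}" '
            f"| hdfs dfs -put -f - {PUBLISHED_HDFS_DIR}/unique_ips.tsv"
        ),
    )
```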

Note on this: if we do publish the data, we need to check the Data Publication Guidelines and make sure that it's Tier 3 Low Risk, or Tier 2 and sanitized. The publication also needs to be logged via this form.

Event Timeline

Note that this task depends on whether a standardized system is created that would remove the need for the published datasets. Such a system is discussed in T361214: Public dashboard process.

Moved this to In Progress: I'm adding the job that exports everything to the published datasets folder to the DAG while working on the same for T362849.
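For context, wiring such an export step into the existing DAG would, in the simplest case, just mean chaining it after the task that computes the metrics. The task names below are illustrative placeholders, not the actual task IDs in the WMDE DAG.

```python
# Hypothetical ordering within the DAG: compute the metrics first, then
# copy the result into the published datasets directory.
compute_unique_ips_metrics >> export_to_published_datasets
```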