Data processing pipelines (Airflow) to be created for the following metrics:
| Metric | Visualization type | Update frequency |
|---|---|---|
| Distinct participants registered in period | Line graph | Daily |
| Distinct participants registered in period with accounts created in the previous 30 days | Line graph | Daily |
| Distinct new events in period | Line graph | Daily |
| Distinct participants registered | Line graph | Monthly |
| Distinct organizers organizing events created in period | Line graph | Monthly |
- Final destination for pipeline code: https://gitlab.wikimedia.org/repos/product-analytics/data-pipelines
- Final destination for DAG files: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/tree/main/analytics_product/dags?ref_type=heads
Notes:
- Process reference tickets: T367953, T362612, T362615#10106942
- bearloga's etl-guide
- KCV Notes: MariaDB-based Airflow DAGs
- MariaDB Airflow pipelines
- simple airflow compilation
- Query: Distinct participants registered in period
- Query: Distinct organizers that joined as organizers in period
- Query: New events in period
- Query: Number of new accounts created in period; MariaDB query is drafted
- Write HQL files that create the daily and monthly tables
- Update the files to ensure that:
  - Tables that are aggregates are suffixed with the period over which they were aggregated (e.g. daily, monthly)
  - Columns: bigint (64-bit integers) should be the default for any integer columns
  - Columns: wiki_id should be used for internal references to a particular MediaWiki database
  - Columns: month should always be an integer in the range 1-12
  - Columns: year and month should be separate integer columns
- Draft DAG files
- Draft the job logic and environment for querying MariaDB (wikishared & CentralAuth)
- Write unit tests
- Test files
- End-to-end testing
- Open MRs into the appropriate final locations
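The period-based queries listed above could be sketched as parameterized query builders. This is a minimal sketch only: the table and column names (ce_participants, cep_user_id, cep_registered_at) are placeholders for the CampaignEvents schema in wikishared and must be verified against the extension's actual schema.

```python
from datetime import date

def distinct_participants_query(start: date, end: date):
    """Build a parameterized MariaDB query counting distinct participants
    registered in the half-open period [start, end).

    Table/column names here are hypothetical placeholders, not the
    confirmed CampaignEvents schema.
    """
    sql = (
        "SELECT COUNT(DISTINCT cep_user_id) AS distinct_participants "
        "FROM ce_participants "
        "WHERE cep_registered_at >= %s AND cep_registered_at < %s"
    )
    # MediaWiki stores timestamps as 14-character YYYYMMDDHHMMSS strings.
    params = (f"{start:%Y%m%d}000000", f"{end:%Y%m%d}000000")
    return sql, params
```

Keeping the SQL parameterized (rather than interpolating dates into the string) lets the same builder serve both the daily and monthly DAG runs.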
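The column conventions above (bigint integers, wiki_id as a database reference, month in 1-12) could be enforced with a small validation helper in the pipeline code. The row layout (a dict of column name to value) is an assumption for illustration, not an existing pipeline API.

```python
def validate_row(row: dict) -> None:
    """Raise ValueError if a result row violates the column conventions.

    The dict-of-columns shape is hypothetical; adapt to however the
    pipeline actually represents rows.
    """
    if not isinstance(row.get("year"), int) or not isinstance(row.get("month"), int):
        raise ValueError("year and month must be integer columns")
    if not 1 <= row["month"] <= 12:
        raise ValueError("month must be an integer in the range 1-12")
    # wiki_id is an internal reference to a particular MediaWiki
    # database (e.g. "enwiki"), so it should be a string, not a number.
    if "wiki_id" in row and not isinstance(row["wiki_id"], str):
        raise ValueError("wiki_id must be a MediaWiki database name string")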
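For drafting the job logic that queries MariaDB, one possible shape is a pure helper that computes period bounds plus a thin execution wrapper. This is a sketch under stated assumptions: the replica host name and port are placeholders, and the pymysql dependency and DAG wiring are assumptions rather than what airflow-dags actually uses.

```python
def month_bounds(year: int, month: int) -> tuple:
    """Return (start, end) MediaWiki timestamps for a monthly period."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in the range 1-12")
    start = f"{year:04d}{month:02d}01000000"
    ny, nm = (year + 1, 1) if month == 12 else (year, month + 1)
    return start, f"{ny:04d}{nm:02d}01000000"

def run_metric_query(sql: str, params: tuple):
    """Execute a metric query against the wikishared replica.

    Host, port, and credential handling are placeholders; real values
    come from the analytics MariaDB replica configuration.
    """
    import pymysql  # deferred so the module imports without the driver
    conn = pymysql.connect(
        host="x1-analytics-replica.eqiad.wmnet",  # placeholder host
        port=3306,                                # placeholder port
        database="wikishared",
        read_default_file="~/.my.cnf",
    )
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()
    finally:
        conn.close()
```

Computing the period bounds in a pure function keeps that logic unit-testable without a database connection, which also helps with the unit-test item above.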