1. Implement [[ https://www.wikidata.org/wiki/User:Envlh/Denelezh/Schema | new schema ]]:
- transformed KPI table (rename to "metric" or "indicator")
- pre-work: list all the metrics we need and check how they would be stored in "long" format
- evaluate alternatives to Alembic for schema migrations?
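The "long" format above could be sketched as follows; table and column names are assumptions for illustration, not the final schema:

```python
import sqlite3

# Sketch of a long-format metric table: one row per (dump date, metric name,
# dimension values) instead of one column per KPI. All names are placeholders.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE metric (
        dump_date TEXT NOT NULL,   -- Wikidata dump the value was computed from
        name      TEXT NOT NULL,   -- e.g. 'humans_with_gender' (hypothetical)
        wiki      TEXT,            -- optional dimension: project
        gender    TEXT,            -- optional dimension: gender QID
        value     INTEGER NOT NULL,
        PRIMARY KEY (dump_date, name, wiki, gender)
    )
""")
conn.executemany(
    "INSERT INTO metric VALUES (?, ?, ?, ?, ?)",
    [
        ("2020-01-06", "humans_with_gender", "enwiki", "Q6581097", 1200),
        ("2020-01-06", "humans_with_gender", "enwiki", "Q6581072", 300),
    ],
)
# Adding a new KPI is a new 'name' value, not a schema migration.
total = conn.execute(
    "SELECT SUM(value) FROM metric WHERE name = 'humans_with_gender'"
).fetchone()[0]
print(total)  # 1500
```

The upside of the long format is exactly the comment above: new metrics are rows, so the pre-work list of metrics does not have to be complete before the table is created.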
2. WDTK layer:
- Start from denelezh-import, include all current properties
- include the properties needed for WHGI
3. Backfiller
- WHGI from old index files
- Denelezh from old db dumps
- investigate how many old index files / db dumps are still available
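The backfill inventory step might start by recovering dump dates from the old file names; the naming pattern below is hypothetical and needs to be checked against the actual WHGI/Denelezh archives:

```python
import re
from datetime import date
from pathlib import Path

# Hypothetical file name convention for old WHGI index files; adjust the
# regex once the real archive layout is known.
INDEX_RE = re.compile(r"index-(\d{4})-(\d{2})-(\d{2})\.csv$")

def backfill_dates(paths):
    """Return the dump dates recoverable from a list of old files, sorted."""
    dates = []
    for p in paths:
        m = INDEX_RE.search(Path(p).name)
        if m:
            dates.append(date(*map(int, m.groups())))
    return sorted(dates)

print(backfill_dates(["index-2017-01-02.csv", "notes.txt", "index-2016-11-14.csv"]))
# [datetime.date(2016, 11, 14), datetime.date(2017, 1, 2)]
```

Counting the result answers "how many old snapshots can we backfill" before writing any loader.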
4. Orchestration
- Use Airflow, with a DAG:
- allow for both orderings:
  1. Run WDTK → create CSVs → load data into DB → make aggregations
  2. Run WDTK → create CSVs → make aggregations in memory → load data into DB
- allow for parallelized transformations (replace SQL permutations in loop)
- see if Wikimedia Cloud has a Spark cluster already
5. DevOps
- set up a MySQL 8.0 db, configure with Envel
- on a separate or new server on WMF Cloud