To develop the Superset dashboards, we need some scraper data, ideally several months of it. To unblock that work, we will do the following:
- Make sure we have write access to the Hive wmde schema.
- Use the table create script from T412019 ([Epic] Schedule scraper and aggregations as an Airflow job) to add the wiki_page_cite_references_monthly table to this schema.
- Insert some historical data, e.g. from https://docs.google.com/spreadsheets/d/1w1WE8sGfZfIt6gJEY_9wAoxJoYl7-NnCWrSion_CMSs/edit?gid=84389259#gid=84389259, so we have a sample to work with.
- Use data from the aggregation script https://gitlab.wikimedia.org/repos/wmde/analytics/-/merge_requests/20 as an example.
Outcome
I ran the aggregation script three times to create "mock" data with three datapoints. The data can now be queried from Superset using the presto_analytics_iceberg database and the wmde_fisch schema.
e.g.
SELECT * FROM "wmde_fisch"."wiki_page_cite_references_monthly"
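For dashboard work it may help to select and sort specific columns rather than use `SELECT *`. The column names below are assumptions based on the table name only; check the actual table create script from T412019 before using them in a chart:

```sql
-- Hypothetical sketch: column names (month, wiki_db, cite_references_count)
-- are assumed, not taken from the actual T412019 table definition.
SELECT
  month,                  -- assumed month/partition column
  wiki_db,                -- assumed wiki identifier column
  cite_references_count   -- assumed aggregated count column
FROM "wmde_fisch"."wiki_page_cite_references_monthly"
ORDER BY month DESC
```

With only three mock datapoints, any chart built on this will be a placeholder until the Airflow job from T412019 backfills real monthly data.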