Deliverable: a table in Hive or in the MySQL "staging" database that holds page views, edit counts, and topics for pages.
The initial version of the dataset (whose first goal is to help answer questions about the US election) should have:
- Last 12 months of data, daily granularity
- Top 500 pages by page views over the past year (initially for the top 3 wikis by traffic from the US: enwiki, zhwiki, eswiki)
- Topics as an array (from joining with isaacj.article_topics_outlinks_2020_09 on wiki_db and pageid)
- Main & Sub-topics as arrays (from joining with cchen.topic_component)
- Page views (by access_method & agent_type)
- Log-transformed proportion of total views (easier to store than a tiny decimal)
- Edit count (by user_is_anonymous & user_is_bot)
We can then write Presto queries to expose the table as explorable data sources in Superset.
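
As an illustration of the log-transformed proportion column, here is a minimal Python sketch. It assumes a natural logarithm and assumes that "total views" means the total for the same wiki and day; neither choice is fixed by the spec above. The function name `log_proportion` and the numbers are hypothetical.

```python
import math
from typing import Optional


def log_proportion(page_views: int, total_views: int) -> Optional[float]:
    """Return ln(page_views / total_views), or None when it is undefined.

    Assumption: natural log. The spec does not name a base.
    """
    if page_views <= 0 or total_views <= 0:
        # log(0) is undefined; None can be stored as NULL in the table.
        return None
    # Subtracting logs avoids forming the tiny ratio first.
    return math.log(page_views) - math.log(total_views)


# Hypothetical numbers: 1,200 views of a page out of 250,000,000 total views.
value = log_proportion(1_200, 250_000_000)
# Exponentiating recovers the raw proportion (about 4.8e-06).
recovered = math.exp(value)
```

Storing the log keeps the values in a compact range (roughly -12 here) rather than as very small decimals, and the raw proportion can be recovered at query time by exponentiating.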