As an initial version of the sample content dataset, we want a dataset available in Superset that can answer potential content-related questions around the US election.
- Last 12 months of data, daily granularity
- Top 500 pages by page views over the past year (initially for the top 3 wikis by traffic from the US: enwiki, zhwiki, eswiki)
- Topics as an array (from joining with isaacj.article_topics_outlinks_2020_09 on wiki_db and pageid)
- Page views (by access_method & agent_type)
- Log-transformed proportion of total views (easier to store than a tiny decimal)
- Edit count (by user_is_anonymous & user_is_bot)
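The log-transformed proportion field could be computed roughly as in the sketch below. The base of the logarithm is not specified here, so the natural log is an assumption, and the function name `log_view_share` is hypothetical.

```python
import math


def log_view_share(page_views: int, total_views: int) -> float:
    """Return ln(page_views / total_views).

    Assumption: natural log; the dataset may use a different base.
    Computed as a difference of logs, which equals the log of the ratio.
    """
    if page_views <= 0 or total_views <= 0:
        raise ValueError("page_views and total_views must be positive")
    return math.log(page_views) - math.log(total_views)


def view_share(log_share: float) -> float:
    """Invert the transform to recover the plain proportion."""
    return math.exp(log_share)
```

A page with very few views relative to the total gets a negative value of moderate magnitude instead of a tiny decimal, and `view_share` recovers the original proportion when needed.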
We can then write Presto queries to expose it as explorable datasources, and create a dashboard in Superset with the following information:
- Top articles daily/monthly, with their pageviews, edits and topics, by wiki
- Pageviews and edits for politics- and society-related topics, by wiki
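The two dashboard views above can be sketched in plain Python to make the intended aggregations concrete. The real implementation would be Presto queries. The row layout, the function names, the topic label strings and the sample values below are all hypothetical placeholders, not actual data or the actual schema.

```python
from collections import defaultdict

# Made-up rows for illustration only; the field names are assumptions.
SAMPLE_ROWS = [
    {"wiki_db": "enwiki", "date": "2020-10-01", "page_id": 1,
     "page_views": 900, "edits": 12,
     "topics": ["History_and_Society.Politics_and_government"]},
    {"wiki_db": "enwiki", "date": "2020-10-01", "page_id": 2,
     "page_views": 1500, "edits": 3,
     "topics": ["Culture.Sports"]},
    {"wiki_db": "eswiki", "date": "2020-10-01", "page_id": 3,
     "page_views": 400, "edits": 5,
     "topics": ["History_and_Society.Politics_and_government"]},
]


def top_articles(rows, n=10):
    """Return the n most-viewed rows for each (wiki_db, date) pair."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["wiki_db"], row["date"])].append(row)
    return {
        key: sorted(group, key=lambda r: r["page_views"], reverse=True)[:n]
        for key, group in groups.items()
    }


def totals_for_topic(rows, topic_prefix):
    """Sum page views and edits per wiki over rows that have at least
    one topic starting with topic_prefix."""
    totals = defaultdict(lambda: {"page_views": 0, "edits": 0})
    for row in rows:
        if any(topic.startswith(topic_prefix) for topic in row["topics"]):
            totals[row["wiki_db"]]["page_views"] += row["page_views"]
            totals[row["wiki_db"]]["edits"] += row["edits"]
    return dict(totals)
```

Because topics are stored as an array, a page counts toward a topic if any element matches, so totals summed across several topics can count the same page more than once.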