Page MenuHomePhabricator

MVP staging topic dataset for use in Superset
Closed, ResolvedPublic

Description

Deliverable: a table in Hive or in MySQL "staging" database which has page views, edit counts, and topics for pages

The initial version of the dataset (with the initial goal of being used to answer questions around US Election) should have:

  • Last 12 months of data, daily granularity
  • Top 500 pages by page views over the past year (initially top 3 wikis by traffic from US: enwiki, zhwiki, eswiki)
  • Topics as an array (from joining with isaacj.article_topics_outlinks_2020_09 on wiki_db and pageid)
  • Main & Sub-topics as arrays (from joining with cchen.topic_component)
  • Page views (by access_method & agent_type)
  • Log-transformed proportion of total views (easier to store than a tiny decimal)
  • Edit count (by user_is_anonymous & user_is_bot)

We can then write Presto queries to make it explorable datasources in Superset

Event Timeline

@cchen: For pageviews and edit counts, what do you think about storing them in separate columns like views_desktop_user, views_mobileweb_spider, views_mobileapp_automated (and other combinations), edits_anon, edits_registered_user, edits_registered_bot?

I think it would be easier to query than if we had complex structures inside just two views & edits columns

mpopov updated the task description. (Show Details)
kzimmerman subscribed.

Prioritizing as high since we'd like to have data available for people to explore around the US Election

Decisions from my chat with @cchen:

  • Keep the dimensions from edits & views datasets as dimensions, yielding multiple rows per page per day (different combinations) as that will make addition of more dimensions easier in the future
  • Start with 1 month of data and 100 pages (since we would have more than 1 row per page per day) to see how Presto handles it
  • Increase volume of data (months & # of pages) iteratively

Decisions from my chat with @cchen:

  • Keep the dimensions from edits & views datasets as dimensions, yielding multiple rows per page per day (different combinations) as that will make addition of more dimensions easier in the future
  • Start with 1 month of data and 100 pages (since we would have more than 1 row per page per day) to see how Presto handles it
  • Increase volume of data (months & # of pages) iteratively

Oh wait, that doesn't make any sense. Edit counts & view counts are two separate metrics with completely separate dimensions. We have to do this:

separate columns like views_desktop_user, views_mobileweb_spider, views_mobileapp_automated (and other combinations), edits_anon, edits_registered_user, edits_registered_bot

IF we want a single table. Alternatively we could have two tables (view counts & edit counts).

@cchen please review the queries at https://github.com/wikimedia-research/CDA-MVP (the top viewed query in this notebook and the two queries which populate the bearloga.cda_views and bearloga.cda_edits tables)

I've created a demo Superset dashboard (with Presto queries included) to illustrate how one might work with those two tables. I've added you as an owner on it so you can use that for writing other queries and making other charts as a way of seeing if the specification is missing anything or should be different.

@mpopov the queries in notebook and Superset looks good!

I created a top viewed pages dashboard with the dataset. I am still playing with the topic related dimensions and will add more topic related charts after.