MVP staging topic dataset for use in Superset
Closed, ResolvedPublic

Assigned To

Authored By

	mpopov
	Oct 28 2020, 9:15 PM

Description

Deliverable: a table in Hive or in MySQL "staging" database which has page views, edit counts, and topics for pages

The initial version of the dataset (whose first goal is to answer questions around the US Election) should have:

Last 12 months of data, daily granularity
Top 500 pages by page views over the past year (initially top 3 wikis by traffic from US: enwiki, zhwiki, eswiki)
Topics as an array (from joining with isaacj.article_topics_outlinks_2020_09 on wiki_db and pageid)
Main & Sub-topics as arrays (from joining with cchen.topic_component)
Page views (by access_method & agent_type)
Log-transformed proportion of total views (easier to store than a tiny decimal)
Edit count (by user_is_anonymous & user_is_bot)
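
A minimal sketch of the log-transformed proportion item above. The natural logarithm and the function name log_proportion are assumptions; the task does not say which base or which total (per wiki, per day) is used.

```python
import math


def log_proportion(page_views: int, total_views: int) -> float:
    """Return the natural log of a page's share of total views.

    Assumption: base e, and both counts are positive. A page with zero
    views has no defined logarithm, so it is rejected rather than guessed.
    """
    if page_views <= 0 or total_views <= 0:
        raise ValueError("views must be positive to take a logarithm")
    return math.log(page_views / total_views)


# Hypothetical numbers: 1,500 views for a page out of 250 million total views.
stored_value = log_proportion(1_500, 250_000_000)

# The original proportion can be recovered with exp() at query time.
recovered_share = math.exp(stored_value)
```

Storing the log keeps the value at a moderate magnitude (roughly -12 here) instead of a tiny decimal such as 0.000006.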

We can then write Presto queries to expose it as explorable datasources in Superset.

Related Objects

Status	Subtype	Assigned	Task
Declined		None	T298924 Superset - Product Analytics Canonical Dashboards, Reports, and Datasets
Open		kzimmerman	T234701 "Content" equivalent of pageviews daily or edits_hourly available to use in Turnilo and Superset
Duplicate		Mayakp.wiki	T255496 Identify stakeholders, gather requirements, and determine maintenance and ownership responsibilities for the content dataset
Open		None	T257636 Technical Requirements for Content dataset
Resolved		cchen	T258985 Initial Sample content dataset to explore and iterate
Resolved	Spike	cchen	T267364 Content dashboard for use in Superset around US Election
Resolved		cchen	T266714 MVP staging topic dataset for use in Superset

Event Timeline

mpopov created this task. Oct 28 2020, 9:15 PM

Restricted Application added a subscriber: Stang. Oct 28 2020, 9:15 PM

@cchen: For pageviews and edit counts, what do you think about storing them in separate columns like views_desktop_user, views_mobileweb_spider, views_mobileapp_automated (and other combinations), edits_anon, edits_registered_user, edits_registered_bot?

I think it would be easier to query than if we had complex structures inside just two views & edits columns
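
To make the proposal concrete, here is a rough sketch of pivoting long-format view counts into one column per combination. The input row layout, the function name pivot_views, and the way column names are derived (spaces stripped from access_method) are assumptions for illustration, not the actual pipeline.

```python
from collections import defaultdict


def pivot_views(rows):
    """Pivot (page_id, date, access_method, agent_type, view_count) rows
    into one dict of views_<access>_<agent> columns per (page_id, date)."""
    wide = defaultdict(lambda: defaultdict(int))
    for page_id, date, access_method, agent_type, view_count in rows:
        # Assumed naming rule: "mobile web" + "spider" -> views_mobileweb_spider
        column = f"views_{access_method.replace(' ', '')}_{agent_type}"
        wide[(page_id, date)][column] += view_count
    return {key: dict(columns) for key, columns in wide.items()}


# Hypothetical sample rows for a single page and day.
sample_rows = [
    (1, "2020-11-03", "desktop", "user", 100),
    (1, "2020-11-03", "mobile web", "spider", 5),
    (1, "2020-11-03", "mobile app", "automated", 2),
]
wide_rows = pivot_views(sample_rows)
```

With this layout each page has exactly one row per day, at the cost of adding a new column whenever a new dimension value appears.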

Shizhao added a project: Chinese-Sites. Oct 29 2020, 1:50 AM

Shizhao moved this task from Backlog to Research on the Chinese-Sites board.

mpopov updated the task description. Oct 29 2020, 3:25 PM

mpopov updated the task description.

Prioritizing as high since we'd like to have data available for people to explore around the US Election

Decisions from my chat with @cchen:

Keep the dimensions from edits & views datasets as dimensions, yielding multiple rows per page per day (different combinations) as that will make addition of more dimensions easier in the future
Start with 1 month of data and 100 pages (since we would have more than 1 row per page per day) to see how Presto handles it
Increase volume of data (months & # of pages) iteratively

mpopov updated the task description. Nov 5 2020, 7:44 PM

mpopov added parent tasks: T266580: Support Connie with sample content dataset, T267364: Content dashboard for use in Superset around US Election. Nov 5 2020, 7:46 PM

mpopov removed a parent task: T266580: Support Connie with sample content dataset.

mpopov merged a task: T266580: Support Connie with sample content dataset.

mpopov added subscribers: Mayakp.wiki, Aklapper.

kzimmerman mentioned this in T260706: Update/repair Search A/B Test autoreporter. Nov 5 2020, 9:37 PM

mpopov updated the task description. Nov 6 2020, 5:04 PM

In T266714#6607288, @mpopov wrote:

Decisions from my chat with @cchen:

Keep the dimensions from edits & views datasets as dimensions, yielding multiple rows per page per day (different combinations) as that will make addition of more dimensions easier in the future

Start with 1 month of data and 100 pages (since we would have more than 1 row per page per day) to see how Presto handles it

Increase volume of data (months & # of pages) iteratively

Oh wait, that doesn't make any sense. Edit counts & view counts are two separate metrics with completely separate dimensions. We have to do this:

separate columns like views_desktop_user, views_mobileweb_spider, views_mobileapp_automated (and other combinations), edits_anon, edits_registered_user, edits_registered_bot

IF we want a single table. Alternatively we could have two tables (view counts & edit counts).
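
A small sketch of the two-table alternative, using an in-memory SQLite database as a stand-in for Presto. The table names cda_views and cda_edits follow the tables mentioned later in this task, but every column name and all sample rows are assumptions. Each metric is aggregated over its own dimensions first and only then joined, so the unrelated dimensions never multiply each other's rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cda_views (wiki_db TEXT, page_id INTEGER, date TEXT,
                        access_method TEXT, agent_type TEXT, view_count INTEGER);
CREATE TABLE cda_edits (wiki_db TEXT, page_id INTEGER, date TEXT,
                        user_is_anonymous INTEGER, user_is_bot INTEGER, edit_count INTEGER);
""")
# Hypothetical rows: page 1 has views and edits, page 2 has views only.
conn.executemany("INSERT INTO cda_views VALUES (?, ?, ?, ?, ?, ?)", [
    ("enwiki", 1, "2020-11-03", "desktop", "user", 100),
    ("enwiki", 1, "2020-11-03", "mobile web", "user", 200),
    ("enwiki", 1, "2020-11-03", "desktop", "spider", 5),
    ("enwiki", 2, "2020-11-03", "desktop", "user", 50),
])
conn.executemany("INSERT INTO cda_edits VALUES (?, ?, ?, ?, ?, ?)", [
    ("enwiki", 1, "2020-11-03", 1, 0, 3),
    ("enwiki", 1, "2020-11-03", 0, 0, 4),
])

# Aggregate each metric separately, then join on the shared page/day key.
daily_totals = conn.execute("""
WITH v AS (
    SELECT wiki_db, page_id, date, SUM(view_count) AS views
    FROM cda_views GROUP BY wiki_db, page_id, date
),
e AS (
    SELECT wiki_db, page_id, date, SUM(edit_count) AS edits
    FROM cda_edits GROUP BY wiki_db, page_id, date
)
SELECT v.wiki_db, v.page_id, v.date, v.views, COALESCE(e.edits, 0) AS edits
FROM v LEFT JOIN e
  ON v.wiki_db = e.wiki_db AND v.page_id = e.page_id AND v.date = e.date
ORDER BY v.page_id
""").fetchall()
```

Joining the raw tables directly on page and day would instead pair every views row with every edits row and inflate both sums.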

mpopov moved this task from Next 2 weeks to Doing on the Product-Analytics (Kanban) board. Nov 6 2020, 7:22 PM

mpopov moved this task from Doing to Needs Review on the Product-Analytics (Kanban) board. Nov 10 2020, 6:05 PM

@cchen please review the queries at https://github.com/wikimedia-research/CDA-MVP (the top viewed query in this notebook and the two queries which populate the bearloga.cda_views and bearloga.cda_edits tables)

I've created a demo Superset dashboard (with Presto queries included) to illustrate how one might work with those two tables. I've added you as an owner on it so you can use that for writing other queries and making other charts as a way of seeing if the specification is missing anything or should be different.

@mpopov the queries in the notebook and Superset look good!

I created a top viewed pages dashboard with the dataset. I am still exploring the topic-related dimensions and will add more topic-related charts afterwards.

cchen closed this task as Resolved. Nov 25 2020, 9:53 PM

Shizhao moved this task from Research to Closed on the Chinese-Sites board. Nov 26 2020, 2:30 AM

Stang unsubscribed. Nov 13 2021, 8:57 PM
