Editors dataset in Turnilo / Superset
Closed, Declined · Public

Description

Request Status: New Request
Request Type: project support request
Related OKRs:

Request Title: Editors dataset in Turnilo / Superset

  • Request Description: Turnilo and Superset are valuable tools for data analysts and product managers to measure their work and key results. The pageviews and edits datasets are heavily used, but a critical dataset is missing: editors. Many teams have key results for which the unit of analysis is the user (as opposed to the pageview or the edit). The Growth team's KRs tend to look at editor retention or counts of editors who have been activated, and the Editing team has KRs around the numbers of contributors who use their features. These numbers cannot be produced with our dashboarding tools; instead, the teams rely on data analysts to pull them manually with SQL (a sketch of the kind of one-off query involved follows this list). If a dataset at the editor level were available, product teams would have tremendously more insight into their outcomes. Construction of such a dataset would require choices about which aggregates to attach to the editor and the business rules for computing them. Product managers and data analysts would be able to assist in developing those.
  • Indicate Priority Level: High
  • Main Requestors: Growth, Editing, and Web teams
  • Ideal Delivery Date: August 2022
  • Stakeholders: Marshall Miller
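
For illustration, a minimal sketch of the kind of manual pull analysts run today, assuming the public wmf.mediawiki_history schema (the month, wiki, and metric definition below are hypothetical, not an agreed KR definition):

```
-- Hedged sketch: distinct logged-in editors on one wiki in one month,
-- pulled straight from mediawiki_history. Table and column names follow
-- the public schema; the metric definition is illustrative only.
SELECT COUNT(DISTINCT event_user_id) AS editors
FROM wmf.mediawiki_history
WHERE snapshot = '2022-06'                        -- monthly snapshot partition
  AND wiki_db = 'enwiki'
  AND event_entity = 'revision'
  AND event_type = 'create'
  AND event_user_is_anonymous = false             -- logged-in editors only
  AND substr(event_timestamp, 1, 7) = '2022-06';  -- edits made in June 2022
```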

Request Documentation

| Document Type | Required? | Document/Link |
|---|---|---|
| Related PHAB Tickets | Yes | T230092: Data exploration capabilities in Superset and Turnilo for editing data at “editor” level |
| Product One Pager | Yes | <add link here> |
| Product Requirements Document (PRD) | Yes | <add link here> |
| Product Roadmap | No | <add link here> |
| Product Planning/Business Case | No | <add link here> |
| Product Brief | No | <add link here> |
| Other Links | No | <add links here> |

Event Timeline

Scoping this down to logged-in users for an initial MVP

@EChetty wanted to note that @Mayakp.wiki previously worked with Connie to identify and record user requirements for a previous version of this request (see the referenced epic T230092). I think some of the engineering dependencies were close to a resolution, although work from my team's side was paused when we switched our focus to Session Length in early 2021. (I am sorry that we did not close the loop in the tickets.)

If this work gets underway again, I think it needs analytics support from the data stewardship & requirements side of things.

@EChetty, @Milimetric, @MMiller_WMF, @Iflorez and I met today to discuss the use cases, existing datasets, limitations, and scope of this project.
PA & DE will regroup in the coming weeks to discuss the initial table design.

We had sessions with Data Engineering and Product Analytics where we discussed the design for the table. At a high level, we are going to build a table derived from mediawiki_history and mw_user_history that presents the data in a way that is easier for product managers and analysts to query and dashboard. The table may not be on Druid (because Druid has a few known limitations). We will focus on building it in Hive, accessible via Presto.
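
As a concrete sketch of the idea (the table name, columns, and partitioning below are hypothetical placeholders, not the agreed design):

```
-- Hypothetical Hive DDL for the kind of derived table described above:
-- one row per logged-in editor per wiki per day, rebuilt monthly from
-- mediawiki_history. All names and columns are illustrative.
CREATE TABLE IF NOT EXISTS wmf.editor_activity_daily (
    wiki_db     STRING   COMMENT 'Wiki database name, e.g. enwiki',
    user_id     BIGINT   COMMENT 'Editor user id (logged-in users only for the MVP)',
    user_name   STRING   COMMENT 'Editor user name',
    user_is_bot BOOLEAN  COMMENT 'Whether the user is flagged as a bot',
    edit_count  BIGINT   COMMENT 'Edits made by this editor on this day'
)
PARTITIONED BY (
    day      STRING COMMENT 'Activity day, YYYY-MM-DD',
    snapshot STRING COMMENT 'mediawiki_history snapshot the row was derived from'
)
STORED AS PARQUET;
```
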
The table will provide editor activity on a daily basis and will be refreshed monthly (for now; we will think about making it real-time in the future). For the MVP we are not focusing on retention-specific use cases, but having editor data at a daily grain would make basic retention calculations possible. Details are given in the meeting doc.
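
For example, a basic retention calculation against the hypothetical editor_activity_daily table sketched above might look like this (the cohort definition, wiki, and months are made up for illustration):

```
-- Hedged sketch: share of enwiki editors active in June 2022 who were
-- also active in July 2022, using the hypothetical daily table above.
WITH june AS (
    SELECT DISTINCT user_id
    FROM wmf.editor_activity_daily
    WHERE wiki_db = 'enwiki'
      AND day BETWEEN '2022-06-01' AND '2022-06-30'
),
july AS (
    SELECT DISTINCT user_id
    FROM wmf.editor_activity_daily
    WHERE wiki_db = 'enwiki'
      AND day BETWEEN '2022-07-01' AND '2022-07-31'
)
SELECT COUNT(july.user_id) / CAST(COUNT(*) AS DOUBLE) AS june_to_july_retention
FROM june
LEFT JOIN july ON june.user_id = july.user_id;
```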

Timeline update: it would be good if we could focus on building a good design for this table by the end of this quarter (Q1). By the end of September, my goal is to have a test table with sample data that analysts and PMs can query and give feedback on, so we can think about usability before we build the actual pipeline.
I am also trying my hand at making something similar to the [Equity Landscape] Editorship Metrics - Transformation Pipeline design.

@Milimetric and I are working on creating the table query and generating some sample data for testing and charting.
We are going through multiple iterations of review and feedback. We have a query that gets us data for the first 7 use cases, and we will add the remaining use cases to it.
Thank you @Milimetric for all your support on this!!
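
To give a sense of the intended usability, a use-case query might take this shape (illustrative only, against the hypothetical editor_activity_daily table from earlier, not the actual query under review):

```
-- Illustrative Presto query: daily active logged-in editors on one wiki,
-- the kind of chart a PM could build in Superset from such a table.
SELECT day,
       COUNT(DISTINCT user_id) AS active_editors
FROM wmf.editor_activity_daily
WHERE wiki_db = 'enwiki'
  AND day BETWEEN '2022-08-01' AND '2022-08-31'
GROUP BY day
ORDER BY day;
```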

The use cases and table design are captured in the requirements document.
Created sample data for the use cases (see P40208). We were able to map most of the columns using mediawiki_history. However, I didn't get a chance to test it out or share it with the team and stakeholders.
Pending:

  • user_property_experiment_group: we only have 4 user properties in wmf_raw.mediawiki_user_properties. There will be a new Foundation Tech Request to sqoop (all) user properties from MariaDB to wmf_raw.mediawiki_user_properties (see the query sketch after this list)
  • creates_new_page: phase 2, perhaps?
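
For reference, a quick way to check which properties are present today, assuming the sqooped table keeps MediaWiki's up_property column and the usual snapshot/wiki_db partitions (the snapshot and wiki values below are placeholders):

```
-- Hedged sketch: list the user properties currently available in the
-- sqooped table for one wiki and snapshot (values are placeholders).
SELECT up_property,
       COUNT(*) AS n_rows
FROM wmf_raw.mediawiki_user_properties
WHERE snapshot = '2022-12'
  AND wiki_db = 'enwiki'
GROUP BY up_property
ORDER BY n_rows DESC;
```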

Note: Per @EChetty, this task is currently deprioritized, since there is a plan to reduce tech debt and use the new mediawiki.page-change stream instead of mediawiki_history for building data pipelines. See T311129 for more.

This is news to me; I was wondering what happened to this work. I was having fun with Maya bringing this together.

So... maybe I'm missing something, but the mediawiki.page-change stream work would be totally orthogonal to this work. The stream would bring us data faster, but we would transform it into the same shape as we do now, using batch processing. We wouldn't want to fork that transformation and do it in two places. So, at most, mediawiki.page-change would let us have the data faster. Happy to talk about any misunderstanding here; ping @EChetty

Following a clarification conversation with @Milimetric:

The work required to fulfil the proposed stop-gap solution is not worth redirecting effort away from the page state change work on the events platform, so this will remain deprioritised for now.

Removing inactive assignee (please do so as part of team offboarding!).

I'm inclined towards declining this task since:

  1. it has been open for a while
  2. we are re-thinking our strategy for the Contributors metric since it is now a core annual plan metric. @OSefu-WMF, is there a task I can link here for the work you are doing?

There is a shadow workstream going on with Marshall focused on the Contributors metric measurement strategy: how do we transition things from Product to the wider Movement? Marshall is also trying to find the right answer here.

No phab task. Agreed on declining.