Editors dataset in Turnilo / Superset
Closed, Declined · Public

Description

Request Status: New Request
Request Type: project support request
Related OKRs:

Request Title: Editors dataset in Turnilo / Superset

  • Request Description: Turnilo and Superset are valuable tools for data analysts and product managers to measure their work and key results. The pageviews and edits datasets are heavily used, but a critical dataset is missing: editors. Many teams have key results for which the unit of analysis is the user (as opposed to the pageview or the edit). The Growth team's KRs tend to look at editor retention or counts of editors who have been activated, and the Editing team has KRs around the numbers of contributors who use their features. These numbers cannot be produced with our dashboarding tools; instead, the teams rely on data analysts to pull them manually with SQL (a sketch of the kind of one-off query involved follows this list). If a dataset at the editor level were available, product teams would have tremendously more insight into their outcomes. Construction of such a dataset would require choices about which aggregates to attach to the editor and the business rules for computing them. Product managers and data analysts would be able to assist in developing those.
  • Indicate Priority Level: High
  • Main Requestors: Growth, Editing, and Web teams
  • Ideal Delivery Date: August 2022
  • Stakeholders: Marshall Miller
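
For illustration, a minimal sketch of the kind of manual pull analysts run today, assuming the public wmf.mediawiki_history schema (the month, wiki, and metric definition below are hypothetical, not an agreed KR definition):

```
-- Hedged sketch: distinct logged-in editors on one wiki in one month,
-- pulled straight from mediawiki_history. Table and column names follow
-- the public schema; the metric definition is illustrative only.
SELECT COUNT(DISTINCT event_user_id) AS editors
FROM wmf.mediawiki_history
WHERE snapshot = '2022-06'                        -- monthly snapshot partition
  AND wiki_db = 'enwiki'
  AND event_entity = 'revision'
  AND event_type = 'create'
  AND event_user_is_anonymous = false             -- logged-in editors only
  AND substr(event_timestamp, 1, 7) = '2022-06';  -- edits made in June 2022
```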

Request Documentation

| Document Type | Required? | Document/Link |
|---|---|---|
| Related PHAB Tickets | Yes | T230092: Data exploration capabilities in Superset and Turnilo for editing data at “editor” level |
| Product One Pager | Yes | <add link here> |
| Product Requirements Document (PRD) | Yes | <add link here> |
| Product Roadmap | No | <add link here> |
| Product Planning/Business Case | No | <add link here> |
| Product Brief | No | <add link here> |
| Other Links | No | <add links here> |

Event Timeline

Scoping this down to logged-in users for an initial MVP

@EChetty wanted to note that @Mayakp.wiki previously worked with Connie to identify and record user requirements for a previous version of this request (see the referenced epic T230092). I think some of the engineering dependencies were close to a resolution, although work from my team's side was paused when we switched our focus to Session Length in early 2021. (I am sorry that we did not close the loop in the tickets.)

If this work gets underway again, I think it needs analytics support from the data stewardship & requirements side of things.

@EChetty, @Milimetric, @MMiller_WMF, @Iflorez and I met today to discuss the use cases, existing datasets, limitations, and scope of this project.
PA & DE will regroup in the coming weeks to discuss the initial table design.

We had sessions with Data Engineering and Product Analytics where we discussed the design for the table. At a high level, we are going to build a table derived from mediawiki_history and mw_user_history that presents the data in a way that is easier for product managers and analysts to query and dashboard. The table may not be on Druid (because Druid has a few known limitations). We will focus on building it in Hive, accessible via Presto.
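
As a concrete sketch of the idea (the table name, columns, and partitioning below are hypothetical placeholders, not the agreed design):

```
-- Hypothetical Hive DDL for the kind of derived table described above:
-- one row per logged-in editor per wiki per day, rebuilt monthly from
-- mediawiki_history. All names and columns are illustrative.
CREATE TABLE IF NOT EXISTS wmf.editor_activity_daily (
    wiki_db     STRING   COMMENT 'Wiki database name, e.g. enwiki',
    user_id     BIGINT   COMMENT 'Editor user id (logged-in users only for the MVP)',
    user_name   STRING   COMMENT 'Editor user name',
    user_is_bot BOOLEAN  COMMENT 'Whether the user is flagged as a bot',
    edit_count  BIGINT   COMMENT 'Edits made by this editor on this day'
)
PARTITIONED BY (
    day      STRING COMMENT 'Activity day, YYYY-MM-DD',
    snapshot STRING COMMENT 'mediawiki_history snapshot the row was derived from'
)
STORED AS PARQUET;
```
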
The table will provide editor activity on a daily basis and will be refreshed monthly (for now; we will think about making it real-time in the future). For the MVP we are not focusing on retention-specific use cases, but having editor data at a daily grain would make basic retention calculations possible. Details are given in the meeting doc.
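
For example, a basic retention calculation against the hypothetical editor_activity_daily table sketched above might look like this (the cohort definition, wiki, and months are made up for illustration):

```
-- Hedged sketch: share of enwiki editors active in June 2022 who were
-- also active in July 2022, using the hypothetical daily table above.
WITH june AS (
    SELECT DISTINCT user_id
    FROM wmf.editor_activity_daily
    WHERE wiki_db = 'enwiki'
      AND day BETWEEN '2022-06-01' AND '2022-06-30'
),
july AS (
    SELECT DISTINCT user_id
    FROM wmf.editor_activity_daily
    WHERE wiki_db = 'enwiki'
      AND day BETWEEN '2022-07-01' AND '2022-07-31'
)
SELECT COUNT(july.user_id) / CAST(COUNT(*) AS DOUBLE) AS june_to_july_retention
FROM june
LEFT JOIN july ON june.user_id = july.user_id;
```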

Timeline update: it would be good if we could focus on building a good design for this table by the end of this quarter (Q1). By the end of September, my goal is to have a test table with sample data that analysts and PMs can query and give feedback on, so we can think about usability before we build the actual pipeline.
I am also trying my hand at making something similar to the [Equity Landscape] Editorship Metrics - Transformation Pipeline design.

@Milimetric and I are working on creating the table query and generating some sample data for testing and charting.
We are going through multiple iterations of review and feedback. We have a query that gets us data for the first 7 use cases, and we will add the remaining use cases to it.
Thank you @Milimetric for all your support on this!!
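
To give a sense of the intended usability, a use-case query might take this shape (illustrative only, against the hypothetical editor_activity_daily table from earlier, not the actual query under review):

```
-- Illustrative Presto query: daily active logged-in editors on one wiki,
-- the kind of chart a PM could build in Superset from such a table.
SELECT day,
       COUNT(DISTINCT user_id) AS active_editors
FROM wmf.editor_activity_daily
WHERE wiki_db = 'enwiki'
  AND day BETWEEN '2022-08-01' AND '2022-08-31'
GROUP BY day
ORDER BY day;
```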

The use cases and table design are captured in the requirements document.
Created sample data for the use cases (see P40208). We were able to map most of the columns using mediawiki_history. However, I didn't get a chance to test it out or share it with the team and stakeholders.
Pending:

  • user_property_experiment_group: we only have 4 user properties in wmf_raw.mediawiki_user_properties. There will be a new Foundation Tech Request to sqoop (all) user properties from MariaDB to wmf_raw.mediawiki_user_properties (see the query sketch after this list)
  • creates_new_page: phase 2, perhaps?
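
For reference, a quick way to check which properties are present today, assuming the sqooped table keeps MediaWiki's up_property column and the usual snapshot/wiki_db partitions (the snapshot and wiki values below are placeholders):

```
-- Hedged sketch: list the user properties currently available in the
-- sqooped table for one wiki and snapshot (values are placeholders).
SELECT up_property,
       COUNT(*) AS n_rows
FROM wmf_raw.mediawiki_user_properties
WHERE snapshot = '2022-12'
  AND wiki_db = 'enwiki'
GROUP BY up_property
ORDER BY n_rows DESC;
```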

Note: Per @EChetty, this task is currently deprioritized, since there is a plan to reduce tech debt and use the new mediawiki.page-change stream instead of mediawiki_history for building data pipelines. See T311129 for more.

This is news to me; I was wondering what happened to this work. I was having fun with Maya bringing this together.

So... maybe I'm missing something, but the mediawiki.page-change stream work would be totally orthogonal to this work. The stream would bring us data faster, but we would transform it into the same shape as we do now, using batch processing. We wouldn't want to fork that transformation and do it in two places. So, at most, mediawiki.page-change would let us have the data faster. Happy to talk about any misunderstanding here; ping @EChetty

Following a clarification conversation with @Milimetric:

The work required to fulfil the proposed stop-gap solution is not worth redirecting effort away from the page state change work on the events platform, so this will remain deprioritised for now.

Removing inactive assignee (please do so as part of team offboarding!).

I'm inclined towards declining this task since:

  1. it has been open for a while
  2. we are re-thinking our strategy for the Contributors metric since it is now a core annual plan metric. @OSefu-WMF, is there a task I can link here for the work you are doing?

There is a shadow workstream going on with Marshall focused on the Contributors metric measurement strategy: how do we transition things from Product to the wider Movement? Marshall is also trying to find the right answer here.

No phab task. Agreed on declining.