
Better Use of Data
Closed, Resolved (Public)

Description

This is a parent task for the work to be done for the Better Use of Data Program, which was started in FY2018/19.

Roadmap FY2019-2020

Q1 (July - September)
Data Engineering
  • Event Platform Client Libraries prototypes
    • Develop Event Platform Client specification T228177
    • Planning Stream & Schema usage T228656
  • Prototype clients
    • Prototype Android client
    • Prototype iOS client
    • Prototype JS browser client
Data Access
  • Automated dashboard for Product Core Metrics (Readers)
  • Internal production release of edits_hourly Druid datasets (for use in Superset and Turnilo)
Data Training
  • Start product team trainings: best practices for working with data in the product development lifecycle
  • Start product team trainings for core metrics: data exploration and reports
Tracking
  • Modern Event Platform (MEP) stream configuration service planning
  • MEP schema registry deployment
  • Client-side error logging working group

Q2 (October - December)
Data Engineering
  • Provide test-ready Modern Event Platform clients for MediaWiki, the Android Wikipedia app, and the iOS Wikipedia app
  • MEP stream configuration service has been deployed for analytics events T233634
  • MEP EventGate instance has been deployed for analytics events T236386, T233629
  • MEP client for MediaWiki has been tested with Vagrant T238544
  • Cross-platform client-side error logging T229442
Data Quality
  • Ensure Data Quality is considered as part of MEP T228228
  • Document plan for technical changes needed to improve data T236504
  • Document plan for process changes needed to improve data and present to Product Analytics team T235802
Data Access
  • Automated dashboard for Product Core Metrics (Edits)
Data Training
  • Group training on core metrics: data exploration and reports
  • Office hours with members from product teams: data exploration and reports
Tracking
  • MEP engineering sync
  • Client-side error logging working group

Q3 (January - March)
Data Engineering
  • Finish Event Platform Client (EPC) for production
    • Develop production version of Sampling Controller
  • Document usage guidelines on-wiki
    • Recommendation for porting old schemas to new system
    • Document A/B testing procedures
    • Document funnel analysis procedures
  • Integration: EPC is available for use on the 3 major platforms
    • One (1) web team is able to use EPC for analytics
    • Android team is able to use EPC for analytics
    • iOS team is able to use EPC for analytics
  • Deploy error logging instrumentation to production T238544
Data Quality
  • Ensure Data Quality is considered during piloting of MEP Client Libraries
  • Pilot process changes needed to improve data with at least 1 Product Team T235802
Tracking
  • TBD

Q4 (April - June)
Data Engineering
  • Usage: EPC is used on the 3 major platforms
    • One (1) web team is using EPC for analytics
    • Android team is using EPC for analytics
    • iOS team is using EPC for analytics
  • Advise that all newly created schemas use EventGate-style JSONSchema
  • Port select EventLogging schemas to EventGate-style JSONSchema
  • Evaluate feasibility of cross-schema joins (comes with EPC automatically)
  • Develop Event Platform Client test suite T228178
  • Research and architect "session length" dataset (see https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/SessionLength for ideas)
Tracking
  • TBD
Data Quality
  • [UNDER CONSIDERATION] Pilot technical changes needed to improve data T236504

Stretch Goals

  • Evaluate analytics systems capacity ("are we going to break our whole system by pinging from a lot of clients on a second-by-second basis?")
  • MEP for Product
    • Develop schema registry UI
    • Develop stream configuration service UI
    • Develop CI and commit hooks
  • Research and architect "unique devices" dataset
  • Develop automated ingestion pipeline and dashboard defaults

Event Timeline

kzimmerman updated the task description.

Closing this as we have moved to Betterworks to track our work around Better Use of Data.

FY19-20 OKR

FY20-21 OKR