This is a parent task for work on the Better Use of Data Program, which started in FY2018/19.
Roadmap FY2019-2020
Q1 (July - September)
Data Engineering
- Event Platform Client Libraries prototypes
  - Prototype clients
    - Prototype Android client
    - Prototype iOS client
    - Prototype JS browser client
Data Access
- Automated dashboard for Product Core Metrics (Readers)
- Internal production release of edits_hourly Druid datasets (for use in Superset and Turnilo)
Data Training
- Start product team trainings: best practices for working with data in the product development lifecycle
- Start product team trainings for core metrics: data exploration and reports
Tracking
- MEP stream configuration service planning
- MEP schema registry deployment
- Client-side error logging working group
Q2 (October - December)
Data Engineering
- Provide test-ready Modern Event Platform clients for MediaWiki, the Android Wikipedia app, and the iOS Wikipedia app
- MEP stream configuration service has been deployed for analytics events T233634
- MEP EventGate instance has been deployed for analytics events T236386, T233629
- MEP client for MediaWiki has been tested with Vagrant T238544
- Cross-platform client-side error logging T229442
Data Quality
- Ensure Data Quality is considered as part of MEP T228228
- Document plan for technical changes needed to improve data T236504
- Document plan for process changes needed to improve data and present to Product Analytics team T235802
Data Access
- Automated dashboard for Product Core Metrics (Edits)
Data Training
- Group training on core metrics: data exploration and reports
- Office hours with members from product teams: data exploration and reports
Tracking
- MEP engineering sync
- Client-side error logging working group
Q3 (January - March)
Data Engineering
- Finish EPC for production
- Develop production version of Sampling Controller
- Document usage guidelines on-wiki
- Recommendation for porting old schemas to new system
- Document A/B testing procedures
- Document funnel analysis procedures
- Integration: EPC is available for use on the 3 major platforms
  - One (1) web team is able to use EPC for analytics
  - Android team is able to use EPC for analytics
  - iOS team is able to use EPC for analytics
- Deploy error logging instrumentation to production T238544
Data Quality
- Ensure Data Quality is considered during piloting of MEP Client Libraries
- Pilot process changes needed to improve data with at least 1 Product Team T235802
Tracking
- TBD
Q4 (April - June)
Data Engineering
- Usage: EPC is used on the 3 major platforms
  - One (1) web team is using EPC for analytics
  - Android team is using EPC for analytics
  - iOS team is using EPC for analytics
- Advise that all newly created schemas use EventGate-style JSONSchema
- Port select EventLogging schemas to EventGate-style JSONSchema
- Evaluate feasibility of cross-schema joins (comes with EPC automatically)
- Develop Event Platform Client test suite T228178
- Research and architect "session length" dataset (see https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/SessionLength for ideas)
Data Quality
- [UNDER CONSIDERATION] Pilot technical changes needed to improve data T236504
Tracking
- TBD
Stretch Goals
- Evaluate analytics systems capacity ("are we going to break our whole system by pinging from a lot of clients on a second-by-second basis?")
- MEP for Product
- Develop schema registry UI
- Develop stream configuration service UI
- Develop CI and commit hooks
- Research and architect "unique devices" dataset
- Develop automated ingestion pipeline and dashboard defaults