
[Data Quality] Implement basic data quality metrics for Unique Devices datasets
Open, Medium, Public

Description

Implement data quality checks on the unique devices data, similar to webrequest (T349763) and mw_history (T354692).

Scope

  • Tables
    • unique_devices_per_domain_daily
    • unique_devices_per_domain_monthly
    • unique_devices_per_project_family_daily
    • unique_devices_per_project_family_monthly
  • The following metrics and checks should be implemented:
    • TBD (see the illustrative sketch after this list for one candidate check)
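
As a starting point for the TBD list above, here is a minimal sketch of one basic volume metric over one of the tables in scope. This is illustrative only: the table name (wmf.unique_devices_per_domain_daily), the uniques_estimate column, the year/month/day partition columns, and the 50% deviation threshold are assumptions that would need to be confirmed against the actual Hive schema and tuned before any real implementation.

```
# Hedged sketch: daily row counts and uniques totals, flagging days that deviate
# strongly from the overall mean or contain negative estimates.
# Table/column names and thresholds are assumptions, not a confirmed design.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("unique_devices_basic_dq").getOrCreate()

df = spark.table("wmf.unique_devices_per_domain_daily")  # assumed table name

# Aggregate per daily partition (assumes year/month/day partition columns).
daily = (
    df.groupBy("year", "month", "day")
      .agg(
          F.count("*").alias("row_count"),
          F.sum("uniques_estimate").alias("total_uniques"),
          F.sum(F.when(F.col("uniques_estimate") < 0, 1).otherwise(0))
           .alias("negative_estimates"),
      )
)

# Simple check: flag days whose total deviates more than 50% from the mean,
# or that contain any negative estimates. Threshold is a placeholder.
mean_uniques = daily.agg(F.avg("total_uniques").alias("m")).collect()[0]["m"]
flagged = daily.filter(
    (F.col("total_uniques") < 0.5 * mean_uniques)
    | (F.col("total_uniques") > 1.5 * mean_uniques)
    | (F.col("negative_estimates") > 0)
)
flagged.show(truncate=False)
```

In practice the checks defined for webrequest and mw_history would likely be the template here; the sketch is only meant to make the "metrics and checks" item concrete for discussion.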
User Story
As a Data Analyst, I want to increase the quality and reliability of the unique devices datasets so that we can provide a more dependable annual plan metric.
Questions to answer (please add more as needed):
  • What are the processing steps involved?
  • For each processing step:
    • What are the unit tests in place?
    • What is logged during the processing step?
    • Where are the logs stored?
    • Are the logs visualized? If so, where?
  • At the dataset level:
    • What other data quality checks do we have?
    • What alerts do we send? Where do they go? What system sends the alert?
    • What are the upstream dependencies?
    • How do we determine the completeness of the dataset? (see the sketch after this list)
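
On the completeness question, one possible approach is to verify that a daily partition exists for every date in an expected range. The sketch below assumes the same table name and year/month/day partition layout as above; both are assumptions that would need to match the actual Hive schema, and the date range is a placeholder.

```
# Hedged sketch of a completeness check: confirm a partition exists for each
# expected date. Table name and partition columns are assumptions.
from datetime import date, timedelta
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unique_devices_completeness").getOrCreate()

start, end = date(2024, 1, 1), date(2024, 1, 31)  # placeholder date range
expected = {start + timedelta(days=i) for i in range((end - start).days + 1)}

present_rows = (
    spark.table("wmf.unique_devices_per_domain_daily")
         .select("year", "month", "day")
         .distinct()
         .collect()
)
present = {date(r["year"], r["month"], r["day"]) for r in present_rows}

missing = sorted(expected - present)
if missing:
    print(f"Missing {len(missing)} daily partition(s): {missing}")
else:
    print("All expected daily partitions are present.")
```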
Deliverable
  1. Document with all the above information
  2. A data quality dashboard for unique devices (similar to https://superset.wikimedia.org/superset/dashboard/p/3j5vKX2BYzl/ )

Event Timeline

@lbowmaker / @Ahoelzl : I made this task as a placeholder for upcoming work we are going to be doing for Relevance metric data governance. I'll schedule time for us to discuss further in the coming weeks.

Thanks @Mayakp.wiki!

Please see this ticket and docs where we did some investigation into the unique devices data pipeline flow and data quality checks:

https://phabricator.wikimedia.org/T347706

Happy to discuss and define what checks we could implement, but I don’t think we will get to implementing anything this FY. However, from what I’ve seen of the SDS objectives for next year, I think we will have scope to implement them as part of that.

Great, thanks @lbowmaker for reminding me about that task! Will review it.
Yes, it's okay to not implement anything this FY, but having a good idea of what is possible and what we should have is a good starting point for the remainder of this FY.

Question: is there a backlog where we can add the implementation hypothesis so that we can plan accordingly for next FY?

I moved this ticket to our Data Quality column, which we will review and prioritize based on the KRs for next year.

I think we could use this ticket as the Epic and have subtasks like: define and document the checks we want, implement checks, etc. (we can tie those tickets to hypotheses).