
[Data Quality] Implement basic data quality metrics for Unique Devices datasets
Open, Medium, Public


Implement data quality checks on unique devices data, similar to those for webrequest (T349763) and mw_history (T354692).


  • Tables
    • unique_devices_per_domain_daily
    • unique_devices_per_domain_monthly
    • unique_devices_per_project_family_daily
    • unique_devices_per_project_family_monthly
  • The following metrics and checks should be implemented:
    • TBD (a sketch of one possible basic check follows below)
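As a concrete starting point, here is a minimal sketch of what one basic check could look like, assuming a PySpark environment with access to the Hive tables listed above. The table name matches the list; the uniques_estimate column and year/month/day partition layout are assumptions for illustration, not confirmed specifics of this task.

```python
# Hedged sketch: row-count and validity checks for one daily partition of
# unique_devices_per_domain_daily. Column and partition names are assumed.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("unique_devices_dq_sketch").getOrCreate()

# Read a single daily partition (dates are illustrative).
df = spark.table("wmf.unique_devices_per_domain_daily").where(
    (F.col("year") == 2024) & (F.col("month") == 5) & (F.col("day") == 1)
)

# Aggregate a few basic quality metrics in one pass.
metrics = df.agg(
    F.count(F.lit(1)).alias("row_count"),
    F.sum(F.when(F.col("uniques_estimate").isNull(), 1).otherwise(0)).alias("null_estimates"),
    F.sum(F.when(F.col("uniques_estimate") < 0, 1).otherwise(0)).alias("negative_estimates"),
).first()

# Fail loudly so an orchestrator (e.g. Airflow) can alert on the failure.
assert metrics.row_count > 0, "partition is empty"
assert metrics.null_estimates == 0, "found NULL uniques_estimate values"
assert metrics.negative_estimates == 0, "found negative uniques_estimate values"
```

The same pattern (non-empty partition, no NULL or negative estimates) would apply to the other three tables with their respective partition granularity.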
User Story
As a Data Analyst, I want to increase the quality and reliability of the unique devices datasets so that we can provide a more dependable annual plan metric.
Questions to answer (please add more as needed):
  • What are the processing steps involved?
  • For each processing step:
    • What unit tests are in place?
    • What is logged during the step?
    • Where are the logs stored?
    • Are the logs visualized? If so, where?
  • At the dataset level:
    • What other data quality checks do we have?
    • What alerts do we send? Where do they go? What system sends them?
    • What are the upstream dependencies?
    • How do we determine the completeness of the dataset? (one possible heuristic is sketched below, after the deliverables)
Deliverables:
  1. A document with all the above information
  2. A data quality dashboard for unique devices (similar to )
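On the completeness question above, one common heuristic is to compare a partition's row count against its recent history and flag large drops. A hedged sketch follows, reusing the same assumed table and partition layout as the check above; the 7-day window and 30% threshold are illustrative choices, not this team's actual configuration.

```python
# Hedged sketch: flag days whose row count falls well below the trailing mean.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("unique_devices_completeness_sketch").getOrCreate()

# Row counts per daily partition for one month (dates are illustrative).
daily_counts = (
    spark.table("wmf.unique_devices_per_domain_daily")
    .where((F.col("year") == 2024) & (F.col("month") == 5))
    .groupBy("year", "month", "day")
    .agg(F.count(F.lit(1)).alias("row_count"))
    .orderBy("day")
    .collect()
)

# Flag any day more than 30% below the mean of the preceding (up to) 7 days.
for i, row in enumerate(daily_counts):
    window = daily_counts[max(0, i - 7):i]
    if not window:
        continue  # no history yet for the first day in range
    trailing_mean = sum(r.row_count for r in window) / len(window)
    if row.row_count < 0.7 * trailing_mean:
        print(
            f"Possible incomplete partition {row.year}-{row.month:02d}-{row.day:02d}: "
            f"{row.row_count} rows vs trailing mean {trailing_mean:.0f}"
        )
```

A complementary approach is to verify upstream dependency markers (e.g. success flags for the source partitions) before applying the count heuristic, which ties into the upstream-dependencies question above.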

Event Timeline

@lbowmaker / @Ahoelzl : I made this task as a placeholder for upcoming work we are going to be doing for Relevance metric data governance. I'll schedule time for us to discuss further in the coming weeks.


Please see this ticket and docs where we did some investigation into the unique devices data pipeline flow and data quality checks:

Happy to discuss and define what checks we could implement, but I don’t think we will get to implementing anything this FY. However, from what I’ve seen of the SDS objectives for next year, I think we will have scope to implement them as part of that work.

Great, thanks @lbowmaker for reminding me about that task! I will review it.
Yes, it's OK not to implement anything this FY, but having a good idea of what is possible and what we should have is a good starting point for the remainder of this FY.

Question: is there a backlog where we can add the implementation hypothesis so that we can plan accordingly for next FY?

I moved this ticket to our Data Quality column, which we will review and prioritize based on the KRs for next year.

I think we could use this ticket as the Epic and have subtasks like: define and document the checks we want, implement the checks, etc. (we can tie those tickets to hypotheses).