
Proof of Concept Table for Data Quality
Closed, Declined · Public

Description

Plan for BUOD - Data Quality: an additional set of tables that have passed quality checks, plus a separate table for errors.
To implement this, first do a POC by building a table whose rows have passed the data quality checks defined for it (suggested candidate: the MobileWikiAppSessions table). The resulting table will be a subset of the existing Hive table.

Things to consider :

  • Work out the questions for proof of concept
  • Do a lot of research around data volume
  • How will the errors be logged?
  • What rules will identify errors?
  • What is the cadence for replication?
  • Start with a static implementation (snapshot)

Then present this proof of concept to the Analytics team to discuss buy-in before the end of Q2.

Other things to consider: privacy implications/duplicating IPs, etc.

Note on MobileWikiAppSessions: issues that have come up include duplicated events and sessions with a negative session length.
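
The two issues named for MobileWikiAppSessions (duplicated events, negative session length) could be expressed as row-level rules. The following is only a minimal sketch: it assumes each row is a dict with hypothetical `event_id` and `session_length` fields, which may not match the real table schema, and `split_rows` is an illustrative name rather than an existing function.

```python
def split_rows(rows):
    """Split rows into (passed, errors) using two illustrative rules.

    Assumes hypothetical fields `event_id` and `session_length`;
    the actual MobileWikiAppSessions schema may differ.
    """
    seen_ids = set()
    passed, errors = [], []
    for row in rows:
        reasons = []
        # Rule 1: an event id that was already seen is a duplicate.
        if row["event_id"] in seen_ids:
            reasons.append("duplicate_event")
        seen_ids.add(row["event_id"])
        # Rule 2: a session cannot have a negative length.
        if row["session_length"] < 0:
            reasons.append("negative_session_length")
        if reasons:
            # Keep the failing row together with the reasons it failed.
            errors.append({**row, "error_reasons": reasons})
        else:
            passed.append(row)
    return passed, errors
```

Recording the failure reasons next to each rejected row is one possible answer to how errors could be logged; the first occurrence of a duplicated event is kept and only later copies are rejected.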

Event Timeline

Steps for the data POC, as discussed with Jason / Mikhail during the PA offsite:

  1. Create a custom table in the db.
  2. Take a snapshot of data from MobileWikiAppSessions (3 months) and insert it into the table created in step 1.
  3. Write Python functions to be applied to each row in this table. If a row passes, extract it and insert it into a new table in the db.
  4. If a row fails, record it in an error table designed to hold information about the errored or failed records.
  5. After the row-by-row checks, write Python functions that apply statistical checks to the data, such as plotting the distribution of the values and computing MAD, IQR, etc. The interquartile range (IQR = Q3 − Q1) is often used to find outliers in data. Outliers here are defined as observations that fall below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR.
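
The IQR outlier rule in step 5 can be sketched in Python as follows. This is an illustration under assumptions, not the planned implementation: `iqr_outliers` is a hypothetical name, and the quartiles are computed with the standard library's `statistics.quantiles` using the "inclusive" method, which is one of several common quartile definitions.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values falling outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Example: session lengths with one extreme value.
lengths = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(iqr_outliers(lengths))  # prints [100]
```

For this sample, Q1 = 3 and Q3 = 7, so IQR = 4 and the fences are −3 and 13; only 100 falls outside them.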
LGoto triaged this task as Medium priority. Apr 10 2020, 4:40 PM
LGoto moved this task from Triage to Backlog on the Product-Analytics board.

I believe this task may no longer be required based on the work being planned by the Metrics Platform team. I will check with @kzimmerman before I make changes to the task.

Confirmed with Kate on 02-22-2022: declining this task.