Plan for BUOD - Data Quality: an additional set of tables that have passed quality checks, plus a separate table for errors.
To implement this, first build a proof of concept (POC): a table that has passed the data quality checks defined for it (suggested candidate: the MobileWikiAppSessions table). The resulting table will be a subset of the existing Hive table.
Things to consider:
- Work out the questions the proof of concept should answer
- Do a lot of research around data volume
- How will errors be logged?
- What rules will be used to identify errors?
- What is the cadence for replication?
- Start with a static implementation (snapshot)
Then present this proof of concept to the Analytics team to discuss buy-in before the end of Q2.
Other things to consider: privacy implications/duplicating IPs, etc.
MobileWikiAppSessions: issues that have come up so far include duplicated events and sessions with a negative session length.
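
A minimal sketch of how the two known issues could be turned into rules for a static snapshot, splitting rows into a clean table and an error table. This is an illustration only, not the agreed design: the field names `event_id` and `session_length`, the rule names, and the `failed_rules` column are all assumptions, and a real run against Hive would need to account for data volume.

```python
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


def run_quality_checks(rows: Iterable[Row]) -> Tuple[List[Row], List[Row]]:
    """Split rows into (clean, errors) using two illustrative rules.

    Assumed (hypothetical) fields: 'event_id' and 'session_length'.
    """
    clean: List[Row] = []
    errors: List[Row] = []
    seen_event_ids = set()

    for row in rows:
        failed: List[str] = []

        # Rule 1: an event id seen earlier in the snapshot is a duplicate.
        if row["event_id"] in seen_event_ids:
            failed.append("duplicate_event")
        else:
            seen_event_ids.add(row["event_id"])

        # Rule 2: a session cannot have a negative length.
        if row["session_length"] < 0:
            failed.append("negative_session_length")

        if failed:
            # Log the row together with the rules it failed.
            errors.append({**row, "failed_rules": failed})
        else:
            clean.append(row)

    return clean, errors


# Made-up sample rows, for illustration only.
sample = [
    {"event_id": "e1", "session_length": 120},
    {"event_id": "e1", "session_length": 120},
    {"event_id": "e2", "session_length": -5},
    {"event_id": "e3", "session_length": 30},
]
clean, errors = run_quality_checks(sample)
```

Keeping the failed rule names on each error row is one possible answer to the logging question above: the error table then records both the offending row and the reason it was rejected.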