==== Context
We want the Image Recommendation data pipeline to respect system and data quality SLOs. The system we are developing is coupled to data generated by external processes (user interactions, MySQL dumps, analytics data pipelines). While we should strive for proper unit and integration testing to ensure the correctness of our code, there is a category of failure scenarios that will require introspection, instrumentation, and analysis of the running system.
==== Acceptance Criteria
**System Metrics**
Spark metrics sinks (see the configuration sketch after this list)
[] In/out records
[] CPU usage
[] Memory usage
[] Executor counts
[] Run time
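Most of these should be obtainable from Spark's built-in metrics system. A minimal sketch, assuming Spark 3.x (where `metrics.properties` entries can be passed as `spark.metrics.conf.*` settings) and a Prometheus-style scrape target; the app name and sink choice are assumptions, not decisions:

```python
# Sketch: surface CPU, memory, and executor metrics via Spark's built-in
# metrics system. Assumes Spark 3.x; the sink (PrometheusServlet) and the
# app name are placeholders for whatever our monitoring stack actually uses.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("image-recommendation-pipeline")  # hypothetical name
    # Expose executor metrics on the driver UI at /metrics/executors/prometheus.
    .config("spark.ui.prometheus.enabled", "true")
    # Serve per-component (driver/executor) metrics at /metrics/prometheus.
    .config("spark.metrics.conf.*.sink.prometheusServlet.class",
            "org.apache.spark.metrics.sink.PrometheusServlet")
    .config("spark.metrics.conf.*.sink.prometheusServlet.path",
            "/metrics/prometheus")
    # The JVM source adds heap/non-heap memory and GC gauges per component.
    .config("spark.metrics.conf.*.source.jvm.class",
            "org.apache.spark.metrics.source.JvmSource")
    .getOrCreate()
)
```

In/out record counts are easiest to capture at the dataset level; see the sketch under **Dataset** below.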
**Dataset** (see the sketch after this list)
[] Summary of population statistics
[] Sizes and row counts of intermediate and final datasets
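A minimal sketch of what these checks could look like in PySpark; the function name and stage labels are hypothetical:

```python
# Sketch: log row counts and population statistics for each pipeline stage.
# `log_dataset_metrics` and the stage labels are hypothetical.
from pyspark.sql import DataFrame

def log_dataset_metrics(df: DataFrame, stage: str) -> None:
    """Record the row count and summary statistics for one stage's output."""
    print(f"[{stage}] rows={df.count()} columns={len(df.columns)}")
    # count/mean/stddev/min/quartiles/max for numeric columns.
    df.summary().show(truncate=False)

# Called after each transformation, e.g.:
# log_dataset_metrics(raw_input, "raw_input")
# log_dataset_metrics(recommendations, "final_recommendations")
```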
==== Subtasks
[]
==== Out of Scope
[] Collecting the Android app's feedback on user judgments and incorporating it back into the algorithm
==== Open Questions
- Are we dropping records from one stage of the pipeline to another?
- Are there unexpected/malformed inputs?
- Do we see significant changes in input/output compared to previous runs?