User Story
As a platform engineer, I need to define and implement logging processes so that I can easily determine the causes of failures.
Success Criteria
- Logs are generated by runs of the use case and can be easily viewed in the Airflow UI
- Stack traces of errors can be viewed in the Airflow UI
- Persistence of Airflow logs is understood
Observability criteria
(@lbowmaker adding this section for context. It feels like a bit of scope creep for this task, though. How would you feel about repurposing the epic https://phabricator.wikimedia.org/T275165 and making 292747 a subset of Observability?)
We want data pipelines to respect system and data-quality SLOs. The systems we develop are coupled with data generated by external processes (user interactions, MySQL dumps, analytics data pipelines). While we should strive for proper unit and integration testing to ensure the correctness of our code, there is a category of failure scenarios that will require introspection, instrumentation, and analysis of the running system.
We should log obvious failures, such as stack traces of thrown exceptions, and also have the capability to emit and track debug messages.
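As a minimal sketch (function and logger names are illustrative, not an agreed interface): Airflow captures anything emitted through Python's standard `logging` module into per-task log files that surface in the Airflow UI, so failures logged with `logger.exception` arrive there with their full stack trace.

```python
import logging

# Messages emitted through the standard logging module are captured by
# Airflow's task handlers and shown in the task log view of the UI.
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

def transform(records):
    """Illustrative transformation step; the body is a placeholder."""
    logger.debug("transform called with %d records", len(records))
    return [r * 2 for r in records]

def run_task(records):
    try:
        return transform(records)
    except Exception:
        # logger.exception logs at ERROR level and appends the current
        # stack trace, so it appears in the task log without extra work.
        logger.exception("transform failed")
        raise
```

Re-raising after logging keeps Airflow's own failure handling (retries, alerting) intact while still leaving the stack trace in the task log.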
Additionally, we’ll need to track two types of metrics:
- System
- Spark sinks:
    - in / out records
    - CPU usage
- memory usage
- executor counts
- run time
- Dataset
    - Data transformation pre/post-conditions. We should track the sizes and record counts of intermediate and final datasets, to guard against data issues and identify regressions.
    - Summary population statistics, used to identify regressions, population/model drift, and anomalies.
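A pre/post-condition check on record counts could look something like the sketch below (the threshold and metric names are illustrative assumptions, not an agreed interface):

```python
def check_row_counts(rows_in: int, rows_out: int,
                     max_drop_ratio: float = 0.5) -> dict:
    """Compare record counts before and after a transformation and
    flag runs where an unexpectedly large fraction of rows was lost."""
    dropped = rows_in - rows_out
    drop_ratio = dropped / rows_in if rows_in else 0.0
    return {
        "rows_in": rows_in,
        "rows_out": rows_out,
        "drop_ratio": drop_ratio,
        # Flag the run so a later alerting step (or a human) inspects it.
        "suspicious": drop_ratio > max_drop_ratio,
    }

metrics = check_row_counts(rows_in=1000, rows_out=400)
# drop_ratio is 0.6, above the 0.5 threshold, so the run is flagged
```

A dict of metrics like this could be logged, pushed to a metrics collector, or stored alongside the dataset; which of those we actually do is one of the open questions below.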
Open questions / remarks
- We could distinguish between enabling Airflow task logging and adding ad hoc logging facilities to the pipeline itself. Which should be in scope for this iteration?
- For this iteration, would it be sufficient to expose logs via Airflow UI?
- Which events do we ship to a metrics collector rather than logging?
- Where do we store logs? We can currently access Airflow logs; should we ship them to ES?