
Define and Create Logging Routines - Airflow UI
Closed, Declined · Public

Description

User Story
As a platform engineer, I need to define and implement logging processes so that I can easily determine the causes of failures.
Success Criteria
  • Logs are generated from runs of the use case and can be easily viewed in the Airflow UI
  • Stack traces of errors can be viewed in the Airflow UI
  • Persistence of Airflow logs is understood
Observability criteria

(@lbowmaker: adding this section for context. It feels like a bit of scope creep for this task, though. How would you feel about repurposing this epic, https://phabricator.wikimedia.org/T275165, and making T292747 a subset of Observability?)

We want the data pipeline to respect system and data quality SLOs. The systems we develop are coupled with data generated by external processes (user interactions, MySQL dumps, analytics data pipelines). While we should strive for proper unit and integration testing to ensure the correctness of our code, there is a category of failure scenarios that will require introspection, instrumentation, and analysis of the system.

We should log obvious failures, such as stack traces of thrown exceptions, and also have the capability to emit and track debug messages.
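To make this concrete, here is a minimal sketch (not prescriptive; it assumes Airflow 2.x and the standard PythonOperator, and the DAG/task ids are hypothetical) of how standard-library logging and uncaught exceptions surface in the task log that the Airflow UI renders:

```python
import logging
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

log = logging.getLogger(__name__)


def generate_file():
    # Messages emitted through the standard logging module are captured
    # by Airflow's task log handler and shown in the task's log view.
    log.info("starting file generation")
    log.debug("debug detail; visible when the configured log level allows it")
    output_ok = False  # pretend the expected output file was not produced
    if not output_ok:
        # An uncaught exception's stack trace is rendered in the UI log view.
        raise RuntimeError("expected output file was not generated")


with DAG(
    dag_id="logging_demo",  # hypothetical DAG id
    start_date=datetime(2021, 10, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="generate_file", python_callable=generate_file)
```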

Additionally, we'll need to track two types of metrics:

  • System
    • Spark sinks:
      • in / out records
      • CPU usage
      • memory usage
      • executor counts
      • run time
  • Dataset
    • Pre-/post-conditions on data transformations. We should keep track of the sizes and row counts of intermediate and final datasets, to guard against data issues and identify regressions (see the sketch after this list).
    • Summaries of population statistics, to identify regressions, population/model drift, and anomalies.
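As a sketch of the dataset side (function and threshold names are hypothetical, and PySpark is assumed), a pre-/post-condition check could log row counts and fail fast on violations:

```python
import logging

from pyspark.sql import DataFrame

log = logging.getLogger(__name__)


def check_row_count(df: DataFrame, name: str, min_rows: int = 1) -> DataFrame:
    """Log a dataset's size and fail fast when a precondition is violated."""
    n = df.count()
    log.info("dataset=%s row_count=%d", name, n)  # candidate metric to export
    if n < min_rows:
        raise ValueError(f"{name}: expected at least {min_rows} rows, got {n}")
    return df


# usage inside a transformation step:
#   validated = check_row_count(transformed_df, "intermediate_features", min_rows=1000)
```

Logged counts like these are also natural candidates for the "metrics collector vs. logging" question below.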
Open questions / remarks
  • We could distinguish between enabling Airflow task logging and adding ad hoc logs and facilities to the pipeline. Which should be in scope for this iteration?
  • For this iteration, would it be sufficient to expose logs via the Airflow UI?
  • Which events do we ship to a metrics collector rather than logging?
  • Where do we store logs? Currently we can access Airflow logs; should we ship them to ES?

Event Timeline

@gmodena - my thinking for this task was to do something simple that also supports a basic use case for a dataset producer/platform engineer. For example: 'I expected my Airflow DAG to generate a file; I can't see the file, so I'll check for any errors in Airflow, and the UI will show me the stack trace.'

So in scope for this could be something as simple as:

  • Stack trace of code failures (any time an exception is thrown)
  • Stack trace and run can be viewed in the Airflow UI
  • Logs/runs are deleted after X days

Seems https://phabricator.wikimedia.org/T275165 is done? Is that included in the latest version of the algorithm after refactoring?

@gmodena - my thinking for this task was to do something simple that also supports a basic use case for a dataset producer/platform engineer.

Thanks for clarifying. We should already get the following "for free" with the current scheduler deployment:

  • Stack trace of code failures (any time an exception is thrown)
  • Stack trace and run can be viewed in the Airflow UI

For the following:

  • Logs/runs are deleted after X days

I'll need to validate whether we can configure retention/rotation policies ourselves or whether we'd need to interface with SRE. I'll get back to you on this. Either way, it's not a problem.
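For illustration only (this wasn't part of the original exchange): if we can manage retention ourselves, one common pattern is a maintenance DAG that prunes old task logs from disk. A minimal sketch, assuming Airflow 2.x with local file logging under `[logging] base_log_folder`; the DAG id and retention window are hypothetical:

```python
import logging
import os
import time
from datetime import datetime

from airflow import DAG
from airflow.configuration import conf
from airflow.operators.python import PythonOperator

log = logging.getLogger(__name__)
RETENTION_DAYS = 30  # hypothetical "X days"


def prune_old_logs():
    # Walk the configured log folder and delete files past the retention window.
    base = conf.get("logging", "base_log_folder")
    cutoff = time.time() - RETENTION_DAYS * 24 * 3600
    for root, _dirs, files in os.walk(base):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) < cutoff:
                log.info("removing %s", path)
                os.remove(path)


with DAG(
    dag_id="airflow_log_cleanup",  # hypothetical DAG id
    start_date=datetime(2021, 10, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="prune_old_logs", python_callable=prune_old_logs)
```

Note that this only touches log files on disk; newer Airflow releases (2.3+) also ship an `airflow db clean` command for pruning old metadata rows (DAG runs, task instances) separately.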

Seems https://phabricator.wikimedia.org/T275165 is done? Is that included in the latest version of the algorithm after refactoring?

We do have logic for generating those metrics/reports manually, but we do not orchestrate it in the data pipeline yet.

lbowmaker renamed this task from "Define and Create Logging Routines" to "Define and Create Logging Routines - Airflow UI". Oct 27 2021, 6:54 PM
lbowmaker updated the task description.
lbowmaker moved this task from Backlog to Ready/Groomed 📚 on the Generated Data Platform board.