User Story
As a platform engineer, I need to define and implement logging processes so that I can easily determine the causes of failures.
Success Criteria
- Logs are generated by runs of the use case and can be easily viewed in the Airflow UI
- Stack traces of errors can be viewed in the Airflow UI
- Persistence of Airflow logs is understood
Observability criteria
(@lbowmaker adding this section for context. It feels like a bit of scope creep for this task, though. How would you feel about repurposing the epic https://phabricator.wikimedia.org/T275165 and making 292747 a subset of Observability?)
We want data pipelines to respect system and data-quality SLOs. The systems we develop are coupled with data generated by external processes (user interactions, MySQL dumps, analytics data pipelines). While we should strive for proper unit and integration testing to ensure the correctness of our code, there is a category of failure scenarios that will require introspection, instrumentation, and analysis of the running system.
We should log obvious failures, such as stack traces of thrown exceptions, and also have the capability to emit and track debug messages.
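As a minimal sketch (function and logger names are illustrative, not an agreed interface): Airflow captures anything emitted through Python's standard `logging` module into per-task log files that surface in the Airflow UI, so failures logged with `logger.exception` arrive there with their full stack trace.

```python
import logging

# Messages emitted through the standard logging module are captured by
# Airflow's task handlers and shown in the task log view of the UI.
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

def transform(records):
    """Illustrative transformation step; the body is a placeholder."""
    logger.debug("transform called with %d records", len(records))
    return [r * 2 for r in records]

def run_task(records):
    try:
        return transform(records)
    except Exception:
        # logger.exception logs at ERROR level and appends the current
        # stack trace, so it appears in the task log without extra work.
        logger.exception("transform failed")
        raise
```

Re-raising after logging keeps Airflow's own failure handling (retries, alerting) intact while still leaving the stack trace in the task log.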
Additionally, we’ll need to track two types of metrics:
- System
- Spark sinks:
    - in / out records
    - CPU usage
- memory usage
- executor counts
- run time
- Dataset
    - Data transformation pre/post-conditions. We should track the sizes and record counts of intermediate and final datasets, to guard against data issues and identify regressions.
    - Summary population statistics, used to identify regressions, population/model drift, and anomalies.
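A pre/post-condition check on record counts could look something like the sketch below (the threshold and metric names are illustrative assumptions, not an agreed interface):

```python
def check_row_counts(rows_in: int, rows_out: int,
                     max_drop_ratio: float = 0.5) -> dict:
    """Compare record counts before and after a transformation and
    flag runs where an unexpectedly large fraction of rows was lost."""
    dropped = rows_in - rows_out
    drop_ratio = dropped / rows_in if rows_in else 0.0
    return {
        "rows_in": rows_in,
        "rows_out": rows_out,
        "drop_ratio": drop_ratio,
        # Flag the run so a later alerting step (or a human) inspects it.
        "suspicious": drop_ratio > max_drop_ratio,
    }

metrics = check_row_counts(rows_in=1000, rows_out=400)
# drop_ratio is 0.6, above the 0.5 threshold, so the run is flagged
```

A dict of metrics like this could be logged, pushed to a metrics collector, or stored alongside the dataset; which of those we actually do is one of the open questions below.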
Open questions / remarks
- We could distinguish between enabling Airflow task logging and adding ad hoc logging facilities to the pipeline itself. Which should be in scope for this iteration?
- For this iteration, would it be sufficient to expose logs via Airflow UI?
- Which events do we ship to a metrics collector rather than logging?
- Where do we store logs? We can currently access Airflow logs; should we ship them to ES?