
[SPIKE] Create Generic Components for Scheduling
Closed, ResolvedPublic5 Estimated Story Points

Description

User Story
As a platform engineer, I need to review our current DAGs to find common patterns, generalize shared functionality, and possibly drop some non-idiomatic patterns.
Success Criteria
  • A list of common patterns and a design for generalized functionality across DAGs

Event Timeline

lbowmaker renamed this task from Create Generic Components for Scheduling (needs grooming to be more tasks) to [SPIKE] Create Generic Components for Scheduling.Nov 9 2021, 2:02 PM
lbowmaker reassigned this task from lbowmaker to gmodena.
lbowmaker set Due Date to Nov 30 2021, 5:00 AM.
lbowmaker updated the task description. (Show Details)
lbowmaker moved this task from Backlog to Ready/Groomed 📚 on the Generated Data Platform board.

After reviewing our code and documentation, industry best practices, and internal discussion, I propose adopting the following principles as part of our design and practices for generic DAGs.

Design principles

We should make sure that tasks are reproducible and follow functional programming paradigms:

  • Tasks should be idempotent.
  • Tasks should be deterministic.
  • Tasks should not generate side effects.

We should provide documentation and guidelines with examples of how to achieve that.
@lbowmaker @Clarakosi maybe we could make a documentation phab task out of this?

Implementation details

From an implementation perspective, DAGs should follow these principles:

  • DAGs should be a thin layer and not contain any business logic, queries, computation etc.
  • Specify config details consistently, by moving parameters to a config file.
  • Group tasks in the Airflow UI, to make the status of the data processing, enrichment, and export steps of a pipeline explicit to users.
  • Avoid local computations (our Airflow instance shares scheduler and task executor processes). Currently we target the YARN resource manager of elastic compute. Data pipelines are implemented atop Apache Spark. Spark jobs must be submitted from the Airflow instance and run in cluster mode.
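As a sketch of the "thin DAG, config file, cluster mode" points above, the snippet below keeps all parameters in a config mapping and derives the spark-submit invocation from it; the config keys and job path are hypothetical examples. The DAG file itself would only wire this command into an operator, with all business logic living in the Spark job.

```python
# Hypothetical pipeline config; in practice this would live in a separate
# config file rather than inside the DAG module.
PIPELINE_CONFIG = {
    "spark_master": "yarn",
    # Run the driver on the cluster, not on the shared Airflow host.
    "deploy_mode": "cluster",
    "application": "hdfs:///pipelines/enrich/job.py",
    "conf": {"spark.executor.memory": "4g"},
}


def build_spark_submit_cmd(config):
    """Assemble a spark-submit command line from a config mapping.

    Keeping this assembly config-driven means the DAG stays a thin layer:
    no queries, computation, or business logic are embedded in it.
    """
    cmd = [
        "spark-submit",
        "--master", config["spark_master"],
        "--deploy-mode", config["deploy_mode"],
    ]
    # Sort for a deterministic command line.
    for key, value in sorted(config.get("conf", {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(config["application"])
    return cmd
```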

image-matching

To embrace the above principles, we should refactor image-matching to:

These changes depend on https://phabricator.wikimedia.org/T292740 and https://phabricator.wikimedia.org/T280585. Once those tasks are merged, refactoring the rest will be trivial.

Further improvements

This could be split into a separate task, but we should


@gmodena / @lbowmaker: Hi, the Due Date set for this open task passed a while ago.
Could you please either update or reset the Due Date (by clicking Edit Task), or set the status of this task to resolved in case this task is done? Thanks!

lbowmaker updated the task description. (Show Details)