Harmonize tags across Airflow dags
Closed, Resolved | Public | 3 Estimated Story Points

Description

If you search for druid in the Airflow UI, you may not get the full list of DAGs that interact with Druid.

We should be able to add tags to a DAG automatically when we add a specific kind of operator, e.g. using the HiveToDruidOperator would add the druid tag.

More generally, we should review the tags on all the DAGs we have created.
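
The automatic tagging idea above could be sketched roughly as follows. This is only an illustration, not code from airflow-dags: the `OPERATOR_TAGS` mapping and the `derive_tags` helper are hypothetical names, and the only mapping taken from this task is HiveToDruidOperator => druid. It works on operator class names as plain strings, so it does not depend on any Airflow API.

```python
from typing import Iterable, List

# Hypothetical mapping from operator class name to the tag it implies.
# Only the HiveToDruidOperator -> "druid" pair comes from the task description.
OPERATOR_TAGS = {
    "HiveToDruidOperator": "druid",
}


def derive_tags(operator_class_names: Iterable[str],
                explicit_tags: Iterable[str] = ()) -> List[str]:
    """Return the explicit tags plus any tags implied by the operators used.

    The result is de-duplicated and sorted so that it is stable across runs.
    """
    tags = set(explicit_tags)
    for name in operator_class_names:
        implied = OPERATOR_TAGS.get(name)
        if implied is not None:
            tags.add(implied)
    return sorted(tags)
```

In a real DAG the class names would have to come from the DAG's tasks; that wiring is left out here on purpose.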

  • DE 30 min meeting
  • Implementation
  • Publish criteria

Details

Title: Harmonize Tags on Airflow
Reference: repos/data-engineering/airflow-dags!460
Author: jebe
Source Branch: T336744-Harmonize-tags-in-Airflow-dags
Dest Branch: main

Event Timeline

+1 to revise all tags manually.

Not sure whether automating this will add value, as the manual review should only take an hour?

Today we decided not to automate the tagging process.

JArguello-WMF updated the task description.
JArguello-WMF set the point value for this task to 3.

Here's a list of all Airflow analytics tags as of 2023-06-22.
https://docs.google.com/spreadsheets/d/1XtvtLeZUWIWmEYGF9JYukeszZZ1oO0je8FJVbOKyxpA

The tag types we have so far include: data destination, type of data generated, dataset family, technology used, update schedule, job purpose, DAG property.

Some things I noticed that we could improve (subjective, of course!):

  • Lots of tags are used only once, usually referring to a small dataset family (e.g. referrer, metadata, virtualpageview). I think these are not useful, since they do not group DAGs into categories. They don't add much information.
  • Some tags are used in many DAGs, and could possibly be used in almost all DAGs (e.g. hive, hql, spark). I think these are not useful either, since they don't filter out enough DAGs. Also not much information added.
  • Some tags are vague (e.g. alert, ingest, load, email, check). Most DAGs load and ingest data in some way, and they also alert and send emails in case of failure. Also not much info added.
  • Some tags are duplicated (e.g. druid and druid_load, history and mediawiki_history, dump and dumps). We should unify those, no?
  • Some tags give information that is already in the DAGs summary in Airflow Home (e.g. hourly, daily, monthly, yearly). I don't think we should duplicate that info?
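
The first, second and fourth points above can be checked mechanically. Here is a minimal sketch, assuming the tags have been exported as a mapping from DAG id to tag list (the spreadsheet is not read here). The `audit_tags` name, the 0.8 threshold and the DAG ids are made up for illustration, and the substring check is only a rough heuristic for spotting pairs like druid / druid_load.

```python
from collections import Counter
from typing import Dict, List


def audit_tags(dag_tags: Dict[str, List[str]], common_ratio: float = 0.8) -> dict:
    """Report tags used once, tags on nearly every DAG, and look-alike tag pairs."""
    # Count each tag at most once per DAG.
    counts = Counter(tag for tags in dag_tags.values() for tag in set(tags))
    total = len(dag_tags)
    names = sorted(counts)
    return {
        # Tags that do not group DAGs at all.
        "used_once": [t for t in names if counts[t] == 1],
        # Tags that filter out too few DAGs (threshold is an assumption).
        "near_universal": [t for t in names if counts[t] / total >= common_ratio],
        # Heuristic: one tag name contained in another, e.g. druid / druid_load.
        "possible_duplicates": [(a, b) for a in names for b in names
                                if a != b and a in b],
    }


# Hypothetical DAG ids; the tag names are taken from the comment above.
example = {
    "dag_a": ["hive", "druid", "referrer"],
    "dag_b": ["hive", "druid_load"],
    "dag_c": ["hive", "spark"],
    "dag_d": ["hive", "spark", "druid"],
    "dag_e": ["hive", "dumps"],
}
report = audit_tags(example)
```

Vague tags and tags that repeat the schedule still need a human eye; this only narrows down the list to review.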

We did a 'functional' decomposition of the datasets currently available under the wmf database in Hive over at T337562: Decide how to split wmf database into functional areas. I think the categorization we arrived at there could help here as well:

contributors
data_ops
experiments
mediawiki
readership
traffic
wikidata

You can see the rationale for this at T337562, and the final output is available at https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Iceberg#Changes_to_database_names.

The data engineering team had a meeting, and the conclusion was to base tags on the following criteria:

  • Frequency
  • Ownership
  • Criticality
  • Requires a certain table, e.g. Webrequest
  • Destination of data source, e.g. Iceberg, Hive

  • Remove tags that do not meet these criteria
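
One way to apply the criteria during the manual review is to keep an allow-list per criterion and split each DAG's tags into kept and removed. A rough sketch, under assumptions: the `ALLOWED_TAGS` table and `split_tags` helper are hypothetical, the ownership and criticality values are placeholders (the task does not list them), and the other values are taken from examples mentioned in this task.

```python
from typing import Iterable, List, Tuple

# Allow-list per agreed criterion. Values marked "placeholder" are assumptions.
ALLOWED_TAGS = {
    "frequency": {"hourly", "daily", "monthly", "yearly"},
    "ownership": {"data_engineering"},   # placeholder value
    "criticality": {"critical"},         # placeholder value
    "requires_table": {"webrequest"},
    "destination": {"iceberg", "hive"},
}


def split_tags(tags: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split tags into (kept, removed), comparing case-insensitively."""
    allowed = set().union(*ALLOWED_TAGS.values())
    kept, removed = [], []
    for tag in tags:
        (kept if tag.lower() in allowed else removed).append(tag)
    return kept, removed
```

For example, `split_tags(["daily", "druid_load", "hive", "email"])` keeps `daily` and `hive` and flags `druid_load` and `email` for removal.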