
[Analytics] [Bug] WMDE Airflow DAGs are sending excessive SLA config requirement emails
Closed, Resolved (Public)

Description

Wikidata Analytics Bug Report

This task was generated using the Wikidata Analytics bug report form. Please use the task template linked on our project page to report bugs to the team. Thank you!

Behavior

Please provide a concise description of what you’re experiencing and what you’d expect to happen.

WMDE SWE Analytics has various Airflow DAGs for collecting metrics. Because the SLA for the Airflow DAGs is currently set low, many of them are now failing. Specifically, we have alerts for the following DAGs:

  • wd_item_sitelink_segments_weekly
  • wd_coeditors_monthly
  • wd_device_type_edits_monthly

Note that wd_query_segments_daily is experiencing a different issue, which is covered in T382407: [Analytics] [Bug] WD query segments workflow is failing. That problem specifically involves the RestExternalTaskSensor, which the other DAGs don't use.

Deadline

Please make the time sensitivity of this bug report clear with a date that it should be completed by. If there is no specific date, then the task will be triaged based on its priority.

Unbreak as soon as possible


Information below this point is filled out by the task assignee.

Assignee Planning

Sub Tasks

A full breakdown of the steps to complete this task.

  • Check if adding environ["REQUESTS_CA_BUNDLE"] will fix this issue
    • Not needed as the SLA was met by a later run
    • We unexpectedly got emails that said they failed when in actuality they were just waiting
  • Redeploy all DAGs and make sure that they finish properly
    • Not needed
  • Investigate the time that it takes for the DAGs to normally finish to make a decision on how many days each should wait before sending an SLA notification
    • Currently they're all set to 6 hours in DagProperties
    • Current finish times are
      • wd_coeditors_monthly: Max wait of 21 hours -> new SLA of 2 days
      • wd_device_type_edits_monthly: Max wait of 22 hours -> new SLA of 2 days
      • wd_item_sitelink_segments_weekly: Normally needs to wait 3 days for the data -> new SLA of 4 days (so that it can be checked in a week)
      • wd_query_segments_daily: Has alerted for the 6 hour SLA but is daily -> new SLA of 12 hours
      • wd_rest_api_metrics_monthly: Has alerted for 6 hour SLA and is monthly -> new SLA of 2 days
      • wd_rest_api_user_agents_monthly: Hasn't alerted for 6 hour SLA, but we should change -> new SLA of 2 days
  • Change SLA for all DAGs so that they wait a bit longer before sending a notification that the SLA hasn't been met
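The SLA changes listed above could be captured roughly as follows. This is only a minimal sketch, not the actual DagProperties configuration: the names DEFAULT_SLA, DAG_SLAS, sla_for and default_args_for are hypothetical, and the only assumption about Airflow itself is that an SLA is given as a timedelta through a task's `sla` argument (for example via default_args).

```python
from datetime import timedelta

# Previous setting: every DAG used a 6 hour SLA.
DEFAULT_SLA = timedelta(hours=6)

# Hypothetical per-DAG SLA table; the values mirror the sub-task list above.
DAG_SLAS = {
    "wd_coeditors_monthly": timedelta(days=2),
    "wd_device_type_edits_monthly": timedelta(days=2),
    "wd_item_sitelink_segments_weekly": timedelta(days=4),
    "wd_query_segments_daily": timedelta(hours=12),
    "wd_rest_api_metrics_monthly": timedelta(days=2),
    "wd_rest_api_user_agents_monthly": timedelta(days=2),
}


def sla_for(dag_id: str) -> timedelta:
    """Return the SLA for a DAG, falling back to the old 6 hour default."""
    return DAG_SLAS.get(dag_id, DEFAULT_SLA)


def default_args_for(dag_id: str) -> dict:
    """Build a default_args-style dict carrying the SLA (illustrative only)."""
    return {"sla": sla_for(dag_id)}
```

Keeping the values in one table makes it easy to see at a glance which DAGs deviate from the default and by how much.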

Estimation

New:
Estimate: 1 day
Actual: 1/2 day

Original:
Estimate: 1.5 days
Actual: 20 min (wasn't actually an error)

Notes

Things that came up during the completion of this task, questions to be answered and follow up tasks.

  • Note

Event Timeline

AndrewTavis_WMDE renamed this task from [Analytics] [Bug] Please add a descriptive title to [Analytics] [Bug] WMDE Airflow DAGs are failing due to new SLA config requirements. Jan 2 2025, 11:48 AM
AndrewTavis_WMDE claimed this task.
AndrewTavis_WMDE triaged this task as High priority.

Note the DAGs don't actually appear to be failing, but are just notifying us when they need to wait (as I understand it). Moving this to review so we can discuss the current state of notifications for the DAGs. As of now they appear to be alerting on unexpected statuses.

AndrewTavis_WMDE lowered the priority of this task from High to Low. Jan 2 2025, 4:37 PM

2025 w3 review: ticket will stay in review until we confirm how the email notifications will be updated

AndrewTavis_WMDE renamed this task from [Analytics] [Bug] WMDE Airflow DAGs are failing due to new SLA config requirements to [Analytics] [Bug] WMDE Airflow DAGs are sending excessive SLA config requirement emails. Jan 14 2025, 12:36 PM
AndrewTavis_WMDE updated the task description.
AndrewTavis_WMDE updated the task description.
AndrewTavis_WMDE moved this task from In Review to To-Do on the Wikidata Analytics (Kanban) board.

Renamed to reflect the scope a bit better. We can change the amount of time that the DAGs wait for their SLA before notifying, with the general thought being 1-2 days based on how long the DAGs normally take to finish.

Moving to review as the changes have been merged. Let's wait for another run of the weekly DAG and each of the monthly DAGs before closing this :)

So this task can be closed in the review/planning session on February 3rd or 10th.

Note further that the data report should be updated with the latest REST API metrics when we check that this DAG has finished successfully.

w5-w7 review: the initial changes did not sufficiently address the errors, so additional work will be done

Moved back to to-do to reflect prioritization :)

Note on this: These emails are coming even if a DAG just needs to wait a bit to get added to the queue, as shown when we launched the five new DAGs for the Graphite to Airflow process. They had their SLAs set to six hours as they're daily, but we were then getting emails after a few minutes.

My assumption at this point is that this is happening because of a buildup of the backlog of DAGs trying to run. I've changed the SLA settings within each of the DAGs to be more appropriate, and have asked WMF about the other SLA emails. No solution came from that, so I'd suggest that we resolve this and open a new task if this again becomes a major issue :)
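To illustrate the backlog assumption above: if the SLA deadline is counted from a run's scheduled time rather than from the moment the run actually starts (which is my understanding of Airflow's SLA check, not something confirmed in this task), then a run that sits in the queue for most of its SLA window can alert within minutes of starting. The function names and timestamps below are hypothetical and only sketch that arithmetic.

```python
from datetime import datetime, timedelta


def sla_missed(scheduled_at: datetime, sla: timedelta, now: datetime) -> bool:
    """True if the SLA deadline, counted from the scheduled time, has passed."""
    return now > scheduled_at + sla


def minutes_until_alert(
    scheduled_at: datetime, sla: timedelta, queued_at: datetime
) -> float:
    """Minutes left before the SLA deadline once the run is finally queued."""
    remaining = (scheduled_at + sla) - queued_at
    # A run queued after its deadline has no time left at all.
    return max(remaining, timedelta(0)).total_seconds() / 60


# Hypothetical daily run: scheduled at midnight with a six hour SLA,
# but only picked up at 05:55 because of a backlog.
scheduled = datetime(2025, 2, 10, 0, 0)
picked_up = datetime(2025, 2, 10, 5, 55)
print(minutes_until_alert(scheduled, timedelta(hours=6), picked_up))
```

Under this reading, lengthening the SLA gives delayed runs more headroom but does not remove the underlying queueing delay.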

Maybe reducing the wd_item_sitelink_segments DAG back to 2 days would be something that should be added to T386340. We ultimately haven't been getting the alert emails there as the DAG is off because of the issues there. Updating that task.

SLA emails have decreased. The issue has been dealt with for now, and a new ticket will be created should it pop up again.