Wikidata Analytics Bug Report
This task was generated using the Wikidata Analytics bug report form. Please use the task template linked on our project page to report bugs to the team. Thank you!
Behavior
Please provide a concise description of what you’re experiencing and what you’d expect to happen.
WMDE SWE Analytics has various Airflow DAGs for collecting metrics. Because the SLA currently set for these DAGs is too short, many of them are now sending failure alerts. Specifically, we have alerts for the following DAGs:
- wd_item_sitelink_segments_weekly
- wd_coeditors_monthly
- wd_device_type_edits_monthly
Note that wd_query_segments_daily is experiencing a different issue, which is covered in T382407: [Analytics] [Bug] WD query segments workflow is failing. That problem specifically involves the RestExternalTaskSensor, which the other DAGs don't use.
Deadline
Please make the time sensitivity of this bug report clear with a date that it should be completed by. If there is no specific date, then the task will be triaged based on its priority.
Unbreak as soon as possible
Information below this point is filled out by the task assignee.
Assignee Planning
Sub Tasks
A full breakdown of the steps to complete this task.
- Check if adding environ["REQUESTS_CA_BUNDLE"] will fix this issue
  - Not needed, as the SLA was met by a later run
  - We unexpectedly received emails saying the DAGs had failed when they were actually just waiting
- Redeploy all DAGs and make sure that they finish properly
  - Not needed
- Investigate how long the DAGs normally take to finish in order to decide how many days each should wait before sending an SLA notification
  - Currently they're all set to 6 hours in DagProperties
  - Current finish times and resulting SLA changes:
    - wd_coeditors_monthly: Max wait of 21 hours -> new SLA of 2 days
    - wd_device_type_edits_monthly: Max wait of 22 hours -> new SLA of 2 days
    - wd_item_sitelink_segments_weekly: Normally needs to wait 3 days for the data, so moving it to a 4-day SLA (so that it can be checked in a week)
    - wd_query_segments_daily: Has alerted for the 6-hour SLA but is daily, so we'll change it to 12 hours
    - wd_rest_api_metrics_monthly: Has alerted for the 6-hour SLA and is monthly -> new SLA of 2 days
    - wd_rest_api_user_agents_monthly: Hasn't alerted for the 6-hour SLA, but we should change it anyway -> new SLA of 2 days
- Change the SLA for all DAGs so that they wait a bit longer before sending a notification that the SLA hasn't been met
  - Example where this is done is clickstream_monthly_dag.py#L141
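
The SLA values decided on above can be summarized as a small lookup, sketched below. This is only an illustration and not the code in the DAG repository: the names PROPOSED_SLAS, OBSERVED_MAX_WAIT, exceeds_sla and default_args_for are hypothetical. It also assumes that the SLA reaches Airflow as a timedelta under the "sla" key of a DAG's default_args, which is how Airflow 2.x accepts it for tasks. How DagProperties wires this in is not shown here.

```python
from datetime import timedelta

# Value currently set for every DAG in DagProperties, per the notes above.
CURRENT_SLA = timedelta(hours=6)

# Proposed SLAs from the investigation above, keyed by DAG id.
PROPOSED_SLAS = {
    "wd_coeditors_monthly": timedelta(days=2),
    "wd_device_type_edits_monthly": timedelta(days=2),
    "wd_item_sitelink_segments_weekly": timedelta(days=4),
    "wd_query_segments_daily": timedelta(hours=12),
    "wd_rest_api_metrics_monthly": timedelta(days=2),
    "wd_rest_api_user_agents_monthly": timedelta(days=2),
}

# Longest observed waits, only for the DAGs where the task lists one.
OBSERVED_MAX_WAIT = {
    "wd_coeditors_monthly": timedelta(hours=21),
    "wd_device_type_edits_monthly": timedelta(hours=22),
    "wd_item_sitelink_segments_weekly": timedelta(days=3),
}


def exceeds_sla(wait: timedelta, sla: timedelta) -> bool:
    """Return True if a run that waited `wait` would miss the given SLA."""
    return wait > sla


def default_args_for(dag_id: str) -> dict:
    """Hypothetical helper: default_args fragment carrying the DAG's SLA.

    Falls back to the current 6-hour value for DAG ids not listed above.
    """
    return {"sla": PROPOSED_SLAS.get(dag_id, CURRENT_SLA)}


if __name__ == "__main__":
    for dag_id, wait in OBSERVED_MAX_WAIT.items():
        old = exceeds_sla(wait, CURRENT_SLA)
        new = exceeds_sla(wait, PROPOSED_SLAS[dag_id])
        print(f"{dag_id}: misses 6h SLA={old}, misses proposed SLA={new}")
```

With the observed waits listed in this task, every run misses the 6-hour SLA and none misses the proposed one, which matches the alerts described above.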
Estimation
New:
Estimate: 1 day
Actual: 1/2 day
Original:
Estimate: 1.5 days
Actual: 20 min (wasn't actually an error)
Notes
Things that came up during the completion of this task, questions to be answered and follow up tasks.
- Note