
wdqs/wcqs reconciliation dag stuck on partition datacenter=eqiad/year=2025/month=9/day=18/hour=9
Open, Medium, Public

Description

The two reconciliation DAGs for wdqs & wcqs are stuck waiting for data in the event Hive DB.

This problem occurred in the past and is generally caused by either canary events or refinery.
Link to dags:

The root cause still needs to be identified, but the wcqs DAG has a failed sensor and may need manual action even after the root cause is fixed.

Event Timeline

I manually cleared the stuck tasks, and the dag is now catching up.

I could not determine the root cause of what went wrong. The partition this task was waiting for was there. Maybe the task got stuck on a late data arrival.
@dcausse I am not familiar with this DAG's failure modes yet, but maybe we could increase the busy-wait timeout and/or add retry-on-error logic?
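For what it's worth, the busy-wait-plus-retry idea can be sketched in plain Python (an illustrative sketch only, not the actual Airflow sensor; `check_partition` and all parameter names are hypothetical):

```python
import time

def wait_for_partition(check_partition, timeout_s=3600.0,
                       poke_interval_s=60.0, retries=3):
    """Poll check_partition() until it returns True, the timeout
    elapses, or an unexpected error has been retried too many times."""
    deadline = time.monotonic() + timeout_s
    failures = 0
    while time.monotonic() < deadline:
        try:
            if check_partition():
                return True
        except Exception:
            failures += 1
            if failures > retries:
                raise  # give up only after exhausting the retry budget
        time.sleep(poke_interval_s)
    return False  # timed out: the partition never showed up
```

Airflow's reschedule-mode sensors behave similarly in spirit, except the worker slot is released between pokes instead of sleeping in-process.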

Thanks for taking care of this!

These sensors should have a pretty long timeout; I suspect that Airflow misbehaved somehow. Looking at the events of this sensor, I see:
2025-09-18, 11:01:42 UTC 1 state mismatch Executor KubernetesExecutor(parallelism=64) reported that the task instance <TaskInstance: wcqs_streaming_updater_reconcile_hourly.wait_for_event.rdf_streaming_updater_state_inconsistency scheduled__2025-09-18T09:00:00+00:00 [queued]> finished with state failed, but the task instance's state attribute is queued. Learn more: https://airflow.apache.org/docs/apache-airflow/stable/troubleshooting.html#task-state-changed-externally

And the last log from the sensor:
[2025-09-18, 10:49:55 UTC] {taskinstance.py:310} INFO - Rescheduling task, marking task as UP_FOR_RESCHEDULE

So somehow Airflow lost track of something. I would consider this task done, but we could possibly follow up with DP to see if this state mismatch is a symptom of some issue we have experienced before.

My assumption about fiddling with timeouts came from the link the logs point us to.
But I don't know this system well enough to meaningfully troubleshoot.

I discussed this earlier with @JAllemandou, and it's worth flagging to DE. I'll follow up on Slack.

It feels like a Refine problem:

hdfs dfs -ls -h /wmf/data/event/rdf_streaming_updater_state_inconsistency/datacenter=eqiad/year=2025/month=9/day=18/
Found 2 items
drwxr-x---   - analytics analytics-privatedata-users          0 2025-09-18 15:59 /wmf/data/event/rdf_streaming_updater_state_inconsistency/datacenter=eqiad/year=2025/month=9/day=18/hour=14
drwxr-x---   - analytics analytics-privatedata-users          0 2025-09-18 09:00 /wmf/data/event/rdf_streaming_updater_state_inconsistency/datacenter=eqiad/year=2025/month=9/day=18/hour=7

I would have expected 24 hourly partitions on 9/18, not 2.
I'm on it.
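To quickly spot which hourly partitions are absent from a listing like the one above, a small stdlib helper works (illustrative sketch; only the `hour=N` path suffix convention is taken from the listing):

```python
import re

def missing_hours(partition_paths, expected_hours=range(24)):
    """Return the hours of the day with no matching hour=N path."""
    found = set()
    for path in partition_paths:
        m = re.search(r"hour=(\d+)$", path)
        if m:
            found.add(int(m.group(1)))
    return sorted(set(expected_hours) - found)
```

Fed the two paths from the listing above (hour=7 and hour=14), this reports the other 22 hours as missing.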

Interestingly, the partitions exist in Hive, but there is no folder/file because only canary events were present in the raw data. This seems OK to me as long as downstream jobs wait on Hive partitions and not files.
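The distinction can be illustrated with a tiny sketch (hypothetical in-memory data, not a real metastore query): a canary-only partition is registered in Hive even though no data files exist, so a sensor keyed on the metastore sees it as ready while one keyed on files would block forever:

```python
def readiness(hive_partitions, hdfs_dirs, partition):
    """Compare the two sensing strategies for a single partition."""
    return {
        "metastore_ready": partition in hive_partitions,  # tolerates canary-only partitions
        "files_ready": partition in hdfs_dirs,            # blocks on empty (canary-only) ones
    }
```

For a canary-only hour, `metastore_ready` is true while `files_ready` is false, which is exactly why waiting on Hive partitions is the safer contract.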

@Antoine_Quhen @JAllemandou thanks for taking a look. I think my issue is more with the sensor task, which suddenly failed (instead of waiting out its allowed timeout) with:
2025-09-18, 11:01:42 UTC 1 state mismatch Executor KubernetesExecutor(parallelism=64) reported that the task instance <TaskInstance: wcqs_streaming_updater_reconcile_hourly.wait_for_event.rdf_streaming_updater_state_inconsistency scheduled__2025-09-18T09:00:00+00:00 [queued]> finished with state failed, but the task instance's state attribute is queued. Learn more: https://airflow.apache.org/docs/apache-airflow/stable/troubleshooting.html#task-state-changed-externally

Since it failed, the DAG did not recover when the data finally arrived. Is this state mismatch error something you've seen happening in your sensors?
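One possible follow-up, sketched abstractly (hypothetical helper, not an Airflow API): distinguish a sensor that genuinely exhausted its timeout from one whose task instance was failed externally, and only retry the latter:

```python
def should_retry(failure_reason, timed_out):
    """A sensor that hit its timeout genuinely failed; one failed
    externally (e.g. an executor/scheduler state mismatch) never
    exhausted its wait and is safe to re-queue."""
    return (not timed_out) and "state mismatch" in failure_reason.lower()
```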

I have not seen this error before. I'll keep it in mind when failures occur on our end.

BTracy-WMF triaged this task as Medium priority. Dec 3 2025, 8:58 PM
BTracy-WMF moved this task from Incoming to Analysis on the Wikidata-Query-Service board.