Page MenuHomePhabricator

Late events in wdqs-external.sparql-query?
Open, Needs TriagePublic

Description

Twice in the last month (and possibly 5 times since March), we have received RefineMonitor alerts about wdqs-external.sparql-query ingestion into Hive. These alerts are due to late events in the eqiad.wdqs-external.sparql-query Kafka topic.

We import topic data into HDFS hourlyish, and consider an hour done once events being imported by a task have timestamps greater than the previous hour. Consider the contents of the raw topic data we imported for eqiad.wdqs-external.sparql-query/year=2022/month=06/day=14/hour=04:

12:37:56 [@an-launcher1002:/home/otto] $ hdfs dfs -ls /wmf/data/raw/event/eqiad.wdqs-external.sparql-query/year=2022/month=06/day=14/hour=04
Found 4 items
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-06-14 05:09 /wmf/data/raw/event/eqiad.wdqs-external.sparql-query/year=2022/month=06/day=14/hour=04/_IMPORTED
-rw-r-----   3 analytics analytics-privatedata-users    6700493 2022-06-14 04:06 /wmf/data/raw/event/eqiad.wdqs-external.sparql-query/year=2022/month=06/day=14/hour=04/part.task_event_default_1655179519742_167_1.txt.gz
-rw-r-----   3 analytics analytics-privatedata-users   64539424 2022-06-14 05:07 /wmf/data/raw/event/eqiad.wdqs-external.sparql-query/year=2022/month=06/day=14/hour=04/part.task_event_default_1655183121813_167_0.txt.gz
-rw-r-----   3 analytics analytics-privatedata-users        775 2022-06-15 09:06 /wmf/data/raw/event/eqiad.wdqs-external.sparql-query/year=2022/month=06/day=14/hour=04/part.task_event_default_1655283918923_166_1.txt.gz

Based on the mtimes of the files, you can see that data for this hour was first imported by the task that ran at 2022-06-14 04:06, then again at 2022-06-14 05:07. At 2022-06-14 05:07, the events imported in that hour progressed beyond hour 04, and so the _IMPORTED flag file was written at 2022-06-14 05:09 when that task finished.

However, a full day and 4 hours later, some events arrived in the topic with timestamps in the 2022-06-14T04:00:* hour.

Once an _IMPORTED flag has been written, we do not automatically re-run the Refine job if new data comes in. We do alert on this fact. It seems we caught it 2 weeks ago, and today, but there may have been 3 other times in the past when this happened (the past data has since been deleted).

Is there a reason why late events are more likely to happen for the wdqs-external.sparql-query stream than other streams?

Event Timeline