Twice in the last month (and possibly 5 times since March), we have received RefineMonitor alerts about wdqs-external.sparql-query ingestion into Hive. These alerts are due to late events in the eqiad.wdqs-external.sparql-query Kafka topic.
We import topic data into HDFS hourlyish, and consider an hour done once events being imported by a task have timestamps greater than the previous hour. Consider the contents of the raw topic data we imported for eqiad.wdqs-external.sparql-query/year=2022/month=06/day=14/hour=04:
12:37:56 [@an-launcher1002:/home/otto] $ hdfs dfs -ls /wmf/data/raw/event/eqiad.wdqs-external.sparql-query/year=2022/month=06/day=14/hour=04 Found 4 items -rw-r----- 3 analytics analytics-privatedata-users 0 2022-06-14 05:09 /wmf/data/raw/event/eqiad.wdqs-external.sparql-query/year=2022/month=06/day=14/hour=04/_IMPORTED -rw-r----- 3 analytics analytics-privatedata-users 6700493 2022-06-14 04:06 /wmf/data/raw/event/eqiad.wdqs-external.sparql-query/year=2022/month=06/day=14/hour=04/part.task_event_default_1655179519742_167_1.txt.gz -rw-r----- 3 analytics analytics-privatedata-users 64539424 2022-06-14 05:07 /wmf/data/raw/event/eqiad.wdqs-external.sparql-query/year=2022/month=06/day=14/hour=04/part.task_event_default_1655183121813_167_0.txt.gz -rw-r----- 3 analytics analytics-privatedata-users 775 2022-06-15 09:06 /wmf/data/raw/event/eqiad.wdqs-external.sparql-query/year=2022/month=06/day=14/hour=04/part.task_event_default_1655283918923_166_1.txt.gz
Based on the mtimes of the files, you can see that data for this hour was first imported by the task that ran at 2022-06-14 04:06, then again at 2022-06-14 05:07. At 2022-06-14 05:07, the events imported in that hour progressed beyond hour 04, and so the _IMPORTED flag file was written at 2022-06-14 05:09 when that task finished.
However, a full day and 4 hours later, some events arrived in the topic with timestamps in the 2022-06-14T04:00:* hour.
Once an _IMPORTED flag has been written, we do not automatically re-run the Refine job if new data comes in. We do alert on this fact. It seems we caught it 2 weeks ago, and today, but there may have been 3 other times in the past when this happened (the past data has since been deleted).
Is there a reason why late events are more likely to happen for the wdqs-external.sparql-query stream than other streams?