We need to handle late-arriving events from Gobblin. Some events can land in the raw repository several hours after the initial import.
The systemd-based Refine scheduler handles late-arriving raw events by scanning the past 26 hours for data that needs to be refined. Even if an hourly partition has already been refined, raw data arriving late within those 26 hours causes the relevant partitions to be re-refined automatically.
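The lookback behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual Refine implementation: the function names, the two dictionaries of timestamps, and the choice to exclude the current (still incomplete) hour are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

LOOKBACK_HOURS = 26  # window size mentioned in the text


def hours_in_lookback(now, lookback_hours=LOOKBACK_HOURS):
    """Return the start times of the hourly partitions in the window, oldest first.

    Assumption: the current, still incomplete hour is excluded.
    """
    current_hour = now.replace(minute=0, second=0, microsecond=0)
    return [current_hour - timedelta(hours=h) for h in range(lookback_hours, 0, -1)]


def partitions_needing_refine(now, raw_mtimes, refined_at):
    """Select hours whose raw data is newer than their last refinement.

    raw_mtimes: hypothetical mapping of hour -> last modification time of raw data.
    refined_at: hypothetical mapping of hour -> time the hour was last refined.
    """
    todo = []
    for hour in hours_in_lookback(now):
        raw_mtime = raw_mtimes.get(hour)
        if raw_mtime is None:
            continue  # no raw data for this hour yet
        last_refine = refined_at.get(hour)
        # Never refined, or raw data changed after the last refinement.
        if last_refine is None or raw_mtime > last_refine:
            todo.append(hour)
    return todo
```

Because the check compares timestamps rather than a simple "already refined" marker, an hour that was refined once is picked up again as soon as newer raw data appears, as long as it is still inside the window.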
This will not be the case with the new Airflow-based RefineDataset job. We could consider implementing something similar by using per-partition done flag timestamps for Hive+Parquet-based tables. However, Iceberg-based tables are not compatible with per-partition done flags, so we'll need another mechanism.
To address this, we propose the following solutions:
Alert on late arrival and manually rerun Airflow tasks
It is possible that late arrival happens infrequently enough that an alert and manual rerun would be sufficient.
Run Another Job in Airflow (e.g. refine_late_arrived_events):
- Create a new job in Airflow to check for late-arrived events.
- Compare the Airflow task run timestamp with the timestamp in the Gobblin done flag.
- Rerun or schedule the necessary refine tasks based on this comparison.
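The comparison in the steps above might look something like the sketch below. It is only an illustration under stated assumptions: `HourStatus`, `needs_rerun`, and `hours_to_rerun` are hypothetical names, and how the Gobblin done flag timestamp and the Airflow task run timestamp are actually obtained is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class HourStatus:
    """Hypothetical record describing one hourly partition."""
    hour: datetime                          # start of the hourly partition
    gobblin_flag_time: Optional[datetime]   # timestamp found in the Gobblin done flag
    refine_run_time: Optional[datetime]     # when the Airflow refine task last ran


def needs_rerun(status):
    """True if raw data was (re)written after the refine task ran."""
    if status.gobblin_flag_time is None:
        return False  # nothing imported for this hour
    if status.refine_run_time is None:
        return False  # assumption: the regular hourly DAG run will handle it
    return status.gobblin_flag_time > status.refine_run_time


def hours_to_rerun(statuses):
    """Return the hours whose refine task should be rerun, oldest first."""
    return sorted(s.hour for s in statuses if needs_rerun(s))
```

The returned hours would then be handed to whatever mechanism clears or re-triggers the corresponding refine tasks.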
Add a New Step in the Newly Created Refine Airflow DAG:
- Integrate a new step into the Refine Hive hourly DAG.
- The step will wait several hours before performing the time-difference check.
- Rerun or schedule the necessary refine tasks based on the result of that check.
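A rough sketch of the delayed check as a single decision function is shown below. The six-hour delay is a placeholder for "several hours", and `check_due_at` and `late_arrival_check` are hypothetical names; in a real DAG the waiting part would more likely be expressed with a sensor or a delayed task than with an explicit comparison against the current time.

```python
from datetime import datetime, timedelta, timezone

CHECK_DELAY = timedelta(hours=6)  # placeholder value for "several hours"


def check_due_at(partition_hour, delay=CHECK_DELAY):
    """Earliest time the check should run: end of the hour plus the delay."""
    return partition_hour + timedelta(hours=1) + delay


def late_arrival_check(now, partition_hour, gobblin_flag_time, refine_run_time,
                       delay=CHECK_DELAY):
    """Return 'wait', 'rerun' or 'ok' for one hourly partition."""
    if now < check_due_at(partition_hour, delay):
        return "wait"  # too early, late events may still arrive
    if gobblin_flag_time > refine_run_time:
        return "rerun"  # raw data changed after the refine task ran
    return "ok"
```

A longer delay catches more late events in one pass but postpones the corrected data, so the value is a trade-off that would need to be tuned against how late events actually arrive.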
Other Solutions
- Evaluate and implement other potential solutions to ensure the timely processing of late-arrived events.
