Our current setup of Gobblin should generate _IMPORTED flags after importing data from a topic.
This is not always the case, because:
- Sometimes, the legacy canary event generator (from systemd) is down without crashing.
- Our new canary event generator triggered from Airflow is generating a single canary event per hour.
- Our setup of Gobblin expects multiple events per hour to flag an hour. See https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/gobblin-wmf/+/refs/heads/main/gobblin-wmf-core/src/main/java/org/wikimedia/gobblin/publisher/TimePartitionedFlagDataPublisher.java#129
- Some of our topics receive no events during some hours, so they rely heavily on canary events.
Since we are planning to migrate the Gobblin process to Airflow, we could add a task there to generate the flag.
In the short term, however, we could generate 2 canary events per hour from Airflow, so that each Gobblin run reads data spanning 2 hours:
        H-1                                       H
--------+------+----------+--------------+--------+------+----------+-----------
               Canary1    Prev Gobblin   Canary2         Canary3    Gobblin
                          ]=========================================] Gobblin read span
In the diagram above, we can see that Gobblin reads data from 2 hours thanks to the events Canary2 and Canary3.
We could run:
- Gobblin at HH:15
- Airflow canary_event at HH:02 and HH:35
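To sanity-check the proposed schedule, here is a minimal Python sketch. It makes several assumptions that are not confirmed against the real Gobblin job: runs are hourly, each run reads the span (previous run, current run], and a canary event's timestamp equals the time it was emitted. The function name `hours_covered` and the example date are made up for illustration.

```python
from datetime import datetime, timedelta


def hours_covered(run_time, canary_minutes):
    """Return the distinct hours containing a canary event inside the
    read span (run_time - 1 hour, run_time] of an hourly Gobblin run.

    Assumption: canary events are emitted at the given minutes of every hour.
    """
    span_start = run_time - timedelta(hours=1)
    current_hour = run_time.replace(minute=0, second=0, microsecond=0)
    hours = set()
    # An hourly read span can only touch the current hour and the previous one.
    for hour_start in (current_hour - timedelta(hours=1), current_hour):
        for minute in canary_minutes:
            event_time = hour_start + timedelta(minutes=minute)
            if span_start < event_time <= run_time:
                hours.add(hour_start)
    return sorted(hours)


# Hypothetical Gobblin run at 10:15 (arbitrary date).
run = datetime(2024, 1, 1, 10, 15)

# Proposed schedule: canaries at HH:02 and HH:35.
proposed = hours_covered(run, [2, 35])

# Current schedule: a single canary per hour, assumed here to be at HH:02.
single = hours_covered(run, [2])

print(len(proposed), len(single))
```

Under these assumptions, the proposed schedule puts canary events from 2 different hours in every read span (Canary2 at 09:35 and Canary3 at 10:02), while a single canary per hour only ever covers 1 hour.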