
Fix generation of _IMPORTED flags by Gobblin
Closed, Resolved (Public)

Description

Our current setup of Gobblin should generate _IMPORTED flags after importing data from a topic.

It's not always the case since:

Since we are planning to migrate the Gobblin process to Airflow, we could add a task there to generate the flag.

As a short-term solution, however, we could generate 2 canary events per hour from Airflow, so that each Gobblin run reads data spanning 2 hours:

        H-1                                       H
--------+------+----------+--------------+--------+------+----------+-----------
               Canary1    Prev Gobblin   Canary2         Canary3    Gobblin

               ]====================================================] Gobblin read span

In the diagram above, the Gobblin run reads data from 2 different hours, because it picks up both Canary2 (produced in hour H-1) and Canary3 (produced in hour H).

We could run:

  • Gobblin at HH:15
  • Airflow canary_event at HH:02 and HH:35
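
The timing argument can be checked with a little arithmetic. The following is only a sketch: it assumes each Gobblin run reads everything produced since the previous hourly run, the function names are made up for illustration, and the date is arbitrary. It uses the proposed schedule (Gobblin at HH:15, canary events at HH:02 and HH:35) and lists which canary events one run would pick up.

```python
from datetime import datetime, timedelta

# Schedule proposed in this task.
GOBBLIN_MINUTE = 15        # Gobblin runs at HH:15
CANARY_MINUTES = (2, 35)   # canary events at HH:02 and HH:35


def canaries_in_read_span(run):
    """Canary event times in (previous run, this run].

    Assumption: a run reads everything produced since the previous
    hourly run. This helper is hypothetical, not part of Gobblin.
    """
    prev_run = run - timedelta(hours=1)
    candidates = [
        base.replace(minute=minute)
        for base in (prev_run.replace(minute=0), run.replace(minute=0))
        for minute in CANARY_MINUTES
    ]
    return sorted(t for t in candidates if prev_run < t <= run)


def hours_covered(run):
    """Distinct hours touched by the canary events read in one run."""
    return sorted({t.replace(minute=0) for t in canaries_in_read_span(run)})


# Arbitrary example run at 10:15.
run = datetime(2024, 5, 17, 10, GOBBLIN_MINUTE)
events = canaries_in_read_span(run)
hours = hours_covered(run)
print([t.strftime("%H:%M") for t in events])  # ['09:35', '10:02']
print(len(hours))                             # 2
```

Under these assumptions, every run sees one canary event from the previous hour (HH-1:35) and one from the current hour (HH:02), which corresponds to Canary2 and Canary3 in the diagram.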

Event Timeline

Change #1032715 had a related patch set uploaded (by Aqu; author: Aqu):

[operations/puppet@production] Run Gobblin later to let time for Canary events

https://gerrit.wikimedia.org/r/1032715

Change #1032715 merged by Btullis:

[operations/puppet@production] Run Gobblin later to let time for Canary events

https://gerrit.wikimedia.org/r/1032715

@Antoine_Quhen now that canary events are being produced twice an hour, can we resolve this?

Antoine_Quhen claimed this task.