
When gobblin fails, we should know about it
Closed, Resolved | Public

Description

While migrating event ingestion to Gobblin, we encountered a situation where a Gobblin run was killed or failed without removing its lock file. The stale lock file caused subsequent runs of the job to fail to launch. When this happened, we did not receive any systemd service failure alert emails.

There are several issues here:

  1. Gobblin should remove its lock file if it is killed or if it fails.
  2. Gobblin should exit non-zero if it fails, and we should receive an email alert.
  3. Gobblin should exit non-zero if it cannot launch because a lock file exists, and we should receive an alert.

I'm not sure that 3. is the right behavior: there will be natural moments, e.g. during a backfill, when a Gobblin job takes longer than its usual schedule interval and the next run cannot launch until the backfill is finished. I suppose an alert is OK in this case, but it would be a false alarm that just needs to be ACKed.
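To make the three points concrete, here is a minimal sketch of a launcher wrapper in Python. It is not how Gobblin or the refinery scripts actually work. The local lock file stands in for Gobblin's own lock, and `run_with_lock`, `main`, and the `EXIT_LOCKED` value are hypothetical names chosen for illustration. Using a distinct exit code for the "already locked" case would let alerting tell a long backfill apart from a real failure, if that turns out to be desirable.

```python
import os
import signal
import subprocess
import sys

# Hypothetical exit code for "a previous run still holds the lock".
# 75 is EX_TEMPFAIL in sysexits.h; any distinct non-zero value would do.
EXIT_LOCKED = 75


def run_with_lock(cmd, lock_path):
    """Run cmd under an exclusive lock file and return the exit code to propagate."""
    try:
        # O_EXCL makes creation atomic: it fails if the lock file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        # Point 3: refuse to launch and report it with a non-zero exit code.
        print(f"lock {lock_path} exists, not launching", file=sys.stderr)
        return EXIT_LOCKED
    try:
        with os.fdopen(fd, "w") as lock_file:
            lock_file.write(str(os.getpid()))
        # Point 2: pass the child's exit status through unchanged.
        return subprocess.call(cmd)
    finally:
        # Point 1: remove the lock on success, failure, or exception.
        os.unlink(lock_path)


def _raise_on_sigterm(signum, frame):
    # Turn SIGTERM into an exception so the finally block above still runs.
    raise SystemExit(128 + signum)


def main(argv):
    """Expected arguments: LOCK_PATH COMMAND [ARGS...] (hypothetical CLI)."""
    signal.signal(signal.SIGTERM, _raise_on_sigterm)
    return run_with_lock(argv[2:], argv[1])
```

A script would call `sys.exit(main(sys.argv))`, so that a non-zero return value is seen by systemd as a failed service run. Note that SIGKILL cannot be caught, so a wrapper like this cannot clean up after a hard kill. That limitation is one reason to check for an already running job instead of relying on a lock file.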

Event Timeline

mforns triaged this task as High priority. Jul 19 2021, 3:39 PM
mforns moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

Change 705970 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery@master] Don't use Gobblin lock but rather yarn check

https://gerrit.wikimedia.org/r/705970

Change 705970 merged by Ottomata:

[analytics/refinery@master] Don't use Gobblin lock but rather yarn check

https://gerrit.wikimedia.org/r/705970

Change 706696 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery@master] Add property disabling gobblin lock

https://gerrit.wikimedia.org/r/706696

Change 706696 merged by Ottomata:

[analytics/refinery@master] Add property disabling gobblin lock

https://gerrit.wikimedia.org/r/706696