While migrating event ingestion to Gobblin, we encountered a situation where a Gobblin run was killed or failed without removing its lock file. This caused subsequent runs of the job to fail to launch. When this happened, we did not receive any systemd service failure alert emails.
There are several issues here:
- Gobblin should remove its lock file if it is killed or if it fails.
- Gobblin should exit non-zero if it fails, and we should receive an email alert.
- Gobblin should exit non-zero if it cannot launch because of an existing lock file, and we should receive an alert.
I'm not sure the third point is the right behavior. There will be natural moments, e.g. during a backfill, when a Gobblin job takes longer than its usual schedule interval and the next run cannot launch until the backfill has finished. I suppose an alert is OK in this case, but it would be a false alarm that just needs to be ACKed.
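
To make the three behaviors above concrete, here is a minimal sketch of a wrapper script that enforces them from outside the job. It is not how Gobblin manages its own lock. It assumes a lock file on the local filesystem, and the names `run_with_lock`, `main`, and `LOCK_HELD_EXIT_CODE` (and the value 75) are hypothetical. A SIGKILL cannot be caught, so a stale lock is still possible in that case.

```python
import os
import signal
import subprocess
import sys

# Hypothetical exit code for "lock already held"; a distinct value lets
# alerting tell a skipped run (e.g. during a backfill) from a real failure.
LOCK_HELD_EXIT_CODE = 75


def _raise_on_sigterm(signum, frame):
    # Turn SIGTERM into an exception so the finally block below still runs.
    raise SystemExit(128 + signum)


def run_with_lock(command, lock_path):
    """Run command while holding lock_path; return an exit code."""
    try:
        # O_EXCL makes creation atomic: it fails if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        print(f"lock file {lock_path} exists, not launching", file=sys.stderr)
        return LOCK_HELD_EXIT_CODE
    try:
        with os.fdopen(fd, "w") as lock_file:
            lock_file.write(str(os.getpid()))
        returncode = subprocess.run(command).returncode
        # A negative return code means the child was killed by a signal;
        # map it to the conventional 128 + signal number.
        return returncode if returncode >= 0 else 128 - returncode
    finally:
        # Runs on success, failure, and SIGTERM, so the lock never lingers.
        os.remove(lock_path)


def main(argv):
    # Assumed usage: wrapper.py LOCK_PATH COMMAND [ARGS...]
    signal.signal(signal.SIGTERM, _raise_on_sigterm)
    return run_with_lock(argv[2:], argv[1])
```

A script would end with `sys.exit(main(sys.argv))`. Because systemd treats a non-zero exit status as a unit failure by default, propagating the exit code this way is what lets the existing failure alerting fire.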