Page MenuHomePhabricator

Investigate oozie banner monthly job timeouts
Closed, ResolvedPublic

Description

Old Oozie banner monthly instances are marked as timedout while no timeout is defined for those jobs.
This means sanitization/compaction has not run for those month.

Event Timeline

Hm, this looks to be related to or caused by T222603: Fix oozie banner_impression monthly job.

The failed months timed out waiting for a dependency 2 months and 19 days after the month in question. See:
https://hue.wikimedia.org/oozie/list_oozie_coordinator/0000497-191216160148723-oozie-oozi-C/

For example, the March monthly job had a dependency for /wmf/data/wmf/banner_activity/daily/year=2020/month=5/day=19/_SUCCESS.

@JAllemandou is this a problem with the way the dependency is defined, or is this correct? Are you just saying that the job needs a longer timeout?

Are you just saying that the job needs a longer timeout?

Actually there is no timeout defined for this coordinator (see https://github.com/wikimedia/analytics-refinery/blob/master/oozie/banner_activity/druid/monthly/coordinator.xml#L39)
I wonder if the problem could come from oozie DB retention being less than 3 months, and forcing jobs to timeout even if they have no defined timeout value.
The failed instance of https://hue.wikimedia.org/oozie/list_oozie_coordinator/0034295-200303081945184-oozie-oozi-C/ gives hints in that sense. This is the only other coordinator I know of that has waiting period longer than 3 months.

I wonder if the -1 timeout in coordinator.xml conflicts with the Oozie property [[ https://oozie.apache.org/docs/3.1.3-incubating/oozie-default.xml | oozie.service.coord.default.max.timeout ]], which defaults to 60 days. It does seem likely, as the workflows and being put in TIMEDOUT 60 days after the created time.

I'm going to push a patch to set this property oozie puppet to 13 months, restart Oozie, and re run the few jobs we can. I can't really reproduce this at the moment, but 60 days after Oct 15 we should know if the October monthly job times out or not.

Change 644535 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Set oozie.service.coord.default.max.timeout to 13 months

https://gerrit.wikimedia.org/r/644535

Change 644535 merged by Elukey:
[operations/puppet@production] Set oozie.service.coord.default.max.timeout to 13 months

https://gerrit.wikimedia.org/r/644535

Mentioned in SAL (#wikimedia-analytics) [2020-12-02T07:49:23Z] <elukey> restart oozie to pick up new settings for T264358

I relaunched timedout tasks, manually recreated missing _SUCCESS flags, and jobs succeeded! This looks like a solved problem to me :)

Well, those ones would succeed because now the _SUCCESS flags exist. The problem is when the job times out before the _SUCCESS flags exist. If the October job succeeds without timing out before the the Dec 19th dependency exists, we'll know we fixed it.

From https://github.com/wikimedia/analytics-refinery/blob/master/oozie/banner_activity/druid/monthly/coordinator.xml#L52, the dataset dependency is set up to wait for both:

  • the full running momth month of daily _SUCCESS files (every days of October must have a success-flag for the octboer job to start)
  • The daily _SUCCESS file of the one day happening 80 days after the first of the running month (December 19th success-flag must be present for the October job to start)

From https://github.com/wikimedia/puppet/blob/6b7f98ce8a96e4ea09c4a170b4b25773a0b9641b/modules/profile/manifests/analytics/refinery/job/data_purge.pp#L244 we purge daily success-files after 90 days.

  • On December 19th success-files will be purged up to September 20th.

The October job will therefore be able to run as expected :)

The problem we were having was that the jobs were timingout before execution time due to the long delay (longer than 60 days), leading to success-files being deleted for them when restarted.
I confirm this is solved for me :)

The problem we were having was that the jobs were timingout before execution time due to the long delay (longer than 60 days), leading to success-files being deleted for them when restarted.

Right but, I assume the months prior to October succeeded now because you re-ran manually, and their 80 day future dependencies all exist. For October, the timeout was 60 days (I'm not sure when that timeout starts counting...from Oct 1?). If the fix worked, then the Oct job should still stay in waiting mode for the Dec 19 dependency all the way up til Dec 19 and not be killed before then (when the 60 day timeout would have been reached). I guess if the 60 day timeout starts counting from Oct 1, then we should indeed be beyond it and I'd expect the Oct job to launch and succeed after Dec 19.

Ah! I get it ! sorry for being slow :)

The October job still timed out after 60 days :(

https://hue.wikimedia.org/oozie/list_oozie_coordinator/0000497-191216160148723-oozie-oozi-C/

15:08:15 [@an-launcher1002:/home/otto] $ oozie job -info 0000497-191216160148723-oozie-oozi-C@13
ID : 0000497-191216160148723-oozie-oozi-C@13
------------------------------------------------------------------------------------------------------------------------------------
Action Number        : 13
Console URL          : -
Error Code           : -
Error Message        : -
External ID          : -
External Status      : -
Job ID               : 0000497-191216160148723-oozie-oozi-C
Tracker URI          : -
Created              : 2020-10-15 01:19 GMT
Nominal Time         : 2020-10-01 00:00 GMT
Status               : TIMEDOUT
Last Modified        : 2020-12-14 01:21 GMT
First Missing Dependency : hdfs://analytics-hadoop/wmf/data/wmf/banner_activity/daily/year=2020/month=12/day=19/_SUCCESS

I'm going to explicitly set a 90 day timeout on the coordinator.

Change 656185 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery@master] Explicitly set timeout in banner_activity-druid-monthly-coord

https://gerrit.wikimedia.org/r/656185

Change 656185 merged by Ottomata:
[analytics/refinery@master] Explicitly set timeout in banner_activity-druid-monthly-coord

https://gerrit.wikimedia.org/r/656185

Mentioned in SAL (#wikimedia-operations) [2021-01-14T15:16:55Z] <otto@deploy1001> Started deploy [analytics/refinery@1117f45]: Explicitly set timeout in banner_activity-druid-monthly-coord - T264358

Mentioned in SAL (#wikimedia-operations) [2021-01-14T15:19:12Z] <otto@deploy1001> Finished deploy [analytics/refinery@1117f45]: Explicitly set timeout in banner_activity-druid-monthly-coord - T264358 (duration: 02m 16s)

OH! DOH!

Nope, I had just followed the link to the oozie coordinator job that we killed in December. DOH!

https://hue.wikimedia.org/oozie/list_oozie_coordinator/0021849-201202074829419-oozie-oozi-C/ is running just fine with the original timeout. Reverting my change and not restarting the job.

Change 656192 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery@master] Revert Explicitly set timeout in banner_activity-druid-monthly-coord back to -1

https://gerrit.wikimedia.org/r/656192

Change 656192 merged by Ottomata:
[analytics/refinery@master] Revert Explicitly set timeout in banner_activity-druid-monthly-coord back to -1

https://gerrit.wikimedia.org/r/656192