
Alarms for virtualpageview should exist (probably in oozie) for jobs that have been idle too long
Closed, Resolved (Public)

Event Timeline

fdans moved this task from Incoming to Operational Excellence on the Analytics board.

As an FYI, I can see the following for Dec 17th in my inbox for analytics-alerts@:

OOZIE - SLA END_MISS (AppName=virtualpageview-hourly-coord, JobID=0001715-180905070129339-oozie-oozi-C@2450)

  SLA Status - END_MISS
  Job Status - WAITING
Job Details:
  App Name - virtualpageview-hourly-coord
  User - hdfs
  Job ID - 0001715-180905070129339-oozie-oozi-C@2450
  Job URL - http://an-coord1001.eqiad.wmnet:11000/oozie/?job=0001715-180905070129339-oozie-oozi-C@2450
  Parent Job ID - 0001715-180905070129339-oozie-oozi-C
  Parent Job URL - http://an-coord1001.eqiad.wmnet:11000/oozie/?job=0001715-180905070129339-oozie-oozi-C
SLA Details:
  Nominal Time - Mon Dec 17 13:40:00 UTC 2018
  Expected End Time - Mon Dec 17 19:40:00 UTC 2018
  Expected Duration (in mins) - -1
  Actual Duration (in mins) - -1

As far as I can see, we failed to follow up on it, so perhaps Oozie should keep sending emails while a job has been waiting for too long?

The Hue dashboard for workflows displays SLAs (probably not big news). I looked at the docs and saw that we can add an alarm for the duration of a job, which "seems" to be different from the SLA for end time. However, with Oozie's SLA bindings there is no way to send repeated alarms.
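For reference, here is a rough sketch of how an end-time SLA and a duration SLA could sit side by side in a coordinator, assuming the `uri:oozie:sla:0.2` schema. The six-hour `should-end` mirrors the gap between nominal time and expected end time in the email above; the `max-duration` value and the `${...}` properties (`start_time`, `stop_time`, `workflow_file`, `sla_alert_contact`) are hypothetical placeholders, not the values from our actual job.

```xml
<coordinator-app name="virtualpageview-hourly-coord"
    frequency="${coord:hours(1)}"
    start="${start_time}" end="${stop_time}" timezone="Universal"
    xmlns="uri:oozie:coordinator:0.4"
    xmlns:sla="uri:oozie:sla:0.2">
    <action>
        <workflow>
            <app-path>${workflow_file}</app-path>
        </workflow>
        <sla:info>
            <sla:nominal-time>${coord:nominalTime()}</sla:nominal-time>
            <!-- End-time SLA: the job should finish within 6 hours of nominal time -->
            <sla:should-end>${6 * HOURS}</sla:should-end>
            <!-- Duration SLA: illustrative value only (assumption) -->
            <sla:max-duration>${2 * HOURS}</sla:max-duration>
            <sla:alert-events>end_miss,duration_miss</sla:alert-events>
            <!-- Hypothetical property holding the alert address -->
            <sla:alert-contact>${sla_alert_contact}</sla:alert-contact>
        </sla:info>
    </action>
</coordinator-app>
```

Each miss event fires once per materialized action, which is consistent with the single END_MISS email above and with the lack of repeated alarms.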

I have a suggestion. We could set the <timeout>XXX</timeout> control in the Oozie coordinators, replacing XXX with the number of minutes before the materialized job times out. When a job times out, Oozie marks it as failed and sends us a failure email. I think an error email would make us react more strongly, as it means data will be missing. Also, the action might just be to re-run the job, which is simple in Hue. Finally, this is the approach we have in webrequest-load, which has proven successful.
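A minimal sketch of what the timeout control might look like in the coordinator definition. Oozie documents the `<timeout>` value in minutes, so 360 here means six hours, chosen only to match the SLA window in the email above. The dataset name, the done-flag, and the `${...}` properties (`start_time`, `stop_time`, `input_directory`, `workflow_file`) are hypothetical, and this is not copied from webrequest-load.

```xml
<coordinator-app name="virtualpageview-hourly-coord"
    frequency="${coord:hours(1)}"
    start="${start_time}" end="${stop_time}" timezone="Universal"
    xmlns="uri:oozie:coordinator:0.4">
    <controls>
        <!-- Minutes a materialized action may wait for its inputs
             before Oozie times it out (assumed value: 6 hours) -->
        <timeout>360</timeout>
    </controls>
    <datasets>
        <!-- Hypothetical hourly input dataset the job waits on -->
        <dataset name="input_hourly" frequency="${coord:hours(1)}"
            initial-instance="${start_time}" timezone="Universal">
            <uri-template>${input_directory}/year=${YEAR}/month=${MONTH}/day=${DAY}/hour=${HOUR}</uri-template>
            <done-flag>_SUCCESS</done-flag>
        </dataset>
    </datasets>
    <input-events>
        <data-in name="input" dataset="input_hourly">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>${workflow_file}</app-path>
        </workflow>
    </action>
</coordinator-app>
```

With a setting like this, an action still in WAITING after six hours would stop waiting and surface as a timed-out action instead of sitting idle indefinitely.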

mforns lowered the priority of this task from High to Medium. Mar 25 2019, 4:30 PM