Alarm when druid indexation fails
Closed, ResolvedPublic

Description

For issues like the one we have seen on netflow, we need an alarm when the Druid indexation fails, keeping in mind that the Spark job might succeed while the indexation still fails.

Event Timeline

If these jobs are scheduled through a scheduler like oozie, we know the status of the job because oozie gets it from yarn. When they are scheduled through systemd timers, however, we lose the ability to query yarn for the job status, and the only recourse may be to send an e-mail.
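As an illustration of the e-mail fallback mentioned above, here is a minimal Python sketch of a wrapper that runs a job command and prepares an alert message when the command exits with a non-zero status. It is not taken from the actual puppet or systemd configuration: the function name `run_and_build_alert` and the recipient address are hypothetical, and the sketch only builds the message rather than sending it.

```python
import subprocess
import sys
from email.message import EmailMessage
from typing import List, Optional


def run_and_build_alert(cmd: List[str], recipient: str) -> Optional[EmailMessage]:
    """Run cmd and return an alert e-mail if it exits non-zero, else None.

    Hypothetical helper: the name, the recipient, and the message layout
    are assumptions made for this sketch.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        return None

    msg = EmailMessage()
    msg["To"] = recipient
    msg["Subject"] = f"Job failed with exit code {result.returncode}: {cmd[0]}"
    # Attach only the tail of stderr so the message stays small.
    msg.set_content(result.stderr[-2000:] or "(no stderr output)")
    return msg


if __name__ == "__main__":
    # Stand-in for the real job: a child process that exits with status 3.
    alert = run_and_build_alert(
        [sys.executable, "-c", "import sys; sys.exit(3)"],
        "alerts@example.org",  # hypothetical address
    )
    if alert is not None:
        print(alert["Subject"])
```

This only helps when the process launched by the timer actually exits non-zero on failure; a job whose launcher returns success while the indexation fails afterwards would still go unnoticed.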

Milimetric moved this task from Incoming to Data Quality on the Analytics board.
Milimetric moved this task from Incoming to Operational Excellence on the Analytics board.
Milimetric added a project: Analytics-Kanban.
Milimetric subscribed.

We can use the refine monitor the same way we already do, so that Icinga can look for the flags.
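To make the flag-based idea concrete, here is a minimal Python sketch of a check that lists the partitions lacking a completion flag file. It is an assumption-laden illustration, not the refine monitor itself: the function name `partitions_missing_flag`, the flag file name `_SUCCESS`, and the directory layout are all hypothetical, and the sketch reads a local filesystem rather than HDFS.

```python
from pathlib import Path
from typing import Iterable, List


def partitions_missing_flag(
    base_dir: Path,
    partitions: Iterable[str],
    flag_name: str = "_SUCCESS",  # hypothetical flag file name
) -> List[str]:
    """Return the partitions under base_dir that have no flag file.

    A partition counts as done only if base_dir/<partition>/<flag_name>
    exists as a regular file; a missing directory counts as not done.
    """
    return [p for p in partitions if not (base_dir / p / flag_name).is_file()]
```

A monitoring check could call this with the partitions expected for the last few hours and raise an alert when the returned list is non-empty.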

@Ottomata since we are now running the jobs on an-launcher1002, maybe we could try to run them with spark local mode? It should add some overhead, but we have a lot of unused RAM at the moment and it will be easy to roll back if needed.

Change 617735 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set spark deploy-mode client for all the Analytics Hive to Druid jobs

https://gerrit.wikimedia.org/r/617735

Change 617735 merged by Elukey:
[operations/puppet@production] Set spark deploy-mode client for all the Analytics Hive to Druid jobs

https://gerrit.wikimedia.org/r/617735

Mentioned in SAL (#wikimedia-analytics) [2020-08-03T09:53:26Z] <elukey> move all druid-related systemd timer to spark client mode - T254493

Change 618017 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Fix config file path in Eventlogging to Druid jobs

https://gerrit.wikimedia.org/r/618017

Change 618017 merged by Elukey:
[operations/puppet@production] Fix config file path in Eventlogging to Druid jobs

https://gerrit.wikimedia.org/r/618017

IIUC we should now be covered for alarming on Druid indexations, right? Both timers and oozie coords should be able to inform us when they fail (let me know if I am missing something).

elukey set Final Story Points to 5.
elukey moved this task from Next Up to Done on the Analytics-Kanban board.
elukey added a subscriber: fdans.

Change 618227 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Fix Spark config file path for Daily EL2Druid Analyitcs jobs

https://gerrit.wikimedia.org/r/618227

Change 618227 merged by Elukey:
[operations/puppet@production] Fix Spark config file path for Daily EL2Druid Analyitcs jobs

https://gerrit.wikimedia.org/r/618227

Change 618229 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Fix daily config file path for Spark EL2Druid Analytics jobs

https://gerrit.wikimedia.org/r/618229

Change 618229 merged by Elukey:
[operations/puppet@production] Fix daily config file path for Spark EL2Druid Analytics jobs

https://gerrit.wikimedia.org/r/618229