Alarm when druid indexation fails
For issues like the one we have seen on netflow, we need an alarm when the druid indexation fails, keeping in mind that the spark job might succeed while the indexation still fails.
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | elukey | T254383 Investigate why netflow hive_to_druid job is so slow |
| Resolved | | elukey | T254493 Alarm when druid indexation fails |
If these jobs are scheduled through a scheduler like oozie, we know the status of the job because oozie gets it from yarn. When they are scheduled through systemd timers, we lose the ability to query yarn for the job status, and maybe the only recourse here is to send an e-mail.
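One way to surface an indexation failure from a systemd timer, independently of yarn, is to ask the Druid Overlord for the indexing task status and exit non-zero when it is not `SUCCESS`, so that the unit is marked as failed. The following is only a minimal sketch of that idea, not what the jobs in this task actually do: the Overlord address, the function names and the way the task id reaches the script are all assumptions made for illustration.

```python
import json
import sys
import urllib.request

# Hypothetical Overlord address (8090 is the Druid default port); adjust as needed.
OVERLORD_URL = "http://localhost:8090"


def task_state(status_response: dict) -> str:
    """Extract the task state from an Overlord task-status response."""
    status = status_response.get("status") or {}
    # Assumption: depending on the Druid version the state is exposed as
    # "statusCode" and/or "status"; accept either one.
    return status.get("statusCode") or status.get("status") or "UNKNOWN"


def exit_code_for(state: str) -> int:
    """Return 0 only for SUCCESS, so any other state fails the systemd unit."""
    return 0 if state == "SUCCESS" else 1


def fetch_task_status(task_id: str, base_url: str = OVERLORD_URL) -> dict:
    """Query the Overlord task status endpoint for one indexing task."""
    url = f"{base_url}/druid/indexer/v1/task/{task_id}/status"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def main(argv: list) -> int:
    """Check one task id given on the command line and return an exit code."""
    if len(argv) != 2:
        print("usage: check_druid_task.py <task_id>", file=sys.stderr)
        return 2
    state = task_state(fetch_task_status(argv[1]))
    print(f"druid indexing task {argv[1]}: {state}")
    return exit_code_for(state)
```

To use it as a script, call `sys.exit(main(sys.argv))` at the bottom of the file. Note that the sketch treats every state other than `SUCCESS` as a failure, including `RUNNING`, so it should only be run once the indexing task is expected to have finished.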
We can use refine monitor the same way we already use it, so Icinga can look for the flags.
@Ottomata since we are now running the jobs on an-launcher1002, maybe we could try to run them with spark local mode? It should add some overhead, but we have a lot of unused ram atm and it will be easy to roll back if needed.
Change 617735 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set spark deploy-mode client for all the Analytics Hive to Druid jobs
Change 617735 merged by Elukey:
[operations/puppet@production] Set spark deploy-mode client for all the Analytics Hive to Druid jobs
Mentioned in SAL (#wikimedia-analytics) [2020-08-03T09:53:26Z] <elukey> move all druid-related systemd timer to spark client mode - T254493
Change 618017 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Fix config file path in Eventlogging to Druid jobs
Change 618017 merged by Elukey:
[operations/puppet@production] Fix config file path in Eventlogging to Druid jobs
IIUC we should be covered with alarming druid indexations, right? Both timers and oozie coords should be able to inform us when they fail (let me know if I am missing something).
Change 618227 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Fix Spark config file path for Daily EL2Druid Analyitcs jobs
Change 618227 merged by Elukey:
[operations/puppet@production] Fix Spark config file path for Daily EL2Druid Analyitcs jobs
Change 618229 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Fix daily config file path for Spark EL2Druid Analytics jobs
Change 618229 merged by Elukey:
[operations/puppet@production] Fix daily config file path for Spark EL2Druid Analytics jobs