This task tracks onboarding Analytics: service-level hadoop/hive/druid/eventlogging/etc Prometheus-based alerts from Icinga to AlertManager.
These checks are within scope:
Master checks
- hadoop-hdfs-capacity-remaining-percent
- hadoop-hdfs-corrupt-blocks
- hadoop-hdfs-missing-blocks
- hadoop-hdfs-total-files-heap
- hadoop-yarn-unhealthy-workers
- hadoop-hdfs-namenode-heap-usage
- hadoop-yarn-resourcemananager-heap-usage
Worker checks
- analytics_hadoop_hdfs_datanode (JVM heap)
- analytics_hadoop_yarn_nodemanager (JVM heap)
Standby master checks
- hadoop-hdfs-namenode-heap-usage
- hadoop-yarn-resourcemananager-heap-usage
Hive checks
- hive-metastore-heap-usage
- hive-server-heap-usage
Druid checks
- druid_netflow_supervisor
- druid_coordinator_segments_unavailable_analytics
- druid_coordinator_segments_unavailable_public
Other checks
- Eventlogging checks: T309007
- Eventgate checks: T309009
- Kafka checks: T309010
- Labstore checks: T309011
- Zookeeper checks: T309012