This task tracks onboarding Analytics: service-level hadoop/hive/druid/eventlogging/etc Prometheus-based alerts from Icinga to AlertManager.
These checks are within scope:
__Master checks__
- [x] hadoop-hdfs-capacity-remaining-percent
- [x] hadoop-hdfs-corrupt-blocks
- [x] hadoop-hdfs-missing-blocks
- [x] hadoop-hdfs-total-files-heap
- [ ] hadoop-yarn-unhealthy-workers
- [ ] hadoop-hdfs-namenode-heap-usage
- [ ] hadoop-yarn-resourcemananager-heap-usage
__Worker checks__
- [ ] analytics_hadoop_hdfs_datanode
- [ ] analytics_hadoop_yarn_nodemanager
__Standby master checks__
- [ ] hadoop-hdfs-namenode-heap-usage
- [ ] hadoop-yarn-resourcemananager-heap-usage
__Hive checks__
- [ ] hive-metastore-heap-usage
- [ ] hive-server-heap-usage
__Druid checks__
- [ ] druid_netflow_supervisor
- [ ] druid_coordinator_segments_unavailable_analytics
- [ ] druid_coordinator_segments_unavailable_public
__Eventlogging checks__
- [ ] eventlogging_EventError_throughput
- [ ] eventlogging_NavigationTiming_throughput
- [ ] eventlogging_throughput
- [ ] eventlogging_processors_kafka_lag
- [ ] Each of: ${eventgate_service}_validation_error_rate
- [ ] Each of: eventgate_logging_external_latency_${site}
- [ ] Each of: eventgate_logging_external_errors_${site}
I am not certain whether or not the following are in scope, but they might be:
- [ ] Kafka checks
- [ ] Labstore checks
- [ ] Zookeeper checks