Migrate the majority of the analytics cluster alerts from Icinga to AlertManager
Closed, Resolved · Public

Description

This task tracks migrating the Analytics service-level Prometheus-based alerts (Hadoop/Hive/Druid/EventLogging, etc.) from Icinga to Alertmanager.

These checks are within scope:

Master checks

  • hadoop-hdfs-capacity-remaining-percent
  • hadoop-hdfs-corrupt-blocks
  • hadoop-hdfs-missing-blocks
  • hadoop-hdfs-total-files-heap
  • hadoop-yarn-unhealthy-workers
  • hadoop-hdfs-namenode-heap-usage
  • hadoop-yarn-resourcemananager-heap-usage

Worker checks

  • analytics_hadoop_hdfs_datanode (JVM heap)
  • analytics_hadoop_yarn_nodemanager (JVM heap)

Standby master checks

  • hadoop-hdfs-namenode-heap-usage
  • hadoop-yarn-resourcemananager-heap-usage

Hive checks

  • hive-metastore-heap-usage
  • hive-server-heap-usage

Druid checks

  • druid_netflow_supervisor
  • druid_coordinator_segments_unavailable_analytics
  • druid_coordinator_segments_unavailable_public

Other checks

Related Objects

Event Timeline

Doing a quick search of the puppet repo for these alerts reveals the following checks that I think are in scope.

Master checks:

  • hadoop-hdfs-capacity-remaining-percent
  • hadoop-hdfs-corrupt-blocks
  • hadoop-hdfs-missing-blocks
  • hadoop-hdfs-total-files-heap
  • hadoop-yarn-unhealthy-workers
  • hadoop-hdfs-namenode-heap-usage
  • hadoop-yarn-resourcemananager-heap-usage

Worker checks:

  • analytics_hadoop_hdfs_datanode
  • analytics_hadoop_yarn_nodemanager

Standby master checks:

  • hadoop-hdfs-namenode-heap-usage
  • hadoop-yarn-resourcemananager-heap-usage

Hive checks:

  • hive-metastore-heap-usage
  • hive-server-heap-usage

Druid checks:

  • druid_netflow_supervisor
  • druid_coordinator_segments_unavailable_analytics
  • druid_coordinator_segments_unavailable_public

Eventlogging checks:

  • eventlogging_EventError_throughput
  • eventlogging_NavigationTiming_throughput
  • eventlogging_throughput
  • eventlogging_processors_kafka_lag
  • Each of: ${eventgate_service}_validation_error_rate
  • Each of: eventgate_logging_external_latency_${site}
  • Each of: eventgate_logging_external_errors_${site}

I haven't included the following, although they could perhaps be in scope. I will check before taking any action with regard to these.

Kafka checks
Labstore checks
Zookeeper checks
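
For illustration, a migrated rule in the operations/alerts repository might look roughly like the following sketch. The metric name, threshold, labels, and group layout here are assumptions for illustration, not the exact production values:

```yaml
# Hypothetical sketch of one migrated data-engineering alert rule;
# the metric name and threshold are illustrative only.
groups:
  - name: hadoop
    rules:
      - alert: HdfsCapacityRemainingPercent
        expr: |
          100 * hadoop_namenode_capacity_remaining / hadoop_namenode_capacity_total < 15
        for: 30m
        labels:
          team: data-engineering
          severity: warning
        annotations:
          summary: 'HDFS capacity remaining is below 15% on {{ $labels.instance }}'
```

The `team` label is what lets Alertmanager route the notification to the right team, which is why adding the data-engineering team to Alertmanager (change 731884 below) had to land before the first rule.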

odimitrijevic moved this task from Incoming to Operational Excellence on the Analytics board.

Change 731884 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add the data-engineering team to Alertmanager

https://gerrit.wikimedia.org/r/731884

Change 731919 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/alerts@master] Add the first data-engineering team alert to Alertmanager

https://gerrit.wikimedia.org/r/731919

Change 731921 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove HDFS Capacity Remaining check from Icinga

https://gerrit.wikimedia.org/r/731921

I have made some progress on the migration by uploading the three patches above.

They need to be deployed in this order, and the Icinga check removal shouldn't be deployed until the Alertmanager check of the free capacity has been suitably tested.

Change 731884 merged by Btullis:

[operations/puppet@production] Add the data-engineering team to Alertmanager

https://gerrit.wikimedia.org/r/731884

Change 731919 merged by jenkins-bot:

[operations/alerts@master] Add the first data-engineering team alert to Alertmanager

https://gerrit.wikimedia.org/r/731919

Change 732623 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/alerts@master] Correct the team-data-engineering file names

https://gerrit.wikimedia.org/r/732623

Change 732623 merged by jenkins-bot:

[operations/alerts@master] Correct the team-data-engineering alerts

https://gerrit.wikimedia.org/r/732623

I have confirmed the presence of the first rule that has been added, by using an SSH tunnel and checking the Prometheus web interface.

[Screenshot: the new alerting rule visible in the Prometheus web interface (image.png, 318 KB)]

Moving on to the next set of rules now.
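
The kind of verification described above can also be scripted: over the SSH tunnel, the Prometheus HTTP API's `/api/v1/rules` endpoint lists the loaded rule groups. The sketch below embeds a trimmed sample response (the group name, file path, and rule name are illustrative, not the production values) so the parsing logic is self-contained:

```python
import json

# In practice the payload would be fetched over the SSH tunnel, e.g.:
#   curl -s http://localhost:9900/api/v1/rules
# Here a trimmed sample response is embedded for illustration.
sample = json.loads("""
{
  "status": "success",
  "data": {
    "groups": [
      {
        "name": "hadoop",
        "file": "/srv/alerts/team-data-engineering/hadoop.yaml",
        "rules": [
          {"name": "HdfsCapacityRemainingPercent", "type": "alerting", "state": "inactive"}
        ]
      }
    ]
  }
}
""")

def alert_names(payload):
    """Collect the names of all alerting rules in a /api/v1/rules payload."""
    return sorted(
        rule["name"]
        for group in payload["data"]["groups"]
        for rule in group["rules"]
        if rule.get("type") == "alerting"
    )

print(alert_names(sample))  # ['HdfsCapacityRemainingPercent']
```

Checking the returned list after each deployment confirms a rule is loaded before the corresponding Icinga check is removed.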

Change 731921 merged by Btullis:

[operations/puppet@production] Remove HDFS Capacity Remaining check from Icinga

https://gerrit.wikimedia.org/r/731921

Change 732748 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/alerts@master] Add an alert for HDFS corrupt blocks

https://gerrit.wikimedia.org/r/732748

Change 732922 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove the HDFS corrupt blocks check from Icinga

https://gerrit.wikimedia.org/r/732922

Change 732748 merged by jenkins-bot:

[operations/alerts@master] Add an alert for HDFS corrupt blocks

https://gerrit.wikimedia.org/r/732748

Change 732993 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/alerts@master] Add three more HDFS related checks to alertmanager

https://gerrit.wikimedia.org/r/732993

Change 732993 merged by jenkins-bot:

[operations/alerts@master] Add three more HDFS related checks to alertmanager

https://gerrit.wikimedia.org/r/732993

Change 734662 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove three more HDFS checks from Icinga

https://gerrit.wikimedia.org/r/734662

Something that occurred to me while reading the list (thank you @BTullis for putting it together!): the "same" check for different systems could be grouped together too. For example, "JVM heap usage" could be one alert rule with appropriate per-service selectors, rather than three different rules with the same alert name. Not a requirement, but I hope that helps!
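
The grouping suggested above might look roughly like this sketch, where a single rule fires per service via labels rather than one rule per daemon. The metric and label names here are assumptions, not the exact ones exported by the production JMX exporters:

```yaml
# Illustrative only: one rule covers NameNode, DataNode, NodeManager,
# Hive metastore, etc.; the affected service is carried in the
# 'service' label instead of being baked into separate rules.
- alert: JvmHeapUsageHigh
  expr: |
    jvm_memory_bytes_used{area="heap"} / jvm_memory_bytes_max{area="heap"} > 0.9
  for: 15m
  labels:
    team: data-engineering
    severity: warning
  annotations:
    summary: 'JVM heap usage above 90% for {{ $labels.service }} on {{ $labels.instance }}'
```

The trade-off is one rule to maintain instead of several near-identical ones, at the cost of per-service thresholds being harder to express.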

Change 735669 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/alerts@master] Add more alerts to the data-engineering team

https://gerrit.wikimedia.org/r/735669

Change 736279 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/alerts@master] Add checks for druid datasources to alertmanager

https://gerrit.wikimedia.org/r/736279

Change 736280 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove prometheus based Druid checks

https://gerrit.wikimedia.org/r/736280

I am making significant progress on this now and have several more CRs to merge.

All HDFS and YARN related checks have been addressed, along with the JVM memory usage checks.
I have also created patches to migrate the Druid checks.

I have discovered an issue in the current monitoring of the Eventgate latency, which I have described in T294911: Apparent latency warning in 90th centile of eventgate-logging-external, so I will leave this check for now and move on to the eventlogging related checks.

Change 736490 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/alerts@master] Add the first eventgate alert to Alertmanager

https://gerrit.wikimedia.org/r/736490

Change 736279 merged by Btullis:

[operations/alerts@master] Add checks for druid datasources to alertmanager

https://gerrit.wikimedia.org/r/736279

Change 736280 merged by Btullis:

[operations/puppet@production] Remove prometheus based Druid checks

https://gerrit.wikimedia.org/r/736280

Change 736490 merged by jenkins-bot:

[operations/alerts@master] Add the first eventgate alert to Alertmanager

https://gerrit.wikimedia.org/r/736490

Change 740128 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/alerts@master] Update the way that the unavailable druid segment alert works

https://gerrit.wikimedia.org/r/740128

Change 740128 merged by Btullis:

[operations/alerts@master] Update the way that the unavailable druid segment alert works

https://gerrit.wikimedia.org/r/740128

Change 735669 merged by jenkins-bot:

[operations/alerts@master] Add more alerts to the data-engineering team

https://gerrit.wikimedia.org/r/735669

Change 734662 merged by Btullis:

[operations/puppet@production] Remove three more HDFS checks from Icinga

https://gerrit.wikimedia.org/r/734662

Change 732922 merged by Btullis:

[operations/puppet@production] Remove the HDFS corrupt blocks check from Icinga

https://gerrit.wikimedia.org/r/732922

Change 744809 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove more alerts that have moved to alertmanager

https://gerrit.wikimedia.org/r/744809

Change 744813 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/alerts@master] Remove duplicate cluster variable from Druid check

https://gerrit.wikimedia.org/r/744813

Change 744813 merged by jenkins-bot:

[operations/alerts@master] Remove duplicate cluster variable from Druid check

https://gerrit.wikimedia.org/r/744813

Change 776919 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/alerts@master] Remove an-test-coord* from the Hive JVM heap memory alerts

https://gerrit.wikimedia.org/r/776919

Change 776919 merged by jenkins-bot:

[operations/alerts@master] Remove test hosts from the JVM heap memory alerts

https://gerrit.wikimedia.org/r/776919

Having reviewed this with the Data Engineering team, we would like to create separate tasks for the remainder of the Icinga checks that need migrating and close this ticket, if that's acceptable.
We've certainly become familiar with Alertmanager and it's used for any new threshold checks wherever possible, but we haven't been able to find the time to complete the full migration mentioned in the description.
Is this OK with you @fgiunchedi ?

Follow-up task(s) SGTM, @BTullis. May I ask what's the difference on your end of this task vs multiple subtasks? (for my own curiosity/education, either is 100% fine)

Thanks @fgiunchedi - I'll create those follow-ups.

We're hoping to onboard a new SRE into the team in the near future, so have identified the completion of this (phase of the) migration as potentially a good task for a new starter in the team. I just thought it would be more palatable as a smaller set of focused follow-up tickets, rather than this more sizeable tracking ticket.

It'll result in the same work getting done in the end, but hopefully it should be a little less intimidating for a new starter. :-)

Makes a lot of sense, feel free to reach out and/or point them in my/o11y direction too when the time comes. Happy to help identify easier/harder checks too.

Agreed, smaller standalone bits will be easier to chew for sure

BTullis renamed this task from Migrate analytics cluster alerts from Icinga to AlertManager to Migrate the majority of the analytics cluster alerts from Icinga to AlertManager.May 23 2022, 10:09 AM
BTullis updated the task description.

I've created the five follow-up tasks and added bi-directional links to/from this ticket, so now I'll mark this one as done and we can follow up on the remaining tasks in the near future. Thanks again for your patience @fgiunchedi.

@BTullis I'm reopening this since AFAICS the/some icinga checks haven't been removed from puppet, e.g.

modules/profile/manifests/hadoop/master.pp:        monitoring::check_prometheus { 'hadoop-yarn-unhealthy-workers':
modules/profile/manifests/hadoop/master.pp:        monitoring::check_prometheus { 'hadoop-hdfs-namenode-heap-usage':
modules/profile/manifests/hadoop/master.pp:        monitoring::check_prometheus { 'hadoop-yarn-resourcemananager-heap-usage':
modules/profile/manifests/hadoop/master/standby.pp:        monitoring::check_prometheus { 'hadoop-hdfs-namenode-heap-usage':
modules/profile/manifests/hadoop/master/standby.pp:        monitoring::check_prometheus { 'hadoop-yarn-resourcemananager-heap-usage':
modules/profile/manifests/hadoop/worker.pp:        monitoring::check_prometheus { 'analytics_hadoop_hdfs_datanode':
modules/profile/manifests/hadoop/worker.pp:        monitoring::check_prometheus { 'analytics_hadoop_yarn_nodemanager':
modules/profile/manifests/hive/metastore.pp:        monitoring::check_prometheus { 'hive-metastore-heap-usage':
modules/profile/manifests/hive/server.pp:        monitoring::check_prometheus { 'hive-server-heap-usage':

Thanks @fgiunchedi - I will rebase this and fix merge conflicts: https://gerrit.wikimedia.org/r/c/operations/puppet/+/744809

I confess that it has been sitting in the queue in Gerrit for months, but I'll fix it.

Change 744809 merged by Btullis:

[operations/puppet@production] Remove more alerts that have moved to alertmanager

https://gerrit.wikimedia.org/r/744809

Change 813833 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove trailing check_promethus checks for hadoop

https://gerrit.wikimedia.org/r/813833

Change 813833 merged by Btullis:

[operations/puppet@production] Remove trailing check_promethus checks for hadoop

https://gerrit.wikimedia.org/r/813833

I'm re-resolving this ticket now, @fgiunchedi, as I think those checks you identified have all been removed from Icinga now.
I'd like to get some time to work on the follow-up tickets myself as well, but we're also considering them as good onboarding tasks for someone.