Migrate the majority of the analytics cluster alerts from Icinga to AlertManager
Closed, Resolved · Public

Description

This task tracks migrating the Analytics service-level Prometheus-based alerts (Hadoop/Hive/Druid/EventLogging, etc.) from Icinga to Alertmanager.

These checks are within scope:

Master checks

  • hadoop-hdfs-capacity-remaining-percent
  • hadoop-hdfs-corrupt-blocks
  • hadoop-hdfs-missing-blocks
  • hadoop-hdfs-total-files-heap
  • hadoop-yarn-unhealthy-workers
  • hadoop-hdfs-namenode-heap-usage
  • hadoop-yarn-resourcemananager-heap-usage

Worker checks

  • analytics_hadoop_hdfs_datanode (JVM heap)
  • analytics_hadoop_yarn_nodemanager (JVM heap)

Standby master checks

  • hadoop-hdfs-namenode-heap-usage
  • hadoop-yarn-resourcemananager-heap-usage

Hive checks

  • hive-metastore-heap-usage
  • hive-server-heap-usage

Druid checks

  • druid_netflow_supervisor
  • druid_coordinator_segments_unavailable_analytics
  • druid_coordinator_segments_unavailable_public

Other checks

Related Objects

Event Timeline

Doing a quick search of the puppet repo for these alerts reveals the following checks that I think are in scope.

Master checks:

  • hadoop-hdfs-capacity-remaining-percent
  • hadoop-hdfs-corrupt-blocks
  • hadoop-hdfs-missing-blocks
  • hadoop-hdfs-total-files-heap
  • hadoop-yarn-unhealthy-workers
  • hadoop-hdfs-namenode-heap-usage
  • hadoop-yarn-resourcemananager-heap-usage

Worker checks:

  • analytics_hadoop_hdfs_datanode
  • analytics_hadoop_yarn_nodemanager

Standby master checks:

  • hadoop-hdfs-namenode-heap-usage
  • hadoop-yarn-resourcemananager-heap-usage

Hive checks:

  • hive-metastore-heap-usage
  • hive-server-heap-usage

Druid checks:

  • druid_netflow_supervisor
  • druid_coordinator_segments_unavailable_analytics
  • druid_coordinator_segments_unavailable_public

Eventlogging checks:

  • eventlogging_EventError_throughput
  • eventlogging_NavigationTiming_throughput
  • eventlogging_throughput
  • eventlogging_processors_kafka_lag
  • Each of: ${eventgate_service}_validation_error_rate
  • Each of: eventgate_logging_external_latency_${site}
  • Each of: eventgate_logging_external_errors_${site}

I haven't included the following, although they could perhaps be in scope. I will check before taking any action with regard to these.

Kafka checks
Labstore checks
Zookeeper checks
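
For illustration, a migrated rule in the operations/alerts repository might look roughly like the following sketch. The metric name, threshold, labels, and group layout here are assumptions for illustration, not the exact production values:

```yaml
# Hypothetical sketch of one migrated data-engineering alert rule;
# the metric name and threshold are illustrative only.
groups:
  - name: hadoop
    rules:
      - alert: HdfsCapacityRemainingPercent
        expr: |
          100 * hadoop_namenode_capacity_remaining / hadoop_namenode_capacity_total < 15
        for: 30m
        labels:
          team: data-engineering
          severity: warning
        annotations:
          summary: 'HDFS capacity remaining is below 15% on {{ $labels.instance }}'
```

The `team` label is what lets Alertmanager route the notification to the right team, which is why adding the data-engineering team to Alertmanager (change 731884 below) had to land before the first rule.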

odimitrijevic moved this task from Incoming to Operational Excellence on the Analytics board.

Change 731884 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add the data-engineering team to Alertmanager

https://gerrit.wikimedia.org/r/731884

Change 731919 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/alerts@master] Add the first data-engineering team alert to Alertmanager

https://gerrit.wikimedia.org/r/731919

Change 731921 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove HDFS Capacity Remaining check from Icinga

https://gerrit.wikimedia.org/r/731921

I have made some progress on the migration by uploading the three patches above.

They need to be deployed in this order, and the Icinga check removal shouldn't be deployed until the Alertmanager check of the free capacity has been suitably tested.

Change 731884 merged by Btullis:

[operations/puppet@production] Add the data-engineering team to Alertmanager

https://gerrit.wikimedia.org/r/731884

Change 731919 merged by jenkins-bot:

[operations/alerts@master] Add the first data-engineering team alert to Alertmanager

https://gerrit.wikimedia.org/r/731919

Change 732623 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/alerts@master] Correct the team-data-engineering file names

https://gerrit.wikimedia.org/r/732623

Change 732623 merged by jenkins-bot:

[operations/alerts@master] Correct the team-data-engineering alerts

https://gerrit.wikimedia.org/r/732623

I have confirmed the presence of the first rule that has been added, by using an SSH tunnel and checking the Prometheus web interface.

[Screenshot: the new alerting rule visible in the Prometheus web interface (image.png, 318 KB)]

Moving on to the next set of rules now.
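
The kind of verification described above can also be scripted: over the SSH tunnel, the Prometheus HTTP API's `/api/v1/rules` endpoint lists the loaded rule groups. The sketch below embeds a trimmed sample response (the group name, file path, and rule name are illustrative, not the production values) so the parsing logic is self-contained:

```python
import json

# In practice the payload would be fetched over the SSH tunnel, e.g.:
#   curl -s http://localhost:9900/api/v1/rules
# Here a trimmed sample response is embedded for illustration.
sample = json.loads("""
{
  "status": "success",
  "data": {
    "groups": [
      {
        "name": "hadoop",
        "file": "/srv/alerts/team-data-engineering/hadoop.yaml",
        "rules": [
          {"name": "HdfsCapacityRemainingPercent", "type": "alerting", "state": "inactive"}
        ]
      }
    ]
  }
}
""")

def alert_names(payload):
    """Collect the names of all alerting rules in a /api/v1/rules payload."""
    return sorted(
        rule["name"]
        for group in payload["data"]["groups"]
        for rule in group["rules"]
        if rule.get("type") == "alerting"
    )

print(alert_names(sample))  # ['HdfsCapacityRemainingPercent']
```

Checking the returned list after each deployment confirms a rule is loaded before the corresponding Icinga check is removed.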

Change 731921 merged by Btullis:

[operations/puppet@production] Remove HDFS Capacity Remaining check from Icinga

https://gerrit.wikimedia.org/r/731921

Change 732748 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/alerts@master] Add an alert for HDFS corrupt blocks

https://gerrit.wikimedia.org/r/732748

Change 732922 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove the HDFS corrupt blocks check from Icinga

https://gerrit.wikimedia.org/r/732922

Change 732748 merged by jenkins-bot:

[operations/alerts@master] Add an alert for HDFS corrupt blocks

https://gerrit.wikimedia.org/r/732748

Change 732993 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/alerts@master] Add three more HDFS related checks to alertmanager

https://gerrit.wikimedia.org/r/732993

Change 732993 merged by jenkins-bot:

[operations/alerts@master] Add three more HDFS related checks to alertmanager

https://gerrit.wikimedia.org/r/732993

Change 734662 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove three more HDFS checks from Icinga

https://gerrit.wikimedia.org/r/734662

Something that occurred to me while reading the list (thank you @BTullis for putting it together!): the "same" check for different systems could be grouped together too. For example, "JVM heap usage" could be one alert rule with appropriate per-service selectors, rather than three different rules with the same alert name. Not a requirement, but I hope that helps!
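
The grouping suggested above might look roughly like this sketch, where a single rule fires per service via labels rather than one rule per daemon. The metric and label names here are assumptions, not the exact ones exported by the production JMX exporters:

```yaml
# Illustrative only: one rule covers NameNode, DataNode, NodeManager,
# Hive metastore, etc.; the affected service is carried in the
# 'service' label instead of being baked into separate rules.
- alert: JvmHeapUsageHigh
  expr: |
    jvm_memory_bytes_used{area="heap"} / jvm_memory_bytes_max{area="heap"} > 0.9
  for: 15m
  labels:
    team: data-engineering
    severity: warning
  annotations:
    summary: 'JVM heap usage above 90% for {{ $labels.service }} on {{ $labels.instance }}'
```

The trade-off is one rule to maintain instead of several near-identical ones, at the cost of per-service thresholds being harder to express.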

Change 735669 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/alerts@master] Add more alerts to the data-engineering team

https://gerrit.wikimedia.org/r/735669

Change 736279 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/alerts@master] Add checks for druid datasources to alertmanager

https://gerrit.wikimedia.org/r/736279

Change 736280 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove prometheus based Druid checks

https://gerrit.wikimedia.org/r/736280

I am making significant progress on this now and have several more CRs to merge.

All HDFS and YARN related checks have been addressed, along with the JVM memory usage checks.
I have also created patches to migrate the Druid checks.

I have discovered an issue in the current monitoring of the Eventgate latency, which I have described in T294911: Apparent latency warning in 90th centile of eventgate-logging-external, so I will leave this check for now and move on to the eventlogging related checks.

Change 736490 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/alerts@master] Add the first eventgate alert to Alertmanager

https://gerrit.wikimedia.org/r/736490

Change 736279 merged by Btullis:

[operations/alerts@master] Add checks for druid datasources to alertmanager

https://gerrit.wikimedia.org/r/736279

Change 736280 merged by Btullis:

[operations/puppet@production] Remove prometheus based Druid checks

https://gerrit.wikimedia.org/r/736280

Change 736490 merged by jenkins-bot:

[operations/alerts@master] Add the first eventgate alert to Alertmanager

https://gerrit.wikimedia.org/r/736490

Change 740128 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/alerts@master] Update the way that the unavailable druid segment alert works

https://gerrit.wikimedia.org/r/740128

Change 740128 merged by Btullis:

[operations/alerts@master] Update the way that the unavailable druid segment alert works

https://gerrit.wikimedia.org/r/740128

Change 735669 merged by jenkins-bot:

[operations/alerts@master] Add more alerts to the data-engineering team

https://gerrit.wikimedia.org/r/735669

Change 734662 merged by Btullis:

[operations/puppet@production] Remove three more HDFS checks from Icinga

https://gerrit.wikimedia.org/r/734662

Change 732922 merged by Btullis:

[operations/puppet@production] Remove the HDFS corrupt blocks check from Icinga

https://gerrit.wikimedia.org/r/732922

Change 744809 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove more alerts that have moved to alertmanager

https://gerrit.wikimedia.org/r/744809

Change 744813 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/alerts@master] Remove duplicate cluster variable from Druid check

https://gerrit.wikimedia.org/r/744813

Change 744813 merged by jenkins-bot:

[operations/alerts@master] Remove duplicate cluster variable from Druid check

https://gerrit.wikimedia.org/r/744813

Change 776919 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/alerts@master] Remove an-test-coord* from the Hive JVM heap memory alerts

https://gerrit.wikimedia.org/r/776919

Change 776919 merged by jenkins-bot:

[operations/alerts@master] Remove test hosts from the JVM heap memory alerts

https://gerrit.wikimedia.org/r/776919

Having reviewed this with the Data Engineering team, we would like to create separate tasks for the remainder of the Icinga checks that need migrating and close this ticket, if that's acceptable.
We've certainly become familiar with Alertmanager and it's used for any new threshold checks wherever possible, but we haven't been able to find the time to complete the full migration mentioned in the description.
Is this OK with you @fgiunchedi ?

Follow-up task(s) SGTM, @BTullis. May I ask what's the difference on your end of this task vs multiple subtasks? (for my own curiosity/education, either is 100% fine)

Thanks @fgiunchedi - I'll create those follow-ups.

We're hoping to onboard a new SRE into the team in the near future, so have identified the completion of this (phase of the) migration as potentially a good task for a new starter in the team. I just thought it would be more palatable as a smaller set of focused follow-up tickets, rather than this more sizeable tracking ticket.

It'll result in the same work getting done in the end, but hopefully it should be a little less intimidating for a new starter. :-)

Makes a lot of sense, feel free to reach out and/or point them in my/o11y direction too when the time comes. Happy to help identify easier/harder checks too.

Agreed, smaller standalone bits will be easier to chew for sure

BTullis renamed this task from Migrate analytics cluster alerts from Icinga to AlertManager to Migrate the majority of the analytics cluster alerts from Icinga to AlertManager.May 23 2022, 10:09 AM
BTullis updated the task description.

I've created the five follow-up tasks and added bi-directional links to/from this ticket, so now I'll mark this one as done and we can follow up on the remaining tasks in the near future. Thanks again for your patience @fgiunchedi.

@BTullis I'm reopening this since AFAICS the/some icinga checks haven't been removed from puppet, e.g.

modules/profile/manifests/hadoop/master.pp:        monitoring::check_prometheus { 'hadoop-yarn-unhealthy-workers':
modules/profile/manifests/hadoop/master.pp:        monitoring::check_prometheus { 'hadoop-hdfs-namenode-heap-usage':
modules/profile/manifests/hadoop/master.pp:        monitoring::check_prometheus { 'hadoop-yarn-resourcemananager-heap-usage':
modules/profile/manifests/hadoop/master/standby.pp:        monitoring::check_prometheus { 'hadoop-hdfs-namenode-heap-usage':
modules/profile/manifests/hadoop/master/standby.pp:        monitoring::check_prometheus { 'hadoop-yarn-resourcemananager-heap-usage':
modules/profile/manifests/hadoop/worker.pp:        monitoring::check_prometheus { 'analytics_hadoop_hdfs_datanode':
modules/profile/manifests/hadoop/worker.pp:        monitoring::check_prometheus { 'analytics_hadoop_yarn_nodemanager':
modules/profile/manifests/hive/metastore.pp:        monitoring::check_prometheus { 'hive-metastore-heap-usage':
modules/profile/manifests/hive/server.pp:        monitoring::check_prometheus { 'hive-server-heap-usage':

Thanks @fgiunchedi - I will rebase this and fix merge conflicts: https://gerrit.wikimedia.org/r/c/operations/puppet/+/744809

I confess that it has been sitting in the queue in Gerrit for months, but I'll fix it.

Change 744809 merged by Btullis:

[operations/puppet@production] Remove more alerts that have moved to alertmanager

https://gerrit.wikimedia.org/r/744809

Change 813833 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove trailing check_promethus checks for hadoop

https://gerrit.wikimedia.org/r/813833

Change 813833 merged by Btullis:

[operations/puppet@production] Remove trailing check_promethus checks for hadoop

https://gerrit.wikimedia.org/r/813833

I'm re-resolving this ticket now, @fgiunchedi, as I think those checks you identified have all been removed from Icinga now.
I'd like to get some time to work on the follow-up tickets myself as well, but we're also considering them as good onboarding tasks for someone.