Set up Analytics team in VO/Splunk OnCall
Closed, Resolved · Public

Description

This task is in the context of fully migrating off the old paging system (i.e. SMS sent by Icinga via a 3rd-party provider) and onto VictorOps / Splunk OnCall.

For Analytics specifically, we'll need to do/coordinate the following steps:

  • Onboard to VO all Analytics folks that need paging, and have them be part of the Analytics VO team (already present)
  • Create a rotation and escalation for the team, and assign a new routing key (for the icinga integration)
  • Make sure the team onboarding documentation is updated to reflect VO onboarding and to stop changing the Icinga contacts.cfg file (note that cgi.cfg will still need modification for web interface permissions, though)

Event Timeline

@razzi I've invited you to VictorOps, you should have received an email. Please follow the instructions at https://wikitech.wikimedia.org/wiki/VictorOps#Set_up_as_a_new_user and let us know if you run into any problems!

I believe with this all SREs in Analytics are in VO. My understanding is that currently there are no automated pages from Icinga towards Analytics SRE; nevertheless, we can set up a "batphone"-style rotation for manual paging (e.g. via email) and the Icinga integration (but no automatic pages will be sent, as is the case now).

@razzi @elukey @Ottomata I'd like to go over the VO setup with you and see what makes the most sense; is next week OK for a 45-minute chat?

I think the change pushed today in the puppet private repo (hash b1b32d4ab) broke the meta-monitoring validation script.

It considers any Icinga contact list with fewer than 5 members invalid: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/external-monitoring/+/refs/heads/master/icinga/check_icinga_contacts.schema#13

As of today's patch, that seems to be down to 4 :)

Perhaps we need to update meta-monitoring's idea of the world entirely?

Indeed! Filed as T273951: Update Icinga meta-monitoring to account for "no pagers" in contacts

Hi @Ottomata @razzi @elukey, did you have a chance to look into setting things up in VO? Please let us know if you need assistance.

Talked about this today; the bare minimum of what we should do now:

  • Set up a paging schedule for analytics SREs in Splunk OnCall.
  • Make sure existing alerts with analytics as contact_group will forward to Splunk OnCall.
Ottomata triaged this task as Medium priority. Mar 9 2021, 5:26 PM
Ottomata moved this task from Backlog to Q3 2020/2021 on the Analytics-Clusters board.

@fgiunchedi I have created a paging schedule meant to mirror Analytics SRE's waking hours with escalation policy "Test Escalation Policy" (we can rename once things are configured). I see we are meant to create a new routing key but it looks like I'm unable to modify the Settings > Routing Keys page. Is that something you can do?

Yes indeed, I've set up a routing key, analytics, linked to the test escalation policy (I _think_ renaming the policy can be done self-service by you when needed). I've also added a victorops-analytics contact to Icinga; adding that contact to any required contactgroup will then start the escalation. Hope that helps!

Ok, thanks @fgiunchedi. I'll try adding alerting to the Superset service for starters.
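Roughly, such a change amounts to adding the new contact group to an existing check definition in Puppet. A minimal sketch of the idea (the resource title, check command and notes URL here are illustrative placeholders, not the actual definitions from the patches below):

# Illustrative only: route an existing Superset Icinga check to the
# VictorOps escalation by including the victorops-analytics contact group.
monitoring::service { 'superset':
    description   => 'Superset web interface',
    check_command => 'check_http!superset.wikimedia.org',  # hypothetical check command
    contact_group => 'analytics,victorops-analytics',
    notes_url     => 'https://wikitech.wikimedia.org/wiki/Superset',
}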

Change 675898 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] superset: add victorops contact to superset monitoring

https://gerrit.wikimedia.org/r/675898

Change 675898 merged by Razzi:

[operations/puppet@production] superset: add victorops contact to superset monitoring

https://gerrit.wikimedia.org/r/675898

Change 677362 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] superset: Temporarily rollback victorops-analytics contact_group

https://gerrit.wikimedia.org/r/677362

Change 677362 merged by Razzi:

[operations/puppet@production] superset: Temporarily rollback victorops-analytics contact_group

https://gerrit.wikimedia.org/r/677362

Change 677617 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] nagios: add victorops-analytics contact group

https://gerrit.wikimedia.org/r/677617

Change 677617 merged by Razzi:

[operations/puppet@production] nagios: add victorops-analytics contact group

https://gerrit.wikimedia.org/r/677617

Change 677642 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] superset: Add victorops alerting for superset

https://gerrit.wikimedia.org/r/677642

Change 677642 merged by Razzi:

[operations/puppet@production] superset: Add victorops alerting for superset

https://gerrit.wikimedia.org/r/677642

Alright! @fgiunchedi I added the alert to superset, and when it alerted on Icinga, @Ottomata and I got an alert from Splunk OnCall (text message, but now I've customized it to be a push notification).

One snag I ran into was that Icinga needed the victorops-analytics contactgroup defined; I did so here. Without that defined, Icinga gave an error until I rolled the change back. I feel like that's worth documenting but I'm not sure where; @fgiunchedi do you have an idea?

Now I suppose it's up to us, @Ottomata and @elukey, to add our victorops-analytics contact group to our checks. Should we add it everywhere, or be more specific about what should page?

Nice! Glad to know things are working as expected.

Indeed, I forgot about the contactgroup/contact indirection. Good question re: documenting; I'm not sure either off the bat. Since we're migrating off Icinga in favor of Alertmanager, though, it's perhaps good enough as it is.

Change 681420 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] alerts: add victorops paging for hadoop master and kafka broker

https://gerrit.wikimedia.org/r/681420

In our ops sync we decided to add VictorOps alerting for critical alerts, and I've started adding them to Puppet. In cases where the alert is conditionally critical, such as:

nrpe::monitor_service { 'kafka':
    description   => 'Kafka Broker Server',
    nrpe_command  => '/usr/lib/nagios/plugins/check_procs -c 1:1 -C java -a "Kafka /etc/kafka/server.properties"',
    critical      => $is_critical,
    contact_group => 'victorops-analytics',  # I added this locally
    notes_url     => 'https://wikitech.wikimedia.org/wiki/Kafka/Administration',
}

Do we want to make the contact group conditionally include victorops-analytics? If so, I feel like there has to be a better way than having

if $is_critical {
  $contact_group = 'victorops-analytics'
} else {
  $contact_group = ''
}

all over the place.

For the specific problem I think you could also use a case statement or selector (I think preferably using a Hiera variable like Andrew suggested in the review, similar to is_critical). HTH!
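A minimal sketch of that idea using a Puppet selector keyed off the same flag (the contact group values here are illustrative):

# Illustrative sketch: derive the contact group from the criticality flag
# once, instead of repeating an if/else block for every check.
$contact_group = $is_critical ? {
    true    => 'analytics,victorops-analytics',
    default => 'analytics',
}
# ...then pass contact_group => $contact_group to the nrpe::monitor_service above.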

I have requested a VO/Splunk account in T286028.

There's more work to be done here, but it's been sitting in my personal backlog for a while so I'm going to release it back to the pool. The tricky bit is to find all analytics alerts (nrpe::monitor_service) that have "critical" set. Note that this is not the same as alerts with level (info, warn, error, critical, etc.) set to CRITICAL.

razzi changed the task status from Open to In Progress. Sep 16 2021, 3:26 PM
razzi moved this task from Next Up to In Progress on the Analytics-Kanban board.

Quite a lot of changes have already been made to the VictorOps integration because of the work done in T293399: Migrate the majority of the analytics cluster alerts from Icinga to AlertManager.
Namely, a lot of the monitoring::check_prometheus checks have already been moved to Alertmanager and a special page severity level has been defined.

So far only one alert has been configured for paging: HDFS capacity on the production cluster dropping below 5% free.
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-data-engineering/hadoop-hdfs.yaml#6

I have more work to complete the migration of prometheus-based checks for Eventlogging, Eventgate, Kafka, Zookeeper and the labstore MySQL hosts, but several of these checks may well include a paging severity.

That leaves the following types of checks that are in Icinga:

  • monitoring::service
  • nrpe::monitor_systemd_unit_state - which creates an object of type:
    • nrpe::monitor_service - which creates an object of type:
      • nrpe::check

I agree with @Ottomata's comment here that we probably need to make use of the $contact_group parameter, along with the $is_critical parameter, to limit pages to the right engineers.

However, it's a bit messy, because both the monitoring::service and nrpe::monitor_service types currently perform a Hiera lookup in their definitions to obtain the contact groups.

c.f. here for monitoring::service

String $contact_group = lookup('contactgroups', {'default_value' => 'admins'}), # FIXME, defines should not have calls to hiera/lookup

and here for nrpe::monitor_service

$contact_group = lookup('contactgroups', {default_value => 'admins'}),

Furthermore, this pattern is also in use for the service::node and service::uwsgi defined types.

Given all of that, I'm not entirely sure that we have the granularity we need to achieve our aims without a more comprehensive rewrite of the Icinga monitoring types.

If I understand it correctly, at present:

  • We can set contactgroups for a host or role in hiera, but that will apply those contactgroups to all defined Icinga checks on the affected host.
  • We can set contactgroups in each profile (as @razzi did in the latest patch), but that will apply to all roles which include the profile (e.g. kafka-jumbo, kafka-logging etc.); see the sketch below.
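For illustration, the second option looks roughly like this inside a profile (the check details are hypothetical, modelled on the kafka snippet earlier in this task); every role that includes the profile inherits the same contact group:

# Hypothetical profile-level check: the contact group travels with the
# profile, so it applies to every role that includes this profile.
nrpe::monitor_service { 'hadoop-hdfs-namenode':
    description   => 'Hadoop HDFS active NameNode',
    nrpe_command  => '/usr/lib/nagios/plugins/check_procs -c 1:1 -C java -a "org.apache.hadoop.hdfs.server.namenode.NameNode"',
    critical      => true,
    contact_group => 'analytics,victorops-analytics',
    notes_url     => 'https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration',
}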

Therefore, I propose that we wrap up this ticket and create one or more follow-up tickets to do the following:

  • Perform a thorough inventory of the analytics related checks and alerts
  • Decide on which require paging and configure them accordingly
  • Decide on which contact groups should be applied
  • Configure this state in puppet, or refactor the monitoring defined types accordingly

Some things can already be configured correctly, such as the Superset service, which is not used by any other teams or locations, so it's easy to set this in the manifest. However, the corner cases, where we share profiles with other teams or where certain checks (e.g. RAID status) should always go to a particular team, will take some more work.

I may have misunderstood something important about the current setup, in which case I'm happy for anyone else to advise on a different way forward.

The escalation policy is already in place in VictorOps. I have replaced @elukey on the Europe daytime shift of the Analytics SRE - Europe rotation policy. Currently it's based on a follow-the-sun policy where we're on call 7 days per week, but we can come back to this later on if and when we need to adjust it.

(screenshot attached: image.png, 85 KB)

All three of us have at least a mobile number in VictorOps, so we will be called in case of a paging-level alert.

Change 681420 abandoned by Btullis:

[operations/puppet@production] alerts: add victorops paging for hadoop master and kafka broker

Reason:

Will create a new CR for this, based on a follow-up ticket.

https://gerrit.wikimedia.org/r/681420