Set up Analytics team in VO/Splunk OnCall
Closed, Resolved · Public

Description

This task is in the context of fully migrating off the old paging system (i.e. SMS sent by Icinga via a 3rd-party provider) and onto VictorOps / Splunk OnCall.

For Analytics specifically, we'll need to do/coordinate the following steps:

  • Onboard to VO all Analytics folks that need paging, and have them be part of the Analytics VO team (already present)
  • Create a rotation and escalation for the team, and assign a new routing key (for the icinga integration)
  • Make sure the team onboarding documentation is updated to reflect VO onboarding and to stop changing the Icinga contacts.cfg file (note that cgi.cfg will still need modification for web interface permissions, though)

Event Timeline

@razzi I've invited you to VictorOps, you should have received an email. Please follow the instructions at https://wikitech.wikimedia.org/wiki/VictorOps#Set_up_as_a_new_user and let us know if you run into any problems!

I believe with this all SREs in Analytics are in VO. My understanding is that currently there are no automated pages from Icinga towards Analytics SRE; nevertheless, we can set up a "batphone"-style rotation for manual paging (e.g. via email) and the Icinga integration (but no automatic pages will be sent, as is the case now).

@razzi @elukey @Ottomata I'd like to go over the VO setup with you and see what makes the most sense; is next week OK for a 45-minute chat?

I think the change pushed today in the puppet private repo (hash b1b32d4ab) broke the meta-monitoring validation script.

It considers any Icinga contact list with fewer than 5 members invalid: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/external-monitoring/+/refs/heads/master/icinga/check_icinga_contacts.schema#13

As of today's patch, that seems to be down to 4 :)

Perhaps we need to update meta-monitoring's idea of the world entirely?

Indeed! Filed as T273951: Update Icinga meta-monitoring to account for "no pagers" in contacts

Hi @Ottomata @razzi @elukey, did you have a chance to look into setting things up in VO? Please let us know if you need assistance.

Talked about this today; the bare minimum of what we should do now:

  • Set up a paging schedule for analytics SREs in Splunk OnCall.
  • Make sure existing alerts with analytics as contact_group will forward to Splunk OnCall.
Ottomata triaged this task as Medium priority. Mar 9 2021, 5:26 PM
Ottomata moved this task from Backlog to Q3 2020/2021 on the Analytics-Clusters board.

@fgiunchedi I have created a paging schedule meant to mirror Analytics SRE's waking hours with escalation policy "Test Escalation Policy" (we can rename once things are configured). I see we are meant to create a new routing key but it looks like I'm unable to modify the Settings > Routing Keys page. Is that something you can do?

Yes indeed, I've set up a routing key, analytics, linked to the test escalation policy (I _think_ renaming the policy can be done self-service by you when needed). I've also added a victorops-analytics contact to Icinga; adding that contact to any required contactgroup will then start the escalation. Hope that helps!

Ok, thanks @fgiunchedi. I'll try adding alerting to the Superset service for starters.
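Roughly, such a change amounts to adding the new contact group to an existing check definition in Puppet. A minimal sketch of the idea (the resource title, check command and notes URL here are illustrative placeholders, not the actual definitions from the patches below):

# Illustrative only: route an existing Superset Icinga check to the
# VictorOps escalation by including the victorops-analytics contact group.
monitoring::service { 'superset':
    description   => 'Superset web interface',
    check_command => 'check_http!superset.wikimedia.org',  # hypothetical check command
    contact_group => 'analytics,victorops-analytics',
    notes_url     => 'https://wikitech.wikimedia.org/wiki/Superset',
}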

Change 675898 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] superset: add victorops contact to superset monitoring

https://gerrit.wikimedia.org/r/675898

Change 675898 merged by Razzi:

[operations/puppet@production] superset: add victorops contact to superset monitoring

https://gerrit.wikimedia.org/r/675898

Change 677362 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] superset: Temporarily rollback victorops-analytics contact_group

https://gerrit.wikimedia.org/r/677362

Change 677362 merged by Razzi:

[operations/puppet@production] superset: Temporarily rollback victorops-analytics contact_group

https://gerrit.wikimedia.org/r/677362

Change 677617 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] nagios: add victorops-analytics contact group

https://gerrit.wikimedia.org/r/677617

Change 677617 merged by Razzi:

[operations/puppet@production] nagios: add victorops-analytics contact group

https://gerrit.wikimedia.org/r/677617

Change 677642 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] superset: Add victorops alerting for superset

https://gerrit.wikimedia.org/r/677642

Change 677642 merged by Razzi:

[operations/puppet@production] superset: Add victorops alerting for superset

https://gerrit.wikimedia.org/r/677642

Alright! @fgiunchedi I added the alert to superset, and when it alerted on Icinga, @Ottomata and I got an alert from Splunk OnCall (text message, but now I've customized it to be a push notification).

One snag I ran into was that Icinga needed the victorops-analytics contactgroup defined; I did so here. Without that defined, Icinga gave an error until I rolled the change back. I feel like that's worth documenting but I'm not sure where; @fgiunchedi do you have an idea?

Now I suppose it's up to us, @Ottomata and @elukey, to add our victorops-analytics contact group to our checks. Should we add it everywhere, or be more specific about what should page?

Nice! Glad to know things are working as expected.

Indeed, I forgot about the contactgroup/contact indirection. Good question re: documenting; I'm not sure either off the bat. Since we're migrating off Icinga in favor of Alertmanager, though, it's perhaps good enough as it is.

Change 681420 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] alerts: add victorops paging for hadoop master and kafka broker

https://gerrit.wikimedia.org/r/681420

In our ops sync we decided to add VictorOps alerting for critical alerts, and I've started adding them to Puppet. In cases where the alert is conditionally critical, such as:

nrpe::monitor_service { 'kafka':
    description   => 'Kafka Broker Server',
    nrpe_command  => '/usr/lib/nagios/plugins/check_procs -c 1:1 -C java -a "Kafka /etc/kafka/server.properties"',
    critical      => $is_critical,
    contact_group => 'victorops-analytics',  # I added this locally
    notes_url     => 'https://wikitech.wikimedia.org/wiki/Kafka/Administration',
}

Do we want to make the contact group conditionally include victorops-analytics? If so, I feel like there has to be a better way than having

if $is_critical {
  $contact_group = 'victorops-analytics'
} else {
  $contact_group = ''
}

all over the place.

For the specific problem I think you could also use a case statement or selector (I think preferably using a Hiera variable like Andrew suggested in the review, similar to is_critical). HTH!
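A minimal sketch of that idea using a Puppet selector keyed off the same flag (the contact group values here are illustrative):

# Illustrative sketch: derive the contact group from the criticality flag
# once, instead of repeating an if/else block for every check.
$contact_group = $is_critical ? {
    true    => 'analytics,victorops-analytics',
    default => 'analytics',
}
# ...then pass contact_group => $contact_group to the nrpe::monitor_service above.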

I have requested a VO/Splunk account in T286028.

There's more work to be done here, but it's been sitting in my personal backlog for a while so I'm going to release it back to the pool. The tricky bit is to find all analytics alerts (nrpe::monitor_service) that have "critical" set. Note that this is not the same as alerts with level (info, warn, error, critical, etc.) set to CRITICAL.

razzi changed the task status from Open to In Progress. Sep 16 2021, 3:26 PM
razzi moved this task from Next Up to In Progress on the Analytics-Kanban board.

Quite a lot of changes have already been made to the VictorOps integration because of the work done in T293399: Migrate the majority of the analytics cluster alerts from Icinga to AlertManager.
Namely, a lot of the monitoring::check_prometheus checks have already been moved to Alertmanager and a special page severity level has been defined.

So far only one alert has been configured for paging: HDFS capacity on the production cluster dropping below 5% free.
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-data-engineering/hadoop-hdfs.yaml#6

I have more work to complete the migration of prometheus-based checks for Eventlogging, Eventgate, Kafka, Zookeeper and the labstore MySQL hosts, but several of these checks may well include a paging severity.

That leaves the following types of checks that are in Icinga:

  • monitoring::service
  • nrpe::monitor_systemd_unit_state - which creates an object of type:
    • nrpe::monitor_service - which creates an object of type:
      • nrpe::check

I agree with @Ottomata's comment here that we probably need to make use of the $contact_group parameter, along with the $is_critical parameter, to limit pages to the right engineers.

However, it's a bit messy, because both the monitoring::service and nrpe::monitor_service types currently perform a Hiera lookup in their definitions to obtain the contact groups.

c.f. here for monitoring::service

String $contact_group = lookup('contactgroups', {'default_value' => 'admins'}), # FIXME, defines should not have calls to hiera/lookup

and here for nrpe::monitor_service

$contact_group = lookup('contactgroups', {default_value => 'admins'}),

Furthermore, this pattern is also in use for the service::node and service::uwsgi defined types.

Given all of that, I'm not entirely sure that we have the granularity we need to achieve our aims without a more comprehensive rewrite of the Icinga monitoring types.

If I understand it correctly, at present:

  • We can set contactgroups for a host or role in hiera, but that will apply those contactgroups to all defined Icinga checks on the affected host.
  • We can set contactgroups in each profile (as @razzi did in the latest patch), but that will apply to all roles which include the profile (e.g. kafka-jumbo, kafka-logging etc.); see the sketch below.
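For illustration, the second option looks roughly like this inside a profile (the check details are hypothetical, modelled on the kafka snippet earlier in this task); every role that includes the profile inherits the same contact group:

# Hypothetical profile-level check: the contact group travels with the
# profile, so it applies to every role that includes this profile.
nrpe::monitor_service { 'hadoop-hdfs-namenode':
    description   => 'Hadoop HDFS active NameNode',
    nrpe_command  => '/usr/lib/nagios/plugins/check_procs -c 1:1 -C java -a "org.apache.hadoop.hdfs.server.namenode.NameNode"',
    critical      => true,
    contact_group => 'analytics,victorops-analytics',
    notes_url     => 'https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration',
}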

Therefore, I propose that we wrap up this ticket and create one or more follow-up tickets to do the following:

  • Perform a thorough inventory of the analytics related checks and alerts
  • Decide on which require paging and configure them accordingly
  • Decide on which contact groups should be applied
  • Configure this state in puppet, or refactor the monitoring defined types accordingly

Some things can already be configured correctly, such as the Superset service, which is not used by any other teams or locations, so it's easy to set this in the manifest. However, the corner cases, where we share profiles with other teams or where certain checks (e.g. RAID status) should always go to a particular team, will take some more work.

I may have misunderstood something important about the current setup, in which case I'm happy for anyone else to advise on a different way forward.

The escalation policy is already in place in VictorOps. I have replaced @elukey on the Europe daytime shift of the Analytics SRE - Europe rotation policy. Currently it's based on a follow-the-sun policy where we're on call 7 days per week, but we can come back to this later on if and when we need to adjust it.

(screenshot attached: image.png, 85 KB)

All three of us have at least a mobile number in VictorOps, so we will be called in case of a paging-level alert.

Change 681420 abandoned by Btullis:

[operations/puppet@production] alerts: add victorops paging for hadoop master and kafka broker

Reason:

Will create a new CR for this, based on a follow-up ticket.

https://gerrit.wikimedia.org/r/681420