
Outstanding icinga critical on cloudcontrol-dev hosts
Closed, Resolved · Public · 0 Story Points

Event Timeline

ayounsi triaged this task as Low priority. May 30 2019, 6:35 PM
ayounsi created this task.
Restricted Application added a subscriber: Aklapper. May 30 2019, 6:35 PM
ayounsi updated the task description. May 30 2019, 6:36 PM
bd808 assigned this task to aborrero. May 30 2019, 6:59 PM
bd808 added subscribers: JHedden, bd808.

I think that @aborrero may be the best person to figure out how to fix these. Arturo, feel free to pull @JHedden in to do things with these hosts too if you have time to show him around a bit. Testing clusters are a great place to start breaking^Wfixing things.

Sure! Adding some context for @JHedden:

When we were building the codfw1dev cluster (which shares some puppet code with eqiad1), we were getting paged because services were not running properly, things were unconfigured, etc. There was a time when we didn't even have a backing DB for openstack in the codfw1dev deployment.
(More about our openstack deployments here: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Deployments) This codfw1dev deployment is fairly new. We didn't have it 2 months ago (we previously had 2 other deployments in codfw, labtest and labtestn, which no longer exist).

To reduce the amount of noise we were getting, in the form of both pages and IRC/email notifications, I merged this patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/503345
For each role in the patch, I introduced a hiera key to disable the notifications managed by the profile::base puppet class (file modules/profile/manifests/base.pp).

Not sure if you are familiar with puppet already, but please note the relationship between the hiera file and the affected role:

hieradata/role/common/wmcs/openstack/codfw1dev/control.yaml  <-- hiera keys defined in this file
role::wmcs::openstack::codfw1dev::control                    <-- are applied to servers that have this role applied
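
As a rough illustration of that relationship, here is a minimal Python sketch that derives the hiera file path from a role name. The path layout (`hieradata/role/common/...`) is inferred only from the single example above, and the helper name `hiera_path_for_role` is hypothetical; this is not how puppet itself performs the lookup.

```python
def hiera_path_for_role(role: str) -> str:
    """Map a puppet role class name to its role-level hiera file.

    Assumption: the layout seen in the example above, i.e.
    role::a::b::c -> hieradata/role/common/a/b/c.yaml
    """
    prefix = "role::"
    if not role.startswith(prefix):
        raise ValueError(f"not a role class name: {role!r}")
    # Drop the leading "role::" and turn the remaining "::" separators
    # into directory separators.
    relative = role[len(prefix):].replace("::", "/")
    return f"hieradata/role/common/{relative}.yaml"


print(hiera_path_for_role("role::wmcs::openstack::codfw1dev::control"))
# prints: hieradata/role/common/wmcs/openstack/codfw1dev/control.yaml
```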

I think the proper fix for this task is to:

  • understand how notifications work (I don't fully understand them yet!), i.e. the relationship between profile::base::notifications, icinga, and the checks defined for a given server/service.
  • figure out a proper value for the hiera keys, one that results in servers producing alerts via IRC/email only and not pages (since this is the -dev cluster)
  • check which alerts are true/false positives, and fix them, i.e., get the openstack deployment in shape by fixing any remaining missing configuration bits (remember, it is a new deployment!)
ayounsi removed a subscriber: ayounsi. May 31 2019, 1:49 PM

Thanks for the information @aborrero. I'll review the puppet data. I don't have access to Icinga or the hosts yet, so if you get to a point where I can shadow you please let me know.

Change 519656 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] pdns: add rec_control profile to codfw1dev

https://gerrit.wikimedia.org/r/519656

Change 519656 merged by Jhedden:
[operations/puppet@production] pdns: add rec_control profile to codfw1dev

https://gerrit.wikimedia.org/r/519656

  • understand how notifications work (I don't fully understand them yet!), i.e. the relationship between profile::base::notifications, icinga, and the checks defined for a given server/service.

I think the best approach to handling this with Icinga is to downtime hosts and/or disable active checks for the services we're working on. This will help reduce noise for both notifications and dashboard visibility.
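
For reference, a minimal Python sketch of what those two actions look like as Nagios-style external command lines. It assumes an Icinga 1.x setup that accepts the classic `SCHEDULE_HOST_DOWNTIME` and `DISABLE_SVC_CHECK` external commands; the host short name, service description, author, and comment below are placeholders, and the sketch only builds the strings rather than writing them to the Icinga command file.

```python
import time


def schedule_host_downtime(host, duration_s, author, comment, now=None):
    """Build a fixed-downtime external command line for one host."""
    start = int(time.time()) if now is None else int(now)
    end = start + duration_s
    # Field order: host;start;end;fixed;trigger_id;duration;author;comment
    # fixed=1 means the downtime runs exactly from start to end;
    # trigger_id=0 means it is not triggered by another downtime.
    return (
        f"[{start}] SCHEDULE_HOST_DOWNTIME;{host};{start};{end};"
        f"1;0;{duration_s};{author};{comment}"
    )


def disable_service_check(host, service, now=None):
    """Build a command line that disables active checks for one service."""
    stamp = int(time.time()) if now is None else int(now)
    return f"[{stamp}] DISABLE_SVC_CHECK;{host};{service}"


# Placeholder values, for illustration only.
print(schedule_host_downtime("cloudservices2002-dev", 3600, "jhedden", "work in progress"))
print(disable_service_check("cloudservices2002-dev", "example service"))
```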

  • figure out a proper value for the hiera keys, one that results in servers producing alerts via IRC/email only and not pages (since this is the -dev cluster)

Notifications are disabled, but cloudservices2002-dev.wikimedia.org is configured to alert via IRC only. It is not configured for SMS paging.

  • check which alerts are true/false positives, and fix them, i.e., get the openstack deployment in shape by fixing any remaining missing configuration bits (remember, it is a new deployment!)

These alerts were all relevant and are now fixed.

aborrero closed this task as Resolved. Jul 29 2019, 10:09 AM
aborrero reassigned this task from aborrero to JHedden.

I think this is done.