Add Reading Infrastructure engineers to contacts for mobileapps
Closed, Resolved (Public)

Description

Please add the following users to the icinga contacts for mobileapps:

@Mholloway (mholloway-shell), @bearND (bsitzmann), @Tgr (tgr)

Event Timeline

Dzahn moved this task from In progress to Up next on the observability board.
Dzahn triaged this task as High priority. Mar 23 2018, 9:54 PM
1312,1348d1311
< 
< define contact{
<         contact_name                    mholloway
<         alias                           Michael Holloway
<         host_notification_period        24x7
<         service_notification_period     24x7
<         host_notification_options       d,r,f
<         service_notification_options    c,r,f
<         email                           ..
<         host_notification_commands      host-notify-by-email
<         service_notification_commands   notify-by-email
< }
< 
< define contact{
<         contact_name                    bearnd
<         alias                           Bernd Sitzmann
<         host_notification_period        24x7
<         service_notification_period     24x7
<         host_notification_options       d,r,f
<         service_notification_options    c,r,f
<         email                           ..
<         host_notification_commands      host-notify-by-email
<         service_notification_commands   notify-by-email
< }
< 
< define contact{
<         contact_name                    tgr
<         alias                           Gergo Tisza
<         host_notification_period        24x7
<         service_notification_period     24x7
<         host_notification_options       d,r,f
<         service_notification_options    c,r,f
<         email                          ..
<         host_notification_commands      host-notify-by-email
<         service_notification_commands   notify-by-email
< }
<

created the contacts in private repo.

now we can add new contactgroups in the public repo that use them as members

the "cn" of the LDAP user has to match the Icinga contact for the permission thing to work.

i see a possible issue with the special character in tgr's CN, but we have to try it out.

Change 421664 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: add contactgroups mobileapps, readinglists

https://gerrit.wikimedia.org/r/421664

Change 421664 merged by Dzahn:
[operations/puppet@production] icinga: add contactgroups mobileapps, readinglists

https://gerrit.wikimedia.org/r/421664

Change 421668 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: rename bsitzmann to bearnd in reading infra groups

https://gerrit.wikimedia.org/r/421668

Change 421668 merged by Dzahn:
[operations/puppet@production] icinga: rename bsitzmann to bearnd in reading infra groups

https://gerrit.wikimedia.org/r/421668

Change 421676 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga/mobileapps: add mobileapp contacts to service

https://gerrit.wikimedia.org/r/421676

Change 421676 merged by Dzahn:
[operations/puppet@production] icinga/mobileapps: add mobileapp contacts to service

https://gerrit.wikimedia.org/r/421676

@Mholloway @bearND @Tgr You should now receive email notifications if the service "LVS HTTP IPv4" on host "mobileapps.svc.eqiad.wmnet" or "mobileapps.svc.codfw.wmnet" goes down.

re: readinglists: I am not sure which Icinga check this would be. I can't find any that matches the string "readinglists" yet and when searching the puppet repo i just find a cronjob as part of mw-maintenance but that is all.

I don't think Icinga is applicable to reading lists. It's normally used to check whether services are up, but reading lists are implemented as RESTBase and action API functionality (both of which have their own checks). And it only accepts authenticated requests, so writing checks for it would be nontrivial.

Ah, no need to worry about reading lists, then. Sorry for the partially invalid task description. Looks like this is resolved, then. Thanks, @Dzahn!

Mholloway renamed this task from "Add Reading Infrastructure engineers to contacts for RI-maintained services" to "Add Reading Infrastructure engineers to contacts for mobileapps". Mar 28 2018, 5:49 PM
Mholloway updated the task description.

I think the only thing left would have been to test whether you can also execute commands like "schedule downtime" or "ACK" via the web UI on those checks you are a contact for, if you care about that. We could verify that in IRC some time. I think there might be a problem for tgr's user because of the non-US-ASCII character in his LDAP user name, but that's all, or I would have closed it already.

Besides that, yea resolved.

Sorry, sounds like it might be worth keeping open, then. I don't know my way around the Icinga web UI well enough yet to verify this for myself, and it also sounds like the potential issue with tgr's LDAP username could use double-checking.

@Mholloway Try this please:

@Dzahn I just tried the "add a comment to checked services" command myself and it says Not Authorized.

I am able to access the comment-adding and downtime-scheduling interfaces by following the above instructions. I didn't actually add a comment or schedule any downtime. Does the fact that I was able to get to the interfaces to perform the actions mean that my permissions are correct?

@Mholloway I was able to get to the comment changing interface, too. It just said not authorized after I hit the submit button on the command.

Thanks. Same here when I attempted to commit a testing comment.

Hi, can you both log in and then look at the "Logged in as " line showing up in the Icinga web UI and copy/paste that over here? Unfortunately there is a caveat where even capitalization matters, so let's compare the exact user name there.

It lets you log in with both versions, capitalized or not (auth_ldap), but at the same time the user name has to exactly match the Icinga contact name to give you those permissions to run commands.
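
The caveat described above can be illustrated with a minimal Python sketch. It assumes, as stated in the comment, that login ignores capitalization while the command-permission check compares the displayed user name with the Icinga contact name exactly. The function names and the `ldap_users` set are hypothetical and only model that behaviour; this is not Icinga or auth_ldap code.

```python
# Hypothetical model of the two checks described above; not Icinga's actual code.
CONTACTS = {"mholloway", "bearnd", "tgr"}  # contact_name values from the diff above


def can_log_in(user, ldap_users):
    """Login (auth_ldap) is assumed to ignore capitalization."""
    return user.lower() in {u.lower() for u in ldap_users}


def can_run_commands(logged_in_as, contacts=CONTACTS):
    """Command permission is assumed to need an exact, case-sensitive match."""
    return logged_in_as in contacts


# Assumed LDAP names, for illustration only.
ldap_users = {"mholloway", "bearnd", "tgr"}

for name in ("Mholloway", "BearND", "bearnd"):
    print(name, can_log_in(name, ldap_users), can_run_commands(name))
```

Under these assumptions, a user shown as "BearND" can log in but does not match the contact "bearnd", which is why comparing the exact "Logged in as" string is a useful first check.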

@Dzahn Mine says "Logged in as BearND" (upper case B).

I believe 'mholloway' is the canonical capitalization for me, but for the record I got the same result when I tried logging in as Mholloway (which is how I show up in Gerrit). The UI does note the capitalization difference:

Logged in as Mholloway

Same here. Tried with bearND but same result.

I debugged this by looking at the generated Icinga config directly on the server.

I found that i gave you wrong instructions / it's just partially solved.

Only the services called "Mobileapps LVS eqiad" and "Mobileapps LVS codfw" currently have the "mobileapps" contact group (which gives the permissions we are looking for).

The service called "LVS HTTP IPv4" on the mobileapps.svc host does not have the mobileapps contact group yet, so you don't get permissions.

You should test whether it works for those services instead. Search for "Mobileapps" and then scroll down to the service (not host) called "Mobileapps LVS eqiad". That should let you run commands on it.

Meanwhile i should check puppet code to add the additional contact groups also to the service that i mentioned in my original instructions.
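
The permission rule described in this comment can be sketched in Python as follows. This is only a model under stated assumptions: the mapping of services to contact groups reflects what the comment says about the generated config at that point, the empty `admins` membership is a placeholder because the thread does not list it, and `may_command` is a hypothetical helper, not part of Icinga.

```python
# Hypothetical data: contact groups attached to each (host, service) pair,
# as described in the comment above.
SERVICE_GROUPS = {
    ("mobileapps.svc.eqiad.wmnet", "Mobileapps LVS eqiad"): {"mobileapps"},
    ("mobileapps.svc.eqiad.wmnet", "LVS HTTP IPv4"): {"admins"},
}

# Members of each contact group; "admins" membership is not shown in this thread.
GROUP_MEMBERS = {
    "mobileapps": {"mholloway", "bearnd", "tgr"},
    "admins": set(),
}


def may_command(contact, host, service):
    """True if the contact belongs to any contact group attached to the service."""
    groups = SERVICE_GROUPS.get((host, service), set())
    return any(contact in GROUP_MEMBERS.get(group, set()) for group in groups)


host = "mobileapps.svc.eqiad.wmnet"
print(may_command("mholloway", host, "Mobileapps LVS eqiad"))
print(may_command("mholloway", host, "LVS HTTP IPv4"))
```

In this model a member of the mobileapps group can act on "Mobileapps LVS eqiad" but not on "LVS HTTP IPv4", because only the former carries the group.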

Adding a custom contact group to the "LVS HTTP IPv4" service doesn't look trivial to me.

First we have the defined type in modules/lvs/manifests/monitor_service_http_https.pp. In there we have

contact_group => $contact_group, and $contact_group = 'admins' is a default parameter. So far so good.

Then if we check where that defined type is used we get to:

modules/lvs/manifests/monitor.pp

which has this:

# This is a hack. Use a template to get a yaml structure of a ruby hash,
# then use parseyaml from puppetlabs/stdlib to get a puppet hash back
$yaml_tmp_var = template('lvs/monitor_lvs.erb')
$monitors = parseyaml($yaml_tmp_var)
create_resources(lvs::monitor_service_http_https, $monitors)

From there i got to the template used in this construct, at modules/lvs/templates/monitor_lvs.erb, which has:

if service['icinga']['contact_group']
               tmp[hostname]['contact_group'] = service['icinga']['contact_group']

All of this creates the identical monitoring check for all services, not just mobileapps.

We have to somehow set it to a custom $contact_group for just mobileapps / for each service.
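
To make the template logic above easier to follow, here is a rough Python sketch of the same idea: a per-service contact group is copied into the monitor parameters only when the service defines one, and the default 'admins' applies otherwise. The `services` dictionary, the "otherservice" entry, and the function names are hypothetical; this mirrors the quoted ERB excerpt and is not the change that was eventually merged.

```python
DEFAULT_CONTACT_GROUP = "admins"  # default parameter of the defined type, per the thread


def build_monitors(services):
    """Mirror the template: copy contact_group only when a service sets one."""
    monitors = {}
    for name, service in services.items():
        params = {}
        contact_group = service.get("icinga", {}).get("contact_group")
        if contact_group:
            params["contact_group"] = contact_group
        monitors[name] = params
    return monitors


def effective_contact_group(params):
    """Fall back to the default when no per-service group was set."""
    return params.get("contact_group", DEFAULT_CONTACT_GROUP)


# Hypothetical service data, for illustration only.
services = {
    "mobileapps": {"icinga": {"contact_group": "admins,mobileapps"}},
    "otherservice": {"icinga": {}},
}

monitors = build_monitors(services)
for name, params in monitors.items():
    print(name, effective_contact_group(params))
```

The point of the sketch is that the override has to come from the per-service data; the shared defined type and template stay identical for every service.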

@akosiaris Got a hint how i should do that the right way (have different contact groups per service) ?

I think the services themselves should be sufficient for us. We probably don't need the hosts themselves if it's too much hassle. I've tried the ones you mentioned but still nothing.

Host                          Service                 Status
mobileapps.svc.eqiad.wmnet    Mobileapps LVS eqiad    Not Authorized
mobileapps.svc.codfw.wmnet    Mobileapps LVS codfw    Not Authorized

Change 425991 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Add mobileapps to contacts for mobileapps LVS service

https://gerrit.wikimedia.org/r/425991

Same here. Tried with bearND but same result.

It's bearnd (all lowercase), per https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/nagios_common/files/contactgroups.cfg;2fb9102463703e5666972c3ba7fef0c127408f38$121

@akosiaris Got a hint how i should do that the right way (have different contact groups per service) ?

@Dzahn I think https://gerrit.wikimedia.org/r/425991 will work. At least so says PCC at https://puppet-compiler.wmflabs.org/compiler02/10921/

Having 2 LVS service definitions in icinga is confusing. The plan was to replace the LVS HTTP with the new one IIRC, but that was never completed. Relevant task was T134551.

That worked! Thank you! I added a comment to the eqiad one. Not sure how to view or remove it but I made it not persistent so hopefully it should go away automatically after some time.

No, it will not go away automatically after some period of time. Not Persistent means that it will go away if nagios is restarted. How that is in any way useful is beyond my comprehension; I've never found any use for it. FWIW we don't really restart nagios on any kind of time basis, we just reload it.

Anyway, you can view it at https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=mobileapps.svc.eqiad.wmnet&service=Mobileapps+LVS+eqiad#comments where it's also possible to delete it

I was able to add a (non-persistent) test comment to Mobileapps LVS eqiad as well.

I don't seem to be able to ACK the service endpoint health check alerts (as I just attempted to do for the current unhandled feed/announcements alert, which will be fixed with today's deployment). Is that intended?

@Mholloway I do see your ack in the Icinga UI:

feed/annoucements expected output discrepancy will be fixed during today's deployment (see a992d05)

Change 427417 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: add contactgroup for mobileapps to Hiera

https://gerrit.wikimedia.org/r/427417

Change 427417 merged by Dzahn:
[operations/puppet@production] icinga: add contactgroup for mobileapps to Hiera

https://gerrit.wikimedia.org/r/427417

Tried to add missing contactgroup to mobileapps services with the change above (copying wdqs setup) but it seems that wasn't enough to actually make it happen yet. (ran puppet on einsteinium and scb1001)

Change 429827 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] mobileapps: Add contactgroup for mobileapps

https://gerrit.wikimedia.org/r/429827

Change 429827 merged by Alexandros Kosiaris:
[operations/puppet@production] mobileapps: Add contactgroup for mobileapps

https://gerrit.wikimedia.org/r/429827

Change 425991 merged by Alexandros Kosiaris:
[operations/puppet@production] Add mobileapps to contacts for mobileapps LVS service

https://gerrit.wikimedia.org/r/425991

I think the 2 patches above have fixed both issues. The mobileapps team will get notifications for all services (including LVS) and will be able to acknowledge them. I think this resolves this.

I confirmed that on einsteinium, the services have the right contact groups now. Reverted my own change that added it in the wrong place (Hiera).