
Get smalyshev permissions to icinga enough to control monitoring for wdqs_eqiad group
Closed, Resolved (Public)

Description

I need access to the monitoring settings on Icinga in order to control notification settings on the wdqs_eqiad hosts (wdqs1001 and wdqs1002), e.g. to disable notifications when one of the hosts is taken down for maintenance.

Event Timeline

Smalyshev set the priority of this task to Needs Triage.
Smalyshev updated the task description.
Smalyshev subscribed.

@Smalyshev As a first step, can we confirm that a basic login on icinga.wikimedia.org works for you? Read-only access should already work and it should be your LDAP/Wikitech/Labs user as long as you are in the WMF group. Does that work? If yes, is the username exactly as here on phabricator? Permissions to send commands would be a separate thing and not handled via LDAP but require a puppet change.

@Dzahn, yes, I can log in to icinga and see stuff, but not control notifications. The username is "smalyshev".

@Smalyshev Ok, great. The next step: to be able to run commands (schedule downtime, disable notifications, ACKnowledge issues, etc.) and also to get notifications (email, paging), you have to be a contact (user) in the Icinga context.

We keep these in a private repo because they contain phone numbers. In your case I have just added email and skipped the phone part for now. That can be changed later if desired.

Because it's just email I left the notification period at 24x7, but we can also use custom time periods here.

Even without any notification options we would need the "contact" to exist in order to give it permissions, so I added this.

define contact{
        contact_name                    smalyshev
        alias                           Stas Malychev
        host_notification_period        24x7
        service_notification_period     24x7
        host_notification_options       d,r,f
        service_notification_options    c,r,f
        email                           smalyshev@wikimedia.org
        address1                        smalyshev@wikimedia.org
        host_notification_commands      host-notify-by-email
        service_notification_commands   notify-by-email
}

The options mean that you get notified when hosts are d (down), r (recovered) or f (flapping), and when services are c (critical), r (recovered) or f (flapping).

Now this icinga contact can be used in puppet classes that apply monitoring for wdqs servers so that it becomes attached to the right hosts and services. Hopefully that already solves it because being a contact in Icinga gives you the permissions for these services. (as opposed to global permissions for all services and hosts that would be specified in cgi.cfg)
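For illustration, a contact is usually attached to hosts and services through a contact group. Below is a minimal sketch in Icinga 1 object syntax, assuming a group named wdqs-admins (the name used by the changes later in this task) with smalyshev as its only member. The alias text and the member list are assumptions, not the content of the actual patch.

```cfg
define contactgroup{
        contactgroup_name               wdqs-admins
        alias                           WDQS admins     ; hypothetical alias
        members                         smalyshev       ; assumed membership
}
```

A host or service that lists wdqs-admins in its contact_groups then treats every member of the group as a contact for that object only, which is what keeps the permissions scoped to wdqs.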

@Dzahn see T105229#1600235 for being able to send the commands as a contact :)

Dzahn triaged this task as Medium priority.Sep 3 2015, 2:01 AM

Change 237499 had a related patch set uploaded (by Dzahn):
icinga: add new contact group wdqs-admins

https://gerrit.wikimedia.org/r/237499

Change 237499 merged by Dzahn:
icinga: add new contact group wdqs-admins

https://gerrit.wikimedia.org/r/237499

Change 237504 had a related patch set uploaded (by Dzahn):
icinga: set admin groups for common/wdqs in hiera

https://gerrit.wikimedia.org/r/237504

Change 237504 merged by Dzahn:
icinga: set contact groups for common/wdqs in hiera

https://gerrit.wikimedia.org/r/237504

Change 237508 had a related patch set uploaded (by Dzahn):
wdqs: set icinga contact groups, add wdqs-admins

https://gerrit.wikimedia.org/r/237508

Change 237508 merged by Dzahn:
wdqs: set icinga contact groups, add wdqs-admins

https://gerrit.wikimedia.org/r/237508

This needs https://gerrit.wikimedia.org/r/#/c/235065/ it appears. Johnlewis said @RobH was going to review.

I merged https://gerrit.wikimedia.org/r/#/c/235065/ . There is no difference on neon, not in a bad way but also not in a good way.

@JohnLewis see comment on gerrit. the override did not work before?!

We believe this is blocked by Ops, who are currently attending their offsite. This task isn't urgent and can wait until the offsite has concluded.

Offsite is next week. My understanding is this is blocked on figuring out why my patch didn't change things. Will look later with ops help.

Ah. Excellent! Thank you.

Change 244722 had a related patch set uploaded (by Dzahn):
icinga: ensure hiera lookups for all contact_group defs

https://gerrit.wikimedia.org/r/244722

Change 244722 merged by Dzahn:
icinga: ensure hiera lookups for all contact_group defs

https://gerrit.wikimedia.org/r/244722

@Dzahn any progress on this?

@Smalyshev Yes, now there is. Thanks to John Lewis, the override for contact groups via hieradata, which was broken, works now. That gets us an important step closer: now we can set the right contacts, and those should determine the permissions to send commands.

Change 244813 had a related patch set uploaded (by Dzahn):
fix hiera key for wdqs (contactgroups)

https://gerrit.wikimedia.org/r/244813

Change 244813 merged by Dzahn:
fix hiera key for wdqs (contactgroups)

https://gerrit.wikimedia.org/r/244813

Now, in the Icinga config, we can see that our new contact group has been added to the services on the wdqs hosts.

root@neon:/etc/icinga# grep -B3 -A1 wdqs-admins /etc/icinga/puppet_services.cfg | grep -v freshness | grep -v check_period

check_command nrpe_check!check_check_dhclient!10
contact_groups admins,wdqs-admins
host_name wdqs1001
--
check_command nrpe_check!check_check_eth!10
contact_groups admins,wdqs-admins
host_name wdqs1001
--
check_command nrpe_check!check_check_salt_minion!10
contact_groups admins,wdqs-admins
host_name wdqs1001
--
check_command nrpe_check!check_disk_space!10
contact_groups admins,wdqs-admins
host_name wdqs1001
--
check_command nrpe_check!check_dpkg!10
contact_groups admins,wdqs-admins
host_name wdqs1001
--
check_command check_ntp_time!0.5!1
contact_groups admins,wdqs-admins
host_name wdqs1001
--
check_command nrpe_check!check_puppet_checkpuppetrun!10
contact_groups admins,wdqs-admins
host_name wdqs1001
--
check_command nrpe_check!check_raid!10
contact_groups admins,wdqs-admins
host_name wdqs1001
--
check_command check_ssh
contact_groups admins,wdqs-admins
host_name wdqs1001
--
check_command nrpe_check!check_WDQS_Blazegraph_process!10
contact_groups admins,wdqs-admins
host_name wdqs1001
--
check_command check_http!query.wikidata.org!/!Welcome
contact_groups admins,wdqs-admins
host_name wdqs1001
--
check_command check_http!query.wikidata.org!/bigdata/namespace/wdq/sparql?query=prefix%20schema:%20%3Chttp://schema.org/%3E%20SELECT%20*%20WHERE%20%7B%3Chttp://www.wikidata.org%3E%20schema:dateModified%20?y%7D&format=json!"xsd:dateTime"
contact_groups admins,wdqs-admins
host_name wdqs1001
--
check_command nrpe_check!check_WDQS_Internal_HTTP_endpoint!10
contact_groups admins,wdqs-admins
host_name wdqs1001
--
check_command nrpe_check!check_WDQS_Local_Blazegraph_endpoint!10
contact_groups admins,wdqs-admins
host_name wdqs1001
--
check_command nrpe_check!check_WDQS_Updater_process!10
contact_groups admins,wdqs-admins
host_name wdqs1001
--
check_command nrpe_check!check_check_dhclient!10
contact_groups admins,wdqs-admins
host_name wdqs1002
--
check_command nrpe_check!check_check_eth!10
contact_groups admins,wdqs-admins
host_name wdqs1002
--
check_command nrpe_check!check_check_salt_minion!10
contact_groups admins,wdqs-admins
host_name wdqs1002
--
check_command nrpe_check!check_disk_space!10
contact_groups admins,wdqs-admins
host_name wdqs1002
--
check_command nrpe_check!check_dpkg!10
contact_groups admins,wdqs-admins
host_name wdqs1002
--
check_command check_ntp_time!0.5!1
contact_groups admins,wdqs-admins
host_name wdqs1002
--
check_command nrpe_check!check_puppet_checkpuppetrun!10
contact_groups admins,wdqs-admins
host_name wdqs1002
--
check_command nrpe_check!check_raid!10
contact_groups admins,wdqs-admins
host_name wdqs1002
--
check_command check_ssh
contact_groups admins,wdqs-admins
host_name wdqs1002
--
check_command nrpe_check!check_WDQS_Blazegraph_process!10
contact_groups admins,wdqs-admins
host_name wdqs1002
--
check_command check_http!query.wikidata.org!/!Welcome
contact_groups admins,wdqs-admins
host_name wdqs1002
--
check_command check_http!query.wikidata.org!/bigdata/namespace/wdq/sparql?query=prefix%20schema:%20%3Chttp://schema.org/%3E%20SELECT%20*%20WHERE%20%7B%3Chttp://www.wikidata.org%3E%20schema:dateModified%20?y%7D&format=json!"xsd:dateTime"
contact_groups admins,wdqs-admins
host_name wdqs1002
--
check_command nrpe_check!check_WDQS_Internal_HTTP_endpoint!10
contact_groups admins,wdqs-admins
host_name wdqs1002
--
check_command nrpe_check!check_WDQS_Local_Blazegraph_endpoint!10
contact_groups admins,wdqs-admins
host_name wdqs1002
--
check_command nrpe_check!check_WDQS_Updater_process!10
contact_groups admins,wdqs-admins
host_name wdqs1002

And finally I added can_submit_commands 1 to the contact of smalyshev in the private repo.
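As a sketch, the contact definition shown earlier in this task might look like this with the directive added. can_submit_commands is a standard contact directive in Icinga 1 / Nagios 3. Its position within the block here is an assumption, since the real definition lives in the private repo.

```cfg
define contact{
        contact_name                    smalyshev
        alias                           Stas Malychev
        host_notification_period        24x7
        service_notification_period     24x7
        host_notification_options       d,r,f
        service_notification_options    c,r,f
        email                           smalyshev@wikimedia.org
        address1                        smalyshev@wikimedia.org
        host_notification_commands      host-notify-by-email
        service_notification_commands   notify-by-email
        can_submit_commands             1       ; allow sending external commands via the web UI
}
```

With this set, the contact can submit commands only for the hosts and services it is a contact for, since no global permissions are granted in cgi.cfg.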

He could confirm he can send commands now for wdqs-services but not for other services. Just like we wanted.

And set in a role in hiera. :)

Checked and now I can control notifications for wdqs and also am getting alerts by email. Thanks!

Dzahn removed a project: Patch-For-Review.

Added "w" to service_notification_options and "u" to host_notification_options in the contact definition, so that mail is also sent for warnings and for unreachable hosts (per discussion on IRC).
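Under that change, the two option lines in the contact definition would presumably read as follows. The order of the letters is an assumption. Icinga treats the list as a set, so the order does not affect behavior.

```cfg
        host_notification_options       d,u,r,f         ; u = unreachable
        service_notification_options    w,c,r,f         ; w = warning
```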