mgmt outages for cloud* systems seem to page everyone
Closed, Resolved (Public)
Authored By

	Andrew
	May 16 2019, 3:22 PM

Description

There was just a brief power outage to the mgmt layer in B5 and the cloud* and lab* servers in that row seem to have paged the whole SRE team. In theory they should have paged the wmcs staff but not the other SREs... the general DC policy is to have things like this not page anyone.

This is something that's going to come up very infrequently, but let's try to figure out what happened so we understand how our paging config works.

Details

Subject	Repo	Branch	Lines +/-
monitoring: set wmcs servers to email when mgmt interfaces fail	operations/puppet	production	+19 -0
host monitoring: add optional contact group for mgmt interfaces	operations/puppet	production	+7 -2
Don't page on mgmt failures	operations/puppet	production	+1 -1


Related Objects

Mentioned In: T228878: Reduce Icinga alert noise
Mentioned Here: T229884: Review paging for WMCS systems

Event Timeline

Andrew created this task. May 16 2019, 3:22 PM

Restricted Application added a subscriber: Aklapper. May 16 2019, 3:22 PM

If only WMCS staff (or people explicitly opting in for pages to those systems) got paged then this is probably correct behavior. @ArielGlenn says they did not get pages. @fgiunchedi /did/ get paged though. So... ???

A list of all the people paged is at https://icinga.wikimedia.org/cgi-bin/icinga/notifications.cgi?contact=all

FWIW, the paging behavior for cloud* hosts being down also happens when the non-mgmt (i.e. production) network is involved.

aborrero merged a task: T223884: WMCS hosts management interface Icinga check should not page. May 20 2019, 9:53 AM

aborrero added a subscriber: Volans.

> If only WMCS staff (or people explicitly opting in for pages to those systems) got paged then this is probably correct behavior.

Are you sure you want to be paged for a mgmt console going down? Wouldn't it be enough to have a ticket for it and handle it non-realtime?

I just accidentally paged the whole SRE team by stopping designate-sink on cloudcontrol1004. That should definitely only page wmcs staff! Probably this is related somehow...

On the designate-sink issue:

The class '::openstack::designate::monitor' has the critical => true parameter, which is passed along to nrpe::monitor_service.

What critical does is add the contact groups sms and admins to the currently defined contact groups.

There are different ways of tackling this issue, depending on who it should page (nobody/cloud/SREs) during normal operations (outside of maintenance, etc.).
If nobody then change critical to false.
If cloud then add the cloud paging group to contact_groups (note that this will override the do_paging hiera key).
If SRE, document what needs to be done when it pages.
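
To make the three options concrete, here is a minimal Python sketch of the behaviour described above. It is a model only, not the Puppet code: the function name and the group names "wmcs-team" and "wmcs-paging" are made up for illustration, and the only thing taken from this task is that critical adds sms and admins to the existing contact groups.

```python
def effective_contact_groups(contact_groups, critical=False):
    """Return the contact groups a check would end up notifying.

    Models the behaviour described in this task: critical appends the
    'sms' and 'admins' groups to whatever groups are already defined.
    """
    groups = list(contact_groups)
    if critical:
        for extra in ("sms", "admins"):
            if extra not in groups:
                groups.append(extra)
    return groups


# "nobody": critical is false, only the configured group is notified.
print(effective_contact_groups(["wmcs-team"], critical=False))

# "cloud": critical stays false, a (hypothetical) cloud paging group is listed explicitly.
print(effective_contact_groups(["wmcs-team", "wmcs-paging"], critical=False))

# "SRE": critical is true, so the SRE-wide groups are appended.
print(effective_contact_groups(["wmcs-team"], critical=True))
```

The model ignores the do_paging hiera key mentioned above, since how it interacts with contact_groups is not spelled out here.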

It might also be worth auditing where critical => true is set.
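
A rough way to start such an audit, assuming a local checkout of operations/puppet, is a small script that lists every manifest line containing critical => true. The function name is hypothetical, and the regex is a plain text match: it will also hit commented-out lines and will miss values set through hiera or variables.

```python
import re
from pathlib import Path

# Plain text match for "critical => true" with flexible whitespace.
CRITICAL_TRUE = re.compile(r"\bcritical\s*=>\s*true\b")


def find_critical_checks(root):
    """Return (path, line number, stripped line) for each match in *.pp files under root."""
    hits = []
    for path in sorted(Path(root).rglob("*.pp")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if CRITICAL_TRUE.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits


# Usage, from the root of a checkout:
#   for path, lineno, line in find_critical_checks("."):
#       print(f"{path}:{lineno}: {line}")
```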

The critical keyword is also not clear. It should maybe be renamed to page, where the value is the team that needs to be paged.

In T223458#5238367, @ayounsi wrote:

> The critical keyword is also not clear. It should maybe be renamed to page, where the value is the team that needs to be paged.

That's true, but it goes even further since whether somebody gets paged or not is actually configured with individual contacts inside a contactgroup and not the group itself. It's just by social convention that members of that group called "sms" happen to have phone numbers as a notification method set up. The name of the group "sms" is also misleading, it should probably be renamed to "sre".
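
The point about contacts versus groups can be illustrated with a small Python model. Everything in it is hypothetical (the Contact class, the names alice and bob, the method labels); it only shows that membership in a group called "sms" does not page anyone unless the individual contact has an SMS notification method.

```python
from dataclasses import dataclass, field


@dataclass
class Contact:
    name: str
    # Notification methods configured on the contact itself, e.g. {"email", "sms"}.
    methods: set = field(default_factory=set)


def paged_contacts(groups, notified_groups):
    """Names of contacts who would receive an SMS when notified_groups are notified."""
    paged = set()
    for group in notified_groups:
        for contact in groups.get(group, []):
            if "sms" in contact.methods:
                paged.add(contact.name)
    return sorted(paged)


alice = Contact("alice", {"email", "sms"})
bob = Contact("bob", {"email"})  # in the "sms" group, but has no phone number set

groups = {"sms": [alice, bob], "admins": [alice, bob]}
print(paged_contacts(groups, ["sms", "admins"]))  # → ['alice']
```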

Change 523963 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Don't page on mgmt failures

https://gerrit.wikimedia.org/r/523963

gerritbot added a project: Patch-For-Review. Jul 17 2019, 4:20 PM

Change 523963 merged by Alexandros Kosiaris:
[operations/puppet@production] Don't page on mgmt failures

https://gerrit.wikimedia.org/r/523963

Change merged; this should be resolved now.

Reopening it, as there was a sub-issue mentioned in T223458#5238223. It might be worth forking that into its own task, though.

fgiunchedi mentioned this in T228878: Reduce Icinga alert noise. Jul 24 2019, 2:16 PM

bd808 moved this task from Inbox to Doing on the cloud-services-team (Kanban) board. Sep 11 2019, 3:15 PM

bd808 moved this task from Doing to Watching on the cloud-services-team (Kanban) board. Sep 12 2019, 9:14 PM

Dzahn unsubscribed. Sep 20 2019, 7:10 PM

Apparently, with the current setup, nobody in SRE is paged (by SMS) for mgmt interfaces, but we in WMCS do get paged:

Yup. I'm going to look at how we might be able to change that today, but I think it is related to the host settings. The WMCS mgmt interface outages stopped paging everyone after T229884.

Change 543916 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] host monitoring: add optional contact group for mgmt interfaces

https://gerrit.wikimedia.org/r/543916

In T223458#5583608, @aborrero wrote:

> Apparently, with the current setup, nobody in SRE is paged (by SMS) for mgmt interfaces, but we in WMCS do get paged:

I must admit I am not sure I understand. Isn't the above what we wanted? Or is something missing? By the way, it might be because we have overloaded the term "page". From what I gather, some people (e.g. me) use it to refer to receiving alerts by SMS, whereas others use it to refer to any kind of alert regardless of medium. Can we clearly spell out what we want? I have trouble following.

Let me give it a try:

Production SREs DON'T want to receive SMS alerts for any host, unless they specifically request it, on a per person or host level.
Production SREs DON'T want to receive SMS alerts for any mgmt interface, unless they specifically request it, on a per person or host level.
Production SREs DO want to receive IRC/email alerts for any host
Production SREs DO want to receive IRC/email alerts for any mgmt interface. <=== That's arguably wrong. We could skip those and just rely on the web interface to detect them; there is no need for alert spam as it's probably not critical, but I'd like to hear opinions.

WMCS SREs DO want to receive SMS alerts for hosts in the WMCS infrastructure
WMCS SREs DO want to receive SMS alerts for mgmt in the WMCS infrastructure.
WMCS SREs DO want to receive IRC/email alerts for any host.
WMCS SREs DO want to receive IRC/email alerts for any mgmt interface.

Is this ^ even close?

In T223458#5586937, @akosiaris wrote:

> WMCS SREs DO want to receive SMS alerts for hosts in the WMCS infrastructure
> WMCS SREs DO want to receive SMS alerts for mgmt in the WMCS infrastructure.
> WMCS SREs DO want to receive IRC/email alerts for any host.
> WMCS SREs DO want to receive IRC/email alerts for any mgmt interface.
>
> Is this ^ even close?

WMCS SREs DO NOT want to receive SMS alerts for mgmt in the WMCS infrastructure.

Otherwise I think that yes, the list is correct.
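
With that correction applied, the wish list from this thread can be written down as a lookup table. This is only a sketch: the dictionary layout, key names, and function name are made up, the "unless they specifically request it" opt-ins are not modelled, and the production IRC/email entry for mgmt is the one flagged above as arguably wrong.

```python
# (team, target) -> which channels that team wants alerts on.
# "wmcs" rows apply to hosts/mgmt in the WMCS infrastructure for SMS.
POLICY = {
    ("production", "host"): {"sms": False, "irc_email": True},
    ("production", "mgmt"): {"sms": False, "irc_email": True},  # flagged as arguably wrong
    ("wmcs", "host"): {"sms": True, "irc_email": True},
    ("wmcs", "mgmt"): {"sms": False, "irc_email": True},  # corrected: no SMS for mgmt
}


def wants_alert(team, target, channel):
    """Return True if the given team wants alerts for target on channel."""
    return POLICY[(team, target)][channel]


print(wants_alert("wmcs", "host", "sms"))
print(wants_alert("wmcs", "mgmt", "sms"))
```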

In T223458#5587065, @bd808 wrote:

> In T223458#5586937, @akosiaris wrote:
>
>> WMCS SREs DO want to receive SMS alerts for hosts in the WMCS infrastructure
>> WMCS SREs DO want to receive SMS alerts for mgmt in the WMCS infrastructure.
>> WMCS SREs DO want to receive IRC/email alerts for any host.
>> WMCS SREs DO want to receive IRC/email alerts for any mgmt interface.
>>
>> Is this ^ even close?
>
> WMCS SREs DO NOT want to receive SMS alerts for mgmt in the WMCS infrastructure.
>
> Otherwise I think that yes, the list is correct.

OK, cool. Thanks for clearing it up. I have a clearer picture now. In that case, we can probably resolve this after @Bstorm's patch gets merged and an override for mgmt_contact_group is set for WMCS.
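
As a rough illustration of what an optional mgmt contact group with an override could look like, here is a Python model. Only the name mgmt_contact_group comes from this task. The fallback to the host's own contact group when no override is set is an assumption, and the function and group names are hypothetical; the actual default in the patch may differ.

```python
def resolve_mgmt_contact_group(host_contact_group, mgmt_contact_group=None):
    """Pick the contact group used for a host's mgmt interface check.

    Assumption: with no override, the mgmt check notifies the same group
    as the host itself; with an override, only the override is used.
    """
    if mgmt_contact_group is not None:
        return mgmt_contact_group
    return host_contact_group


# No override: mgmt alerts follow the host's group (hypothetical name).
print(resolve_mgmt_contact_group("wmcs-team"))

# Override set for WMCS: mgmt alerts go to an email-only group (hypothetical name).
print(resolve_mgmt_contact_group("wmcs-team", mgmt_contact_group="wmcs-team-email"))
```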

Change 543916 merged by Bstorm:
[operations/puppet@production] host monitoring: add optional contact group for mgmt interfaces

https://gerrit.wikimedia.org/r/543916

Change 545386 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] monitoring: set wmcs servers to email when mgmt interfaces fail

https://gerrit.wikimedia.org/r/545386

Change 545386 merged by Bstorm:
[operations/puppet@production] monitoring: set wmcs servers to email when mgmt interfaces fail

https://gerrit.wikimedia.org/r/545386

Interesting, WMCS did get an SMS on labstore1005.mgmt on 3/20. That suggests something in this task either doesn't work or has been reverted/changed since this time.

Aklapper removed a project: Patch-For-Review. Feb 3 2022, 4:32 PM

Removing task assignee due to inactivity, as this open task has been assigned for more than two years. See the email sent to the task assignee on February 06th 2022 (and T295729).

Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome.

If this task has been resolved in the meantime, or should not be worked on ("declined"), please update its task status via "Add Action… 🡒 Change Status".

Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator.

Actually, let's resolve this. Last comment was 2 years ago and I have no recollection of this happening again since Brooke's comment. We can always reopen if needed.

	F30810348: image.png
	Oct 17 2019, 12:02 PM
