mgmt outages for cloud* systems seem to page everyone
Closed, Resolved, Public

Description

There was just a brief power outage to the mgmt layer in B5 and the cloud* and lab* servers in that row seem to have paged the whole SRE team. In theory they should have paged the wmcs staff but not the other SREs... the general DC policy is to have things like this not page anyone.

This is something that's going to come up very infrequently, but let's try to figure out what happened so we understand how our paging config works.

Event Timeline

If only WMCS staff (or people explicitly opting in for pages to those systems) got paged then this is probably correct behavior. @ArielGlenn says they did not get pages. @fgiunchedi /did/ get paged though. So... ???

FWIW, the paging behavior for cloud* hosts going down also happens when the non-mgmt (i.e. production) network is involved.

Are you sure you want to be paged for a mgmt console going down? Wouldn't it be enough to have a ticket for it and handle it non-realtime?

I just accidentally paged the whole SRE team by stopping designate-sink on cloudcontrol1004. That should definitely only page wmcs staff! Probably this is related somehow...

On the designate-sink issue:

The class '::openstack::designate::monitor' sets the parameter critical => true, which is passed along to nrpe::monitor_service.

Setting critical adds the contact groups sms and admins to the currently defined contact groups.
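
As a rough illustration of the behavior described above, the effect of critical can be modeled in a few lines of Python. This is only a sketch: the function name, the comma-separated representation of contact groups, and the "wmcs-team" group name are assumptions made for illustration, not the actual Puppet code.

```python
def effective_contact_groups(contact_groups: str, critical: bool) -> str:
    """Return the contact groups a check would notify.

    Assumption: contact groups are a comma-separated string, and
    critical appends "sms" and "admins" to whatever is already set.
    """
    groups = [g.strip() for g in contact_groups.split(",") if g.strip()]
    if critical:
        for extra in ("sms", "admins"):
            if extra not in groups:
                groups.append(extra)
    return ",".join(groups)


# A check owned by a hypothetical "wmcs-team" group still reaches "sms"
# once critical is set, which would explain a cloud service paging all of SRE.
print(effective_contact_groups("wmcs-team", critical=True))
```

Under this model, the only way to keep a critical check away from the sms group is to not set critical at all, which matches the options discussed in this comment.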

There are different ways of tackling this issue, depending on who it should page (nobody/cloud/SREs) during normal operations (outside of maintenance, etc.):

  • If nobody, change critical to false.
  • If cloud, add the cloud paging group to contact_groups (note that this will override the do_paging hiera key).
  • If SRE, document what needs to be done when it pages.

It might also be worth auditing where critical => true is set.

The critical keyword is also unclear. It should maybe be renamed to page, with the value being the team that needs to be paged.
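
To make the renaming suggestion concrete, here is a minimal Python sketch of what a page parameter could look like. Everything in it is hypothetical: the team names, the mapping to contact groups, and the function itself are assumptions for illustration and do not exist in the Puppet repository.

```python
from typing import Optional

# Hypothetical mapping from a team name to the contact group that pages it.
PAGING_GROUPS = {
    "sre": "sms",
    "wmcs": "wmcs-team",
}


def contact_groups_for(base_groups: list[str], page: Optional[str] = None) -> list[str]:
    """Add the paging group of the named team; page=None means nobody is paged."""
    if page is None:
        return list(base_groups)
    if page not in PAGING_GROUPS:
        raise ValueError(f"unknown team to page: {page!r}")
    group = PAGING_GROUPS[page]
    if group in base_groups:
        return list(base_groups)
    return [*base_groups, group]


print(contact_groups_for(["admins"]))               # nobody is paged
print(contact_groups_for(["admins"], page="wmcs"))  # only the WMCS group is added
```

The point of the design is that the caller has to name who gets paged, instead of a boolean that silently means "all of SRE".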

That's true, but it goes even further since whether somebody gets paged or not is actually configured with individual contacts inside a contactgroup and not the group itself. It's just by social convention that members of that group called "sms" happen to have phone numbers as a notification method set up. The name of the group "sms" is also misleading, it should probably be renamed to "sre".
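
The distinction between a group and its members can be sketched as follows. This is a simplified Python model, not the monitoring system's real data structures; the Contact class and the member names are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Contact:
    name: str
    # Notification methods configured for this person, e.g. {"email", "sms"}.
    methods: set[str] = field(default_factory=set)


def paged_by_sms(group_members: list[Contact]) -> list[str]:
    """Names of group members who would actually receive an SMS."""
    return sorted(c.name for c in group_members if "sms" in c.methods)


# Hypothetical members of a contact group that happens to be named "sms".
sms_group = [
    Contact("alice", {"email", "sms"}),
    Contact("bob", {"email"}),  # in the group, but never paged
]
print(paged_by_sms(sms_group))  # ['alice']
```

In this model the group name carries no meaning by itself: whether anyone is paged depends only on the methods each contact has configured.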

Change 523963 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Don't page on mgmt failures

https://gerrit.wikimedia.org/r/523963

Change 523963 merged by Alexandros Kosiaris:
[operations/puppet@production] Don't page on mgmt failures

https://gerrit.wikimedia.org/r/523963

akosiaris claimed this task.
akosiaris triaged this task as Low priority.
akosiaris subscribed.

Change merged, this should be resolved now.

Reopening, as there was a sub-issue mentioned in T223458#5238223. It might be worth forking it into its own task though.

Apparently with the current setup, nobody in SRE is paged (by SMS) for mgmt interface failures, but we in WMCS do get paged:

[Attached screenshot: image.png]

Yup. I'm going to look at how we might be able to change that today, but I think it is related to the host settings. The WMCS mgmt interface outages stopped paging everyone after T229884.

Change 543916 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] host monitoring: add optional contact group for mgmt interfaces

https://gerrit.wikimedia.org/r/543916

I must admit I am not sure I understand. Isn't the above what we wanted? Or is something missing? By the way, the confusion might be because we have overloaded the term "page". From what I gather, some people (e.g. me) use it to refer to receiving alerts by SMS, whereas others use it to refer to any kind of alert regardless of medium. Can we clearly spell out what we want? I have trouble following.

Let me give it a try:

  • Production SREs DON'T want to receive SMS alerts for any host, unless they specifically request it, on a per person or host level.
  • Production SREs DON'T want to receive SMS alerts for any mgmt interface, unless they specifically request it, on a per person or host level.
  • Production SREs DO want to receive IRC/email alerts for any host
  • Production SREs DO want to receive IRC/email alerts for any mgmt interface. <=== That's arguably wrong. We could skip those and just rely on the web interface to detect them; there is no need for alert spam as it's probably not critical, but I'd like to hear opinions.
  • WMCS SREs DO want to receive SMS alerts for hosts in the WMCS infrastructure
  • WMCS SREs DO want to receive SMS alerts for mgmt in the WMCS infrastructure.
  • WMCS SREs DO want to receive IRC/email alerts for any host.
  • WMCS SREs DO want to receive IRC/email alerts for any mgmt interface.

Is this ^ even close?

  • WMCS SREs DO NOT want to receive SMS alerts for mgmt in the WMCS infrastructure.

Otherwise I think that yes, the list is correct.
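
For reference, the agreed list (including the correction that WMCS does not want SMS for mgmt interfaces) can be written down as a small lookup table. This is a Python sketch for clarity only; the team and target labels and the function name are assumptions, and the per-person SMS opt-in is modeled as a simple flag.

```python
# (team, target) -> notification channels, following the list in this thread.
# "wmcs-host" and "wmcs-mgmt" mean hosts and mgmt interfaces in the WMCS
# infrastructure; plain "host" and "mgmt" mean any host or mgmt interface.
# Note: IRC/email for production SREs on mgmt was flagged as arguably wrong
# in the thread and left open for opinions.
POLICY = {
    ("production-sre", "host"): {"irc", "email"},
    ("production-sre", "mgmt"): {"irc", "email"},
    ("wmcs-sre", "host"): {"irc", "email"},
    ("wmcs-sre", "mgmt"): {"irc", "email"},
    ("wmcs-sre", "wmcs-host"): {"sms", "irc", "email"},
    ("wmcs-sre", "wmcs-mgmt"): {"irc", "email"},
}


def channels(team: str, target: str, opted_in_sms: bool = False) -> set[str]:
    """Channels on which a team member is alerted for a given target."""
    result = set(POLICY[(team, target)])
    if opted_in_sms:
        # Anyone may explicitly request SMS on a per-person or per-host level.
        result.add("sms")
    return result


print("sms" in channels("wmcs-sre", "wmcs-host"))  # True
print("sms" in channels("wmcs-sre", "wmcs-mgmt"))  # False
```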

OK, cool. Thanks for clearing it up. I have a clearer picture now. In that case, we can probably resolve this after @Bstorm's patch gets merged and an override for mgmt_contact_group is set for WMCS.
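
The override mentioned here can be pictured roughly like this. It is a Python sketch of the lookup logic only, assuming that an optional mgmt_contact_group value replaces a default group; the default "admins" and the group name "wmcs-team-email" are hypothetical, and the real Puppet implementation may differ.

```python
def mgmt_contact_group(hiera: dict, default_group: str = "admins") -> str:
    """Pick the contact group for a host's mgmt interface checks.

    Assumption: an optional "mgmt_contact_group" key overrides the default.
    """
    return hiera.get("mgmt_contact_group") or default_group


wmcs_host = {"mgmt_contact_group": "wmcs-team-email"}  # hypothetical group name
other_host = {}  # no override set

print(mgmt_contact_group(wmcs_host))   # wmcs-team-email
print(mgmt_contact_group(other_host))  # admins
```

With such an override, WMCS hosts can route mgmt alerts to an email-only group while every other host keeps the default.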

Change 543916 merged by Bstorm:
[operations/puppet@production] host monitoring: add optional contact group for mgmt interfaces

https://gerrit.wikimedia.org/r/543916

Change 545386 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] monitoring: set wmcs servers to email when mgmt interfaces fail

https://gerrit.wikimedia.org/r/545386

Change 545386 merged by Bstorm:
[operations/puppet@production] monitoring: set wmcs servers to email when mgmt interfaces fail

https://gerrit.wikimedia.org/r/545386

Interesting, WMCS did get an SMS on labstore1005.mgmt on 3/20. That suggests something in this task either doesn't work or has been reverted/changed since this time.

Removing task assignee due to inactivity, as this open task has been assigned for more than two years. See the email sent to the task assignee on February 06th 2022 (and T295729).

Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome.

If this task has been resolved in the meantime, or should not be worked on ("declined"), please update its task status via "Add Action… 🡒 Change Status".

Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator.

akosiaris claimed this task.

Actually, let's resolve this. Last comment was 2 years ago and I have no recollection of this happening again since Brooke's comment. We can always reopen if needed.