WMCS-roots paging responsibilities
Closed, Declined (Public)

Description

While discussing T344599, it was requested that anyone with root privileges on the wikireplicas should also be paged. I'm not too familiar with the wiki-replicas servers and suspect that some of the issue there is around ownership and demarcation, i.e. if something breaks on these machines due to a misuse of root privileges (whether accidental or intentional), then we require someone familiar with this service to also get paged and assist in fixing the machine. Some of these issues may get resolved via the current ownership discussions.

That aside, I wondered if we should have some general policy on users granted wmcs-roots, i.e. should they be added to the batphone paging group in VictorOps (I'm not even sure if WMCS has a batphone group)?

Personally I don't have a view on this; I just wanted to raise the ticket and will leave the discussion to wmcs and wmcs-roots members.

Event Timeline

Marostegui added subscribers: mark, KOfori.
Marostegui added subscribers: BTullis, odimitrijevic.

I am not sure what's the current status for wikireplicas alerts. I do know they do alert on IRC, but I am not sure if they already page cloud-services-team engineers or not.

jbond renamed this task from WMCS-roots pageing responibilities to WMCS-roots paging responsibilities. Aug 21 2023, 1:45 PM

That aside, I wondered if we should have some general policy on users granted wmcs-roots, i.e. should they be added to the batphone paging group in VictorOps (I'm not even sure if WMCS has a batphone group)?

WMCS does not currently have a batphone group, but we have a wmcs team in VictorOps with a multi-step escalation policy, so I believe it could be reasonable to ask members of wmcs-roots to be added to the wmcs group in VictorOps. That is not the case at the moment, though, as we have a few people in wmcs-roots that are NOT part of the wmcs group in VictorOps.

I am not sure what's the current status for wikireplicas alerts. I do know they do alert on IRC, but I am not sure if they already page cloud-services-team engineers or not.

I believe they do alert on IRC and email, but they are not tagged with wmcs in Prometheus/VictorOps, hence they do not page cloud-services-team engineers. If you can find a recent wikireplica alert I can double check that.

@fnegri keep in mind that wikireplicas do not page SRE either. The most recent alerts I can think of are the ones related to the last outage, T337446, which might have shown lag or broken replication; however, I think I caught the first alert (sanitarium) and probably downtimed the others to avoid noise. Sanitarium hosts should also be alerting, as they are part of the wiki replicas service (you can see the alert they triggered at https://wikitech.wikimedia.org/wiki/Incidents/2023-05-28_wikireplicas_lag)

I found that alert in Logstash; the alert JSON data is attached.

"team": "sre" means that it didn't page WMCS, but shouldn't it have paged SREs as it has "alert_severity": "critical"?

There are two possible reasons:

  1. I caught the alert too fast before it even paged.
  2. The host is marked as non-critical (aka irc-alert only) somewhere else.
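
For illustration only, the expectation behind the question above (a critical alert labelled with a team should page that team, and nobody else) can be written down as a small check. This is a sketch under stated assumptions: the field names "team" and "alert_severity" are taken from the Logstash JSON quoted above, but the routing rule itself and the function name `expected_page_targets` are hypothetical and do not describe the actual Alertmanager/VictorOps configuration.

```python
import json


def expected_page_targets(alert: dict) -> set:
    """Return the teams one would expect to be paged for this alert.

    Assumption (not the real routing config): only alerts with
    alert_severity == "critical" page, and they page the team named
    in the "team" label.
    """
    if alert.get("alert_severity") != "critical":
        return set()
    team = alert.get("team")
    return {team} if team else set()


# Only the two fields quoted in the discussion are used here.
sample = json.loads('{"team": "sre", "alert_severity": "critical"}')
targets = expected_page_targets(sample)
print("wmcs" in targets)  # False: the alert is not labelled for WMCS
print("sre" in targets)   # True under the assumed rule
```

If an alert matching this shape did not page SRE, either of the two reasons listed above would explain it, since both sit outside the labels carried by the alert itself.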

fnegri triaged this task as Medium priority. Sep 9 2024, 2:26 PM

In T344599: wikireplicas root access, it was decided that members of wmcs-roots should not have root access to wiki replicas hosts (clouddbXXXX). This removes the original concern about people having root access to wikireplicas without receiving pages.

As far as I understand, the only thing that the wmcs-roots group does is give its members root access to all cloud* bare metal hosts, excluding clouddb1* (wiki replicas).

we have a few people in wmcs-roots that are NOT part of the wmcs group in VictorOps.

This remains true as wmcs-roots is a group that extends to non-SREs and non-staff, so we cannot simply add everyone in that group to the on-call rota. We could consider removing people from that group (limiting their access to the Cloud infrastructure) but that is beyond the scope of this task.

The "wmcs" group in VictorOps (the people receiving on-call pages for WMCS alerts) is a smaller group that at the moment is managed ad-hoc and is not mapped to any group in either LDAP or modules/admin/data/data.yaml.
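
Since the two groups are not mapped to each other anywhere, checking the gap between them would currently be a manual comparison. A minimal sketch of that comparison, assuming both membership lists have already been exported by hand (the usernames below are hypothetical placeholders, and `roots_not_on_call` is an illustrative helper, not an existing tool):

```python
def roots_not_on_call(root_members, oncall_members):
    """Return members of the root group that are absent from the on-call group."""
    return sorted(set(root_members) - set(oncall_members))


# Hypothetical data: real lists would come from the wmcs-roots group
# definition and from the "wmcs" team in VictorOps.
wmcs_roots = ["alice", "bob", "carol"]
victorops_wmcs = ["alice"]

print(roots_not_on_call(wmcs_roots, victorops_wmcs))  # ['bob', 'carol']
```

A non-empty result is expected here, because wmcs-roots includes non-SREs and non-staff who cannot simply be added to the on-call rota.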