Align mw-on-k8s alerts with PHP 8.1 migration
Closed, Resolved (Public)

Description

Alerts defined in mw-on-k8s.yaml [0], such as PHPFPMTooBusy, aggregate by the k8s deployment label.

During turnup of the -next and -migration releases to support PHP 8.1, we left this label (as before) equal to the namespace name [1], largely on the basis that these releases are part of the same logical service (and are readily differentiated by other means - e.g., the release label, the servergroup tag, etc.).

However, that means the alerts inappropriately aggregate these releases together with the other releases, which is not what we want.

Instead, we can either:

  1. Extend the mediawiki chart to permit overriding the label value in some sensible way, and then use it (e.g., set it to mw-web-next in the -next deployment of mw-web). This has the downside that a number of other places where we use this label would need to be updated - e.g., service dashboards in grafana.
  2. Update the alert signal expressions in mw-on-k8s.yaml to also group by release. This has the side effect that, e.g., main and canary can alert independently (which, now that I think about it, is probably something we want if indeed the canary is a canary).
  3. Something else??

I'm inclined to reach for #2, but wanted to get your take @jijiki before proceeding.
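To make the difference between the current grouping and option #2 concrete, here is a minimal sketch in Python rather than in the actual alert expressions. Everything in it is an assumption for illustration: the sample worker counts, the 0.9 threshold, and the helper names `busy_ratio` and `firing` are hypothetical and are not taken from mw-on-k8s.yaml. It only shows how a saturated small release can be masked when the signal is grouped by deployment alone.

```python
from collections import defaultdict

# Hypothetical samples: (labels, busy workers, total workers).
# The numbers are made up; only the label names come from the task.
SAMPLES = [
    ({"deployment": "mw-web", "release": "main"}, 20, 100),
    ({"deployment": "mw-web", "release": "canary"}, 5, 20),
    ({"deployment": "mw-web", "release": "next"}, 19, 20),
]


def busy_ratio(samples, by):
    """Sum busy and total workers per label group, then divide."""
    busy = defaultdict(int)
    total = defaultdict(int)
    for labels, busy_workers, total_workers in samples:
        key = tuple(labels[name] for name in by)
        busy[key] += busy_workers
        total[key] += total_workers
    return {key: busy[key] / total[key] for key in busy}


def firing(samples, by, threshold=0.9):
    """Return the label groups whose busy ratio reaches the (assumed) threshold."""
    ratios = busy_ratio(samples, by)
    return sorted(key for key, ratio in ratios.items() if ratio >= threshold)


# Grouping by deployment only: 44 of 140 workers busy, so nothing fires.
per_deployment = firing(SAMPLES, by=("deployment",))

# Grouping by deployment and release: the -next release (19 of 20 busy) fires alone.
per_release = firing(SAMPLES, by=("deployment", "release"))

print(per_deployment)  # []
print(per_release)     # [('mw-web', 'next')]
```

With the extra grouping label, each release is evaluated against the threshold separately, which is also why main and canary would be able to alert independently.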

[0] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-sre/mw-on-k8s.yaml

[1] https://gerrit.wikimedia.org/g/operations/deployment-charts/+/393bcaf231597120b6d561328ca7869ad3476d5b/charts/mediawiki/templates/deployment.yaml.tpl#22

Event Timeline

Change #1113811 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/alerts@master] mw-on-k8s: update PHPFPMTooBusy to alert per release

https://gerrit.wikimedia.org/r/1113811

I completely agree that #2 is our best option in terms of value/effort. Moving forward with that.

Change #1114018 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/alerts@master] mw-on-k8s: aggregate remaining alerts by release name

https://gerrit.wikimedia.org/r/1114018

Thanks, @jijiki! Between your patch and mine, I believe that should cover these alerts.

Change #1113811 merged by jenkins-bot:

[operations/alerts@master] mw-on-k8s: update PHPFPMTooBusy to alert per release

https://gerrit.wikimedia.org/r/1113811

Change #1114018 merged by jenkins-bot:

[operations/alerts@master] mw-on-k8s: aggregate remaining alerts by release name

https://gerrit.wikimedia.org/r/1114018

That should now cover all of the mw-on-k8s alerts. Thanks @jijiki!