
Check long-running screen/tmux sessions
Closed, Resolved · Public

Description

We should flag/alert long-running screen sessions; these are usually a sign of work that was forgotten, or that should rather be puppetised or launched by cron. The script should be able to whitelist some hosts (e.g. copper has plenty of screen sessions for huge builds, and so do the mediawiki script runners like terbium).
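
For illustration only, here is a minimal sketch of what such a check could look like. This is not the actual script in operations/puppet; the thresholds and exit codes just follow the usual Icinga plugin conventions, and host whitelisting is assumed to happen in Puppet/Hiera rather than inside the script itself.

  #!/bin/bash
  # Sketch of a long-running screen/tmux check (hypothetical, not the real one).
  # Exit codes follow the Icinga plugin convention: 0=OK, 1=WARN, 2=CRIT.
  WARN_SECS=${1:-864000}    # 10 days
  CRIT_SECS=${2:-1728000}   # 20 days

  status=0
  msg="OK: no long running SCREEN/tmux processes"

  # etimes = elapsed time since the process started, in seconds
  while read -r user pid etimes comm; do
      if [ "$etimes" -gt "$CRIT_SECS" ]; then
          status=2
          msg="CRIT: Long running ${comm} process. (user: ${user} PID: ${pid}, ${etimes}s > ${CRIT_SECS}s)"
      elif [ "$etimes" -gt "$WARN_SECS" ] && [ "$status" -eq 0 ]; then
          status=1
          msg="WARN: Long running ${comm} process. (user: ${user} PID: ${pid}, ${etimes}s > ${WARN_SECS}s)"
      fi
  done < <(ps -eo user,pid,etimes,comm --no-headers | awk '$4 ~ /^(SCREEN|screen|tmux)/')

  echo "$msg"
  exit "$status"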

Details

Project            Branch      Lines +/-
operations/puppet  production  +1 -1
operations/puppet  production  +1 -1
operations/puppet  production  +4 -0
operations/puppet  production  +4 -0
operations/puppet  production  +1 -1
operations/puppet  production  +4 -0
operations/puppet  production  +2 -4
operations/puppet  production  +9 -0
operations/puppet  production  +12 -0
operations/puppet  production  +1 -0
operations/puppet  production  +0 -1
operations/puppet  production  +4 -2
operations/puppet  production  +1 -1
operations/puppet  production  +64 -0
operations/puppet  production  +3 -3
operations/puppet  production  +5 -0
operations/puppet  production  +1 -1
operations/puppet  production  +1 -1
operations/puppet  production  +4 -4
operations/puppet  production  +19 -2
operations/puppet  production  +1 -1
operations/puppet  production  +3 -2
operations/puppet  production  +6 -1
operations/puppet  production  +17 -0
operations/puppet  production  +143 -64

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 377823 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: screens: whitelist build, maintenance hosts

https://gerrit.wikimedia.org/r/377823

See patch above. Based on the cumin results and feedback from the first few users, I suggest the following be whitelisted in the first round:

  • package building hosts (copper)
  • mediawiki maintenance servers (terbium/wasat)
  • salt masters (neodymium)
  • puppet masters (frontend and backend)
  • analytics_cluster::client (stat1004, notebook)
  • mariadb::core and all other mariadb::* roles (db*)
  • restbase-dev and restbase-test (but not restbase-prod)
  • labtest* (various wmcs::labtest and labtestn roles)
  • analytics_cluster::coordinator (analytics1003, data imports happen here)
  • analytics_cluster::druid::worker (druid* otto/joal replacing pivot)

what else?

See patch above. Based on the cumin results and feedback from the first few users, I suggest the following be whitelisted in the first round:

  • package building hosts (copper)
  • mediawiki maintenance servers (terbium/wasat)
  • salt masters (neodymium)
  • puppet masters (frontend and backend)

I see no reason for the backend. Frontend is fine and warranted indeed.

  • analytics_cluster::client (stat1004, notebook)
  • mariadb::core and all other mariadb::* roles (db*)

All roles? Isn't that a bit too much? I see no reason for role::tendril, for example. @jcrespo, are you OK with all mariadb:: roles being ignored by this check?

  • restbase-dev and restbase-test (but not restbase-prod)
  • labtest* (various wmcs::labtest and labtestn roles)
  • analytics_cluster::coordinator (analytics1003, data imports happen here)
  • analytics_cluster::druid::worker (druid* otto/joal replacing pivot)

what else?

are you OK with all mariadb:: roles

Every host with a mysql server (plus the mariadb::client hosts) gets a screen. Sadly, there are 30 roles for mysql servers.

I do not mind getting a reminder by email in case I forget, but I do not want an alert that leads to either a) "I've dropped your session because it was idle" (losing the environment and buffer) or b) "there is this alert ongoing, have a look", just for doing what I am supposed to do with the tools I have available now (schema changes, pt-table-checksum, data copy, pt-heartbeat, pt-kill). We should work on replacements for screen (e.g. automation for schema changes, T104459), not on making something that is currently painful (manual schema changes) even more painful for me (manual schema changes + other ops criticism + constant heads-ups). People do not help while I am swamped with schema changes (with the exception of a few mediawiki devels), and now I am even being "punished" with an ongoing alert for it. Also, Manuel likes screen a lot (much more than me), so I do not want to be notified every time he is doing his (very useful and rightful) job either.

I think a dashboard with reports, like the one we have on tendril for database reports (https://tendril.wikimedia.org/report), would be more useful. We get a list of long-running queries, for example, but we do not get alerted every time a long-running query happens. Out of habit, I look at that and grafana every single day (and multiple times a day). There is no rush on killing screens, and we can implement some centralized reports for non-time-sensitive potential issues that are not at outage level (puppet keys/certs not signed, orphan salt hosts, hosts with strange process patterns, etc.). The on-duty ops checks it once a week and resolves the items for the services they own, calls out the previous on-duty person if they didn't do it, or raises it to the ops list if it starts to be worrying.

My definition of icinga levels is: page: an outage (e.g. service down); critical: a possible outage or degradation (a host or a single host's service down); warning: a non-time-sensitive likely outage in the future (disk at 6% available, puppet disabled). All of them must be immediately actionable.

@jcrespo, fully agreed that alerts should be actionable and I don't particularly disagree with your alert definitions. This task exists precisely because a long-running forgotten screen caused a real, user-facing outage (we discussed it at an ops meeting at the time).

Also agreed that we should eventually move away from screen for your use cases, but until that happens no one is suggesting to "punish" you :) This is why we've talked about a whitelist from the beginning. I think we can exclude all screens on database hosts for now, unless you have any ideas on an alert that would help you not forget a periodic job that shouldn't run somewhere in a screen in some box. Sounds good?

See patch above. Based on the cumin results and feedback from the first few users, I suggest the following be whitelisted in the first round:

  • package building hosts (copper)
  • mediawiki maintenance servers (terbium/wasat)
  • salt masters (neodymium)
  • puppet masters (frontend and backend)
  • analytics_cluster::client (stat1004, notebook)
  • mariadb::core and all other mariadb::* roles (db*)
  • restbase-dev and restbase-test (but not restbase-prod)
  • labtest* (various wmcs::labtest and labtestn roles)
  • analytics_cluster::coordinator (analytics1003, data imports happen here)
  • analytics_cluster::druid::worker (druid* otto/joal replacing pivot)

Could you explain the rationale behind all these? I'm not sure e.g. why puppetmasters would need long-running screen sessions?

Also, it's worth noting that the outage that inspired this task involved a screen that was running on neodymium. I'm wondering if people really do leave long-running cumin/salt tasks there currently in a screen. If that's the case, then perhaps we would need to address this in a different way, like setting expectations before launching a job and warning if it runs longer than that.

  • salt masters (neodymium)

With Cumin this also extends to sarin

Also, it's worth noting that the outage that inspired this task involved a screen that was running on neodymium. I'm wondering if people really do leave long-running cumin/salt tasks there currently in a screen. If that's the case, then perhaps we would need to address this in a different way, like setting expectations before launching a job and warning if it runs longer than that.

The use case for this is wmf-reimage (which even refuses to start the reimage unless run from screen/tmux). Maybe the whitelist should also allow whitelisting specific programs.
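
As a rough sketch of how whitelisting a specific program might work (purely hypothetical; the helper name and the pattern are made up for illustration), the check could skip a session when something on a program whitelist is running inside it:

  # Hypothetical helper: succeed if anything running under the given
  # screen/tmux server process matches a whitelisted program such as wmf-reimage.
  is_whitelisted_session() {
      local session_pid=$1
      local whitelist_re='wmf-reimage'   # example pattern only
      # pstree -a shows the whole process tree (with arguments) under the session
      pstree -a "$session_pid" 2>/dev/null | grep -Eq "$whitelist_re"
  }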

So I ran a fresh cumin command to generate a current list (after quite a few sessions had already been closed after mailing people).

I used:

  [neodymium:~] $ sudo -i cumin '*' 'if pgrep screen > /dev/null; then pgrep screen | xargs ps -o user,pid,stime,command --sort=user; fi'

(The "if" part avoids outputting all processes when no screen is found; piping pgrep into ps again gives reliable output with column headers, sorted by user name.)

I was planning to paste the output into a pastebin limited to Ops, as we had talked about, but it turns out this isn't possible anymore. The option to use custom policies has apparently been limited to Phab admins. I miss this feature :( and asked in the releng channel about it.

So for now I pasted the result into the file /root/screens on neodymium, so ops can see the list with:

[neodymium:~] $ sudo cat /root/screens

Also take a look at:

[neodymium:~] $ sudo cat /root/screen-hosts

for a list of just the hostnames that currently have one or more screens, which is much easier to read to get an overview.
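
As an aside, a variant of the cumin command above (untested, just a suggestion) that also catches tmux and prints the elapsed time in seconds, which is the number a long-running check would compare against:

  [neodymium:~] $ sudo -i cumin '*' 'pids=$(pgrep -d, "SCREEN|screen|tmux"); if [ -n "$pids" ]; then ps -o user,pid,etimes,args --sort=user -p "$pids"; fi'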

I reduced the initial whitelist to: build hosts, all mariadb::* roles, puppetmaster frontend only, and restbase-dev/test. OK as a first step? Of course we can (and will) always amend it.

@jcrespo, OK with you? See how in https://gerrit.wikimedia.org/r/#/c/377823/ I excluded all the mariadb roles, so the intention is that you won't get any alerts. After merging that I would then revert the revert of the original change to add the check back to Icinga.

I'm wondering if people really do leave long-running cumin/salt tasks there currently in a screen.

All my multi-host schema changes run from neodymium, as those are the non-dedicated mariadb::client hosts. Schema change / pt-table-checksum runs can take anywhere from 1 hour to 6 months.

Did T166570 fix some outage conditions?

Dzahn changed the task status from Open to Stalled. Sep 21 2017, 2:09 PM

Setting to stalled for now. I don't personally understand the relation to T166570 yet; looks like it needs more discussion.

Twentyafterfour fixed the "phab paste with custom permissions" feature, so now I could make one again that is limited to members of Operations:

Here you go with the lists: P6034

Change 377823 merged by Dzahn:
[operations/puppet@production] icinga: initial whitelist for screen monitoring

https://gerrit.wikimedia.org/r/377823

Change 376636 abandoned by Dzahn:
icinga/base: turn screen monitoring into a WARN-only check

https://gerrit.wikimedia.org/r/376636

Change 380901 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga/base: re-enable screen/tmux monitoring

https://gerrit.wikimedia.org/r/380901

Change 381130 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] base: screen-monitor, raise CRIT limit to 1 year

https://gerrit.wikimedia.org/r/381130

Change 381130 merged by Dzahn:
[operations/puppet@production] base: screen-monitor, raise CRIT limit to 1 year

https://gerrit.wikimedia.org/r/381130

Change 380901 merged by Dzahn:
[operations/puppet@production] icinga/base: re-enable screen/tmux monitoring

https://gerrit.wikimedia.org/r/380901

Change 381136 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] screen-monitoring, whitelist es200[234] by regex

https://gerrit.wikimedia.org/r/381136

Change 381136 merged by Dzahn:
[operations/puppet@production] screen-monitoring, whitelist es200[234] by regex

https://gerrit.wikimedia.org/r/381136

Change 381137 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] screen-monitoring: whitelist all db/es by regex

https://gerrit.wikimedia.org/r/381137

Change 381137 merged by Dzahn:
[operations/puppet@production] screen-monitoring: whitelist all db/es by regex

https://gerrit.wikimedia.org/r/381137

Dzahn changed the task status from Stalled to Open. Sep 27 2017, 10:55 PM

Done! So far. It has been added to Icinga again after the merges above.

You can now see some WARNs here: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&type=detail&servicestatustypes=4&hoststatustypes=3&serviceprops=2097162

also: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=long+running

They are all WARNs for now because the CRIT threshold is set to 1 year.

No "db/dbstore/es" servers are in it anymore, they are now covered by regex, some got in accidentally at first because they were not covered by any role class names, but that's all cleaned up and fixed now.

What do you think? The only things that keep me from closing this as resolved are "should I lower the CRIT threshold to something less than a year, or should I just keep them as warnings?" and "is there anything alerting now that should be added to the whitelists?". That would be all.

@Dzahn some comments:

  • ms-fe1005 should be whitelisted until T162123 is done
  • I don't think puppetmasters should be whitelisted, all reimage stuff is now in sarin/neodymium only
  • Although reimage stuff is done on sarin/neodymium, they should never reach the threshold IF people close them after the reimage. But I think those two hosts are also used as MySQL management hosts to run long-running tasks, so they should probably be whitelisted, as per the DBAs' request.

For the rest I guess it's a matter of chasing them down with the owners and assessing whether they were leftovers that need to be closed or should be whitelisted.

Another point of discussion could be whether a screen/tmux session that is NOT running anything apart from a bash session should alarm. I can see a use case of a group of people deciding that all maintenance tasks of type X are done from a screen/tmux called foo, to be used as a shared lock. Let me give a practical example:

  • Alice and Bob have to re-image the mw* fleet
  • Alice is EU based and Bob is US based
  • Bob reimages a bunch of hosts during his work day, then Alice comes online in the EU morning and wants to continue the reimage effort
  • If they share the same screen/tmux they can easily ensure they don't step on each other's feet and continue from the last run safely, without needing specific communication.

Note our usage of neodymium/sarin is not required; I thought at the time that mixing mysql and salt querying could be interesting for scripting purposes. We can move the root clients somewhere else if it creates issues for other people.

@Volans: your use case is how we use it; we keep the state of which servers have had a particular schema change applied; it can take months, but most of the time it is idle.

Change 381240 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] screen-monitor: whitelist ms-fe1005, rm netmon2001

https://gerrit.wikimedia.org/r/381240

Change 381240 merged by Dzahn:
[operations/puppet@production] screen-monitor: whitelist ms-fe1005, rm netmon2001

https://gerrit.wikimedia.org/r/381240

Change 381422 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] check_interval, retry_interval for screen/tmux check

https://gerrit.wikimedia.org/r/381422

Today there are ~20 unhandled screen/tmux problems in icinga. Maybe this number will decrease after handling the initial problems, but I could also see this check creating a steady stream of problems to manually investigate.

IMHO it would be good to add an automatic enforcement action and alert on exceptions. Something like this:

  • Automatically kill screen/tmux sessions older than N days, but give people enough time to do their work (1 week?). Print a warning about this in the system login banner and when starting screen/tmux.
  • Alert if a screen/tmux process has been running for longer than the above time via icinga.
  • Time out idle root shells (T122922) faster than screen/tmux sessions are killed.

I think as long as people are aware that long-running and idle sessions are not allowed they can plan accordingly.
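
A hypothetical sketch of what such an enforcement cron could look like (the name and the 7-day cutoff are placeholders; nothing like this was agreed on or implemented in this task):

  # Kill screen/tmux server processes older than MAX_DAYS (sketch only).
  MAX_DAYS=7
  MAX_SECS=$((MAX_DAYS * 86400))
  ps -eo pid,etimes,comm --no-headers \
      | awk -v max="$MAX_SECS" '$3 ~ /^(SCREEN|screen|tmux)/ && $2 > max {print $1}' \
      | while read -r pid; do
            logger -t screen-reaper "killing long-running screen/tmux session, PID ${pid}"
            kill "$pid"
        done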

Today there are ~20 unhandled screen/tmux problems in icinga.

This means adding the monitoring was a success, right? It detected a lot of things we wanted to detect.

Alert if a screen/tmux process has been running for longer than the above time via icinga.

That's what we are doing, right? They are all WARNs. My question above was which threshold to set. If 8 hours is considered too short, I'm happy to set it to 1 week. Expectations differ wildly though, from hours to weeks.

Change 381422 merged by Dzahn:
[operations/puppet@production] check_interval, retry_interval for screen/tmux check

https://gerrit.wikimedia.org/r/381422

If nobody disagrees I'd whitelist the stat100[456] boxes, since several people are going to keep using screen/tmux for long computations (researchers, analysts, etc.), and also druid1003, since we are running some long-running jobs there that shouldn't be stopped.

I suggested whitelisting these (and other hosts regularly used for screens, which I found after my initial check). The reaction was that I should please justify why I want to whitelist all these, so I removed them again.

Today there are ~20 unhandled screen/tmux problems in icinga.

This means adding the monitoring was a success, right? It detected a lot of things we wanted to detect.

Totally, it seems to be working great and has found many sessions older than the 8 hour threshold.

Alert if a screen/tmux process has been running for longer than the above time via icinga.

That's what we are doing, right? They are all WARNs. My question above was which threshold to set. If 8 hours is considered too short, I'm happy to set it to 1 week. Expectations differ wildly though, from hours to weeks.

We are doing the alert part, but not yet automatic enforcement. To me it would be ideal to establish a max allowed screen/tmux session time, automatically kill sessions that exceed it and alert only if that didn't work for some reason.

Change 381456 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] screen-monitor: un-whitelist puppetmaster

https://gerrit.wikimedia.org/r/381456

Change 381456 merged by Dzahn:
[operations/puppet@production] screen-monitor: un-whitelist puppetmaster

https://gerrit.wikimedia.org/r/381456

Change 381461 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] screen-monitor: whitelist cluster::management

https://gerrit.wikimedia.org/r/381461

Change 381461 merged by Dzahn:
[operations/puppet@production] screen-monitor: whitelist cluster::management

https://gerrit.wikimedia.org/r/381461

Change 381462 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] screen-monitor: whitelist analytics hosts

https://gerrit.wikimedia.org/r/381462

  • ms-fe1005 should be whitelisted until T162123 is done

Done

  • I don't think puppetmasters should be whitelisted, all reimage stuff is now in sarin/neodymium only

Done

  • Although reimage stuff is done on sarin/neodymium, they should never reach the threshold IF people close them after the reimage. But I think those two hosts are also used as MySQL management hosts to run long-running tasks, so they should probably be whitelisted, as per the DBAs' request.

sarin and neodymium used to be whitelisted, but via the salt master role, and that role is gone now since we just removed salt. So they weren't covered anymore.

Done. They are now whitelisted by role cluster::management.

Change 381462 abandoned by Elukey:
screen-monitor: whitelist analytics hosts

https://gerrit.wikimedia.org/r/381462

Change 381469 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] screen-monitor: whitelist stat and druid hosts

https://gerrit.wikimedia.org/r/381469

Change 381469 merged by Dzahn:
[operations/puppet@production] screen-monitor: whitelist stat and druid hosts

https://gerrit.wikimedia.org/r/381469

Change 381474 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] screen-monitor: remove druid from whitelisted hosts

https://gerrit.wikimedia.org/r/381474

Change 381474 merged by Elukey:
[operations/puppet@production] screen-monitor: remove druid from whitelisted hosts

https://gerrit.wikimedia.org/r/381474

Change 381501 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] screen-monitor: whitelist mediawiki maintenance servers

https://gerrit.wikimedia.org/r/381501

Change 381501 merged by Dzahn:
[operations/puppet@production] screen-monitor: whitelist mediawiki maintenance servers

https://gerrit.wikimedia.org/r/381501

Change 381502 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] screen-monitor: raise WARN threshold to 24 hours

https://gerrit.wikimedia.org/r/381502

Change 381502 merged by Dzahn:
[operations/puppet@production] screen-monitor: raise WARN threshold to 24 hours

https://gerrit.wikimedia.org/r/381502

Change 381504 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] screen-monitor: whitelist wdqs hosts

https://gerrit.wikimedia.org/r/381504

Change 381504 merged by Dzahn:
[operations/puppet@production] screen-monitor: whitelist wdqs hosts

https://gerrit.wikimedia.org/r/381504

Today there are ~20 unhandled screen/tmux problems in icinga. Maybe this number will decrease after handling the initial problems

Reduced to 3 :)

Some of them I closed (logged in SAL), some of them I whitelisted (changes above); this includes mediawiki_maintenance, wdqs and stat*.

remaining:

puppetmaster1001 (not whitelisted anymore per Volans's comment above)

tin (deployment_servers should probably be whitelisted?)

rhenium (special project, user dkg)

And that's it now...

  • I don't think puppetmasters should be whitelisted

You have one running there yourself, do you still need it? :) There are some more by @Marostegui and @fgiunchedi but they look old (mostly).

@Dzahn Closed mine, thanks for noticing.

  • I don't think puppetmasters should be whitelisted

You have one running there yourself, do you still need it? :) There are some more by @Marostegui and @fgiunchedi but they look old (mostly).

Got rid of mine!
Thanks for the heads up!

Change 382026 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] screen-monitor: whitelist deployment servers

https://gerrit.wikimedia.org/r/382026

Change 382026 merged by Dzahn:
[operations/puppet@production] screen-monitor: whitelist deployment servers

https://gerrit.wikimedia.org/r/382026

There are no more alerts now. All the screens/tmux on puppetmaster are closed now. Deployment servers and rhenium are whitelisted. So right now Icinga is clean.

I am guessing we can finally close this?

Yes, eh.. tentatively closing :) Of course we can still comment here and reopen if necessary.

I just see minor adjustments to thresholds or whitelists in the future but that doesn't justify keeping the ticket open indefinitely. Feel free to disagree.

Change 384637 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] screen-monitor: raise WARN to 4 days, lower CRIT to 20 days

https://gerrit.wikimedia.org/r/384637

Change 384637 merged by Dzahn:
[operations/puppet@production] screen-monitor: raise WARN to 4 days, lower CRIT to 20 days

https://gerrit.wikimedia.org/r/384637

Change 427195 had a related patch set uploaded (by Dzahn; owner: Herron):
[operations/puppet@production] icinga: extend screen/tmux warning time from 4 days to 10 days

https://gerrit.wikimedia.org/r/427195

Change 427195 merged by Dzahn:
[operations/puppet@production] icinga: extend screen/tmux warning time from 4 days to 10 days

https://gerrit.wikimedia.org/r/427195

jcrespo added a subscriber: ayounsi.

I said that this was going to lead to people annoying other people about things that are non-impacting, and I agreed to the change because I was assured that this was only going to be a tool to detect bad patterns, and that SREs were never going to actively ping other people just for having things running for a few hours (it was only considered an issue if it was left like that for months).

Today I got pinged by @ayounsi about a WARNING that had been running for a few hours, which was the opposite of the consensus here about what this was going to be used for, so I would like to reopen the conversation about the usefulness of these WARNINGs. Or at least document how they were supposed to be used.

Today I got pinged by @ayounsi about a WARNING that had been running for a few hours

For the record:

WARNING - (for 2d 15h 51m 27s) - Status Information: WARN: Long running SCREEN process. (user: root PID: 13601, 1089352s > 864000s).

1089352s ≈ 12.6 days.
I'd not ping someone about a tmux running for a few hours.

Dzahn removed Dzahn as the assignee of this task. May 18 2020, 8:40 AM

I don't see the issue here given that there is an easy way to exclude some hosts and that has already happened.

@Dzahn I think documenting how one is supposed to use the WARNINGs (to adopt some of my feedback) and documenting the general idea of what not to worry about (e.g. screens running on databases) would be my criteria to resolve this. I think that is a reasonable request :-D.

Right now, this is the opposite of what I agreed to: https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens

I think "how to handle Icinga warnings" is not something specific to this task about monitoring screens.

The fact that even after 12 days it is merely a warning and _not_ a critical is already a reaction to the concerns you are raising.

Defining that "on some hosts this is normal" has also been implemented with a simple Hiera switch and you have just used it.

I don't know what else there is that can be done here.

Dzahn claimed this task.

Ok, resolving. Note: The thresholds are currently set to 240 hours (10 days) for WARN and 480 hours (20 days) for CRIT.

For context, I was opposed to this being on icinga (NOT the concept itself) because I was worried about icinga spam and pings from other users stressing SREs. I compromised because Daniel improved (in my opinion) the proposal with the added whitelist and the promise that people were going to be "cool" about them. The whitelist was implemented; the "coolness" factor was known years ago, but not documented for newer SREs.

That is now fixed. Thanks, @Dzahn.

As a long-term thing (not in the scope of this task) I would like to better separate "things that can wait" from "outages or potential outages" on different dashboards.

I would like to better separate "things that can wait" from "outages or potential outages" on different dashboards.

I agree. CRIT and WARN are not the same thing. WARNs count as "things that can wait" for me. If they can't wait they should be CRITs.

The thing is though that our custom URL https://icinga.wikimedia.org/alerts redirects to a page that shows both at the same time.

If we used these URLs separately though..

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&type=detail&servicestatustypes=16&hoststatustypes=3

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&type=detail&servicestatustypes=4&hoststatustypes=3

..then we already have separate dashboards for "things that can wait" vs "outages".

The standard Icinga UI separates these for that reason and that's why it has yellow numbers and red numbers.

In my opinion we should focus most on the red numbers (the CRIT dashboard).

I don't want to write more here because it is off topic; I agree with everything you say, but let me go in a different direction:

The main issue is that currently icinga has only 4 levels (ignore the details as this may not be 100% correct):

  • Everything is ok (GREEN)
  • Potential outage or "attend or it could become an outage soon" (WARNING)
  • Service outage (CRITICAL)
  • Fatal outage with user impact that everybody should stop and attend to

I think there is room for a class of passive monitoring of "things that you should know that are not outages".

Long-running screens would be one. In my field, "backups failed" would be another. If a backup fails, I don't care; it will just retry. But if it fails continuously it can end up at page level, and by then it is too late (we need more shades of gray).

Maybe the solution is starting doing more stuff on prometheus rather than icinga, and tools on top of that with dashboards for different teams etc. We can ask observability team if they are already working towards a better model or to support these use cases.