Remove the "Long running screen/tmux" Icinga check
Closed, Resolved, Public

Description

I propose we remove the "Long running screen/tmux" Icinga check. In reality, I think it hasn't been useful enough to justify the massive amount of alert spam it causes. It was probably useful in the past, but at this point I think it only leads to alert fatigue.

At any given time we have a handful of those in warn or alert state and it's never really actionable, because it simply means someone in SRE is doing some maintenance, and for a good reason.

Event Timeline

+1 to this.
I only found this useful in a case where we offboarded someone and I found a screen (with no activity) days after, but this was a looooong time ago.

+1 I don't remember it being useful once for me, while annoying me plenty of times.

And now we have the logout cookbook too, so even during offboarding it's not really useful.

I'm also +1 on ditching the alert

If I remember correctly, it was originally introduced to detect/prevent cases where a recurring DB maintenance task was running continuously in a user's screen session.

I don't think that was the case: I was part of the DBA team back in 2017 and I was strongly opposed to the change (T165348#3607162), and databases were excluded from the check from the beginning (https://gerrit.wikimedia.org/r/c/operations/puppet/+/377823). Maybe you mean mw maintenance?

I dug deeper and the cause seems to be a 2017 incident mentioned in the meeting notes as:

screen "api-hhvm-restarts" on neodymium restarted a bunch of api servers on Fri (screen from 2016, now stopped)

I was (too) passionately against this being on Icinga. If someone argues against removing it completely, I think we now have far better methods of reporting: something more dashboard-like and aggregated, like we have for failed puppet runs or failed prometheus scraping jobs.

Another thing is to encourage people to restart servers with some frequency (not only to remove old screens but, more importantly, to update kernels and make sure hw is not faulty), as well as building tools to automate/puppetize workflows when reasonable (e.g. long running mw deployer's maintenance).

I dug deeper and the cause seems to be a 2017 incident mentioned in the meeting notes as:

screen "api-hhvm-restarts" on neodymium restarted a bunch of api servers on Fri (screen from 2016, now stopped)

Seems I misremembered, thanks for the archeology work :-)

+1 to removing the check. We have also since enabled shell TMOUT, which helps clean up cases where shells are left idle. Currently that's a 5-day timeout.
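For reference, a minimal sketch of what such a timeout could look like, assuming it is set from a shell profile snippet. The actual Puppet-managed configuration is not shown in this task, so the placement and the `readonly` choice here are assumptions:

```shell
# Hypothetical profile snippet (not the actual production config).
# bash exits an interactive shell that sits idle at the prompt for TMOUT seconds.
# 5 days in seconds: 5 * 24 * 60 * 60 = 432000
TMOUT=$((5 * 24 * 60 * 60))
# Marking it readonly stops a user from unsetting it in their session.
readonly TMOUT
export TMOUT
```
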

Change 712123 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Disable the "long running screen/tmux session" check by default

https://gerrit.wikimedia.org/r/712123

Change 712123 merged by Jbond:

[operations/puppet@production] Disable the "long running screen/tmux session" check by default

https://gerrit.wikimedia.org/r/712123

Marostegui assigned this task to jbond.

Change has been merged by John so closing this.

Change 723543 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] monitoring: drop monitor_screens parameter

https://gerrit.wikimedia.org/r/723543

jbond reopened this task as In Progress. Fri, Sep 24, 3:35 PM

The merged change only disabled it by default; another change is coming to remove it.

Change 723988 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove Hiera entries for screen/tmux monitoring

https://gerrit.wikimedia.org/r/723988

Change 723988 merged by Muehlenhoff:

[operations/puppet@production] Remove Hiera entries for screen/tmux monitoring

https://gerrit.wikimedia.org/r/723988

Change 723543 merged by Jbond:

[operations/puppet@production] monitoring: drop monitor_screens parameter

https://gerrit.wikimedia.org/r/723543