Page MenuHomePhabricator

Deprecate pybal SSH health checks
Open, NormalPublic

Description

Creating a ticket from what was discussed on IRC yesterday:

<bblack> does pybal RunCommand actually buy us anything in terms of pybal depooling a server more reliable in case of failure, etc?
<bblack> we only use it for the mw hosts (apaches, rendering, api), and it basically ssh's the "uptime" command - but those use IdleConnection and fetches as well
<bblack> just seems like a lot of complexity for a tertiary check that isn't directly checking the served service anyways - all the infrastructure for it with the authorized_keys setup and the user, and the config, and now firewall stuff because it's the exception for where ssh traffic could come from...
<bblack> paravoid: ^ ? thoughts
<paravoid> I guess not :)
<paravoid> but ask mark
<paravoid> as there's probably some bit of history there
<bblack> mark: ?
<mark> i'm not sure what the question is
<mark> yes, it was nice to have a generic check for a script
<mark> as we used for mediawiki apache checking
<mark> and that helped a lot back then
<mark> but i don't think we use it anywhere else
<bblack> I guess the question is more about actively configuring it for the mw* hosts today, not about the pybal feature
<mark> why wouldn't we?
<mark> it was added mostly for checking failure to deploy to new hosts, particularly due to disk failures
<mark> s/new//
<mark> if we can safely do that in other ways we can do without, but is it in the way at all?
<bblack> mark: mostly I only started looking and thinking because it's an annoying issue for SSH firewall rules. We want to enforce that ssh only comes from a handful of places for most servers. Adding rules for "all of the IPs that LVS could source a RunCommand ssh connection from" turns out to be annoying and complicated...
<mark> ah i see
<mark> well it did really help avoid pooling servers that were broken and not up to date
<mark> i wouldn't call it the end all solution to that either, but until we have a firmer alternative in place, I think we should keep it
<mark> it could be in scope for next quarter goal though
<bblack> right now all it's really validating is that ssh and /bin/sh and /usr/bin/uptime function correctly
<bblack> and I guess implicitly that puppetization was at least partly successful, enough to create the pybal-check user and its .ssh/authorized_keys
<mark> yes, but if you have a drive failure, those don't work correctly
<mark> which means that deployments with scap fail(ed)
<mark> and apache would happily keep serving traffic
<bblack> yeah. it would be nice if we could check that the service is operating correctly itself, though.
<mark> yes
<bblack> I mean in theory if there's no way to make the HTTP checks catch that issue, then technically the HTTP service doesn't need disks heh
<mark> so we've talked in the past about explicitly checking for the latest deployment on each host through pybal
<mark> and it would be great if that could be in scope for the next quarter goal :)
<bblack> yeah
<mark> and then we could get rid of this I guess
<bblack> maybe the subtask on that is to implement a self-check URL in mediawiki that looks at scap-level issues somehow
<bblack> I'm not sure what we'd compare against to catch it in a very realtime-y way though
<bblack> or push this from the other end: if an scap is mostly successful but fails on one or a few hosts, we could directly icinga-alert on that so that we know to depool it. Or take it a scary step further and auto-depool it in confd.
<bblack> or more-probably, scap will be tied into a rapid depool->update->repool via confd anyways, and it just leaves it dead if scap didn't work

Details

Related Gerrit Patches:
operations/puppet : productionRemove ferm rules for Pybal health checks
operations/puppet : productionLVS for MW: Remove RunCommand checks

Event Timeline

MoritzMuehlenhoff raised the priority of this task from to Needs Triage.
MoritzMuehlenhoff updated the task description. (Show Details)
Restricted Application added subscribers: Matanya, Aklapper. · View Herald TranscriptSep 9 2015, 6:53 AM
Dzahn triaged this task as Normal priority.Sep 14 2015, 11:29 PM
Dzahn added a subscriber: Dzahn.
BBlack moved this task from Triage to LoadBalancer on the Traffic board.Sep 30 2016, 2:03 PM
MoritzMuehlenhoff added a comment.EditedOct 18 2018, 8:10 AM

Are we ready to deprecate this now? We have disk health checks in place via Icinga for a while now which would warn us about faulty disks.

Change 536581 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] LVS for MW: Remove RunCommand checks

https://gerrit.wikimedia.org/r/536581

Change 536581 merged by BBlack:
[operations/puppet@production] LVS for MW: Remove RunCommand checks

https://gerrit.wikimedia.org/r/536581

Still TODO here before resolving: remove the ferm puppetization on the MW hosts that was allowing LVS ssh access

Change 537446 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Remove ferm rules for Pybal health checks

https://gerrit.wikimedia.org/r/537446