Page MenuHomePhabricator

Flapping Phabricator processes monitoring
Closed, ResolvedPublic

Description

I found some alarms this morning in #wikimedia-releng IRC channel (times are in UTC):

02:16:23 <icinga-wm> PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
02:18:45 <icinga-wm> RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
02:39:37 <icinga-wm> PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
02:46:41 <icinga-wm> RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 11 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
04:46:41 <icinga-wm> PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
04:49:03 <icinga-wm> RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator

I am tempted to rule out the monitoring command which has been the same since 2015:

/usr/lib/nagios/plugins/check_procs -c 3:150 -u phd

Which issues a critical state when there are less than 3 processes owned by the phd user.

Event Timeline

I think we should prioritize the broken replication issue first.

Since replications is fixed... now.. back to this.

Note how it never says "0 processes", so it's not about phd failing.

Instead it is sometimes 2, then 4 then 11(!) processes.

The reason for that is it checks for _any_ process run by UID 497 and that is not the right way to check it.

What it is supposed to check is if the service phd is running or not. Not how many unrelated things are curently run by the user that has the same name.

For example you have these when repositories are being updated:

phd      16962  0.0  0.0   2384   764 ?        S    18:50   0:00 sh -c /srv/deployment/phabricator/deployment-cache/revs/4547f31de8f69854e0cd9d3e0a802ce517360ee0/phabricator/bin/repository update  -- rEWBA
phd      16963 32.5  0.2 270816 196848 ?       S    18:50   0:01 php /srv/deployment/phabricator/deployment-cache/revs/4547f31de8f69854e0cd9d3e0a802ce517360ee0/phabricator/bin/repository update -- rEWBA
phd      17111  0.0  0.0   2384   764 ?        S    18:50   0:00 sh -c /srv/deployment/phabricator/deployment-cache/revs/4547f31de8f69854e0cd9d3e0a802ce517360ee0/phabricator/bin/repository update  -- rEPNY
phd      17113 21.0  0.1 132688 66356 ?        S    18:50   0:00 php /srv/deployment/phabricator/deployment-cache/revs/4547f31de8f69854e0cd9d3e0a802ce517360ee0/phabricator/bin/repository update -- rEPNY

To explain why this wasn't an issue in the past and now is.. I think maybe that happened when checks were moved from icinga to alertmanager.

Because I am positive when this check was implemented in icinga originally it checked for the service.

If true this would be more of a ticket for observability.

The original NRPE command was/is:

nrpe::monitor_service { 'check_phab_taskmaster':
    description   => 'PHD should be supervising processes',
    nrpe_command  => '/usr/lib/nagios/plugins/check_procs -c 3:150 -u phd',

note how it accepts any value between 3 and 150 as OK.. this is how that was always ok before.

Now it clearly does not accept that range and instead alerts for anything that is not exactly 3 or so.

question is .. where did it break?

something is at odds here, see the existing puppet config above. It's how it's always been and that creates this file locally:

@phab1001:/etc/nagios/nrpe.d# cat check_check_phab_taskmaster.cfg 
# File generated by puppet. DO NOT edit by hand
command[check_check_phab_taskmaster]=/usr/lib/nagios/plugins/check_procs -c 3:150 -u phd

and manually running the NRPE command works as it has worked before:

  /usr/lib/nagios/plugins/check_procs -c 3:150 -u phd
PROCS OK: 20 processes with UID = 497 (phd) | procs=20;;3:150;0;

20 processes is fine because it's between 3 and 150.

Why are we getting alerts from icinga-wm that apparently do different things than what is configured in the NRPE snippet??

@fgiunchedi Know of anything that could explain why we see icinga-wm output that does not match what is in Nagios NRPE config?

commit 132db90ecdb9a499a05a2d346b843616d6af61ef
Author: Filippo Giunchedi <fgiunchedi@wikimedia.org>
Date:   Mon Jul 11 13:13:33 2022 +0200

    phabricator: switch to prometheus-only network probes/checks
    
    Also set severity to page

^ But this did not touch the NRPE process check.

something is at odds here, see the existing puppet config above. It's how it's always been and that creates this file locally:

@phab1001:/etc/nagios/nrpe.d# cat check_check_phab_taskmaster.cfg 
# File generated by puppet. DO NOT edit by hand
command[check_check_phab_taskmaster]=/usr/lib/nagios/plugins/check_procs -c 3:150 -u phd

and manually running the NRPE command works as it has worked before:

  /usr/lib/nagios/plugins/check_procs -c 3:150 -u phd
PROCS OK: 20 processes with UID = 497 (phd) | procs=20;;3:150;0;

20 processes is fine because it's between 3 and 150.

Why are we getting alerts from icinga-wm that apparently do different things than what is configured in the NRPE snippet??

@fgiunchedi Know of anything that could explain why we see icinga-wm output that does not match what is in Nagios NRPE config?

It seems to me that the nrpe / check_procs part is working as expected, i.e. alerting when there are less than two processes with uid phd. As to why every now and then there are only two processes, I have no idea.

The reason for that is it checks for _any_ process run by UID 497 and that is not the right way to check it.

What it is supposed to check is if the service phd is running or not. Not how many unrelated things are curently run by the user that has the same name.

I agree 100% with you here, please consider ditching the check_procs bit and rely on the generic systemd service alert (i.e. if one of the phd unit fails).

Change 832368 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: remove Icinga monitoring for phd supervising processes

https://gerrit.wikimedia.org/r/832368

Change 832368 merged by Dzahn:

[operations/puppet@production] phabricator: remove Icinga monitoring for phd supervising processes

https://gerrit.wikimedia.org/r/832368

Change 832371 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: remove absented code for icinga phd superiving procs

https://gerrit.wikimedia.org/r/832371

Change 832371 merged by Dzahn:

[operations/puppet@production] phabricator: remove absented code for icinga phd superiving procs

https://gerrit.wikimedia.org/r/832371

I agree 100% with you here, please consider ditching the check_procs bit and rely on the generic systemd service alert (i.e. if one of the phd unit fails).

Done! I simply removed this check (first absented, then puppet run on phab*, then on alert*, then removed absented code).