
fix and move check_job_queue Icinga check to eqiad (relied on NFS on spence)
Closed, Resolved, Public

Description

<nagios-wm> RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK -
include(/usr/local/apache/common/php-1.17/includes/DefaultSettings.php): failed
to open stream: No such file or directory in
/home/wikipedia/common/php-1.17/wmf-config/CommonSettings.php on line 78
Warning: include(): Failed opening
/usr/local/apache/common/php-1.17/includes/DefaultSettings.php for inclusion
(include_path=/usr/local/apache/common/php-1.17:/usr/local/apache/co
That's not an OK, that's a fatal PHP error. It probably indicates the entire
source tree is missing. I don't have SSH access to spence so I can't
investigate, but I can tell that it's not in the mediawiki-installation node
list, so it doesn't get syncs, so it probably would have broken when we moved
to php-1.17.

Details

Reference
rt1128

Event Timeline

rtimport raised the priority of this task to Medium. Dec 18 2014, 12:53 AM
rtimport added a project: ops-core.
rtimport set Reference to rt1128.

Roan, the original script was not optimal, as you described: it could
return an OK even when the output was not an integer (you can see the
source in the Gerrit link below).
To fix that I added this:
https://gerrit.wikimedia.org/r/#patch,sidebyside,1086,1,files/nagios/check_job_queue
I also changed the path from php-1.5 to php-1.17, as the PHP scripts now reside
there, but now we have this follow-up problem:
Unable to load dynamic library '/usr/lib/php5/20090626/apc.so' -
/usr/lib/php5/20090626/apc.so: cannot open shared object file: No such file or
directory
and it would currently result (not live) in this:
JOBQUEUE CRITICAL - check plugin (lala) or PHP errors - MWMultiVersion instance
initialized! MWScript.php wrapper not used? plwiki
On spence, apc.so is instead in /usr/lib/php5/20060613/apc.so because it is
installed from the distro package (dpkg -L php5-apc, dpkg -l | grep apc) ...
Wanna review my change and take a look at it from here and maybe make it
customizable?
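
For illustration, here is a minimal sketch of the kind of integer check described above. It is written in Python rather than the language of the actual plugin, and the function name, thresholds, and message wording are assumptions made for this example, not the contents of the Gerrit change. The point it shows: anything that is not a bare integer (PHP warnings, fatals, empty output) is reported as a failure instead of falling through to OK.

```python
import re

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(raw_output, warn=5000, crit=10000):
    """Map the raw output of a job-count command to (exit_code, message).

    The warn/crit thresholds are illustrative assumptions.
    """
    text = raw_output.strip()
    # Only a bare non-negative integer counts as a valid job count.
    if not re.fullmatch(r"[0-9]+", text):
        first_line = text.splitlines()[0] if text else "no output"
        return CRITICAL, "JOBQUEUE CRITICAL - check plugin or PHP errors - " + first_line
    jobs = int(text)
    if jobs >= crit:
        return CRITICAL, "JOBQUEUE CRITICAL - %d entries" % jobs
    if jobs >= warn:
        return WARNING, "JOBQUEUE WARNING - %d entries" % jobs
    return OK, "JOBQUEUE OK - %d entries" % jobs
```

A wrapper script would print the message and exit with the returned code, so that Nagios sees the PHP error as CRITICAL rather than as a recovery.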

Status changed from 'new' to 'open' by dzahn

On Fri Nov 25 11:02:03 2011, dzahn wrote:

> Also changed the path from php-1.5 to php-1.17 as the PHP scripts now reside
> there, but now we have this follow-up problem:

They don't actually live there any more either, but that's OK. I've changed the
script to use the mwscript wrapper in https://gerrit.wikimedia.org/r/1087 , so
it now uses the appropriate version of the script for each wiki automatically.

> Unable to load dynamic library '/usr/lib/php5/20090626/apc.so' -
> /usr/lib/php5/20090626/apc.so: cannot open shared object file: No such file or
> directory
>
> and it would currently result (not live) in this:
>
> JOBQUEUE CRITICAL - check plugin (lala) or PHP errors - MWMultiVersion instance
> initialized! MWScript.php wrapper not used? plwiki
>
> on spence, apc.so is instead in: /usr/lib/php5/20060613/apc.so because it is
> installed from the distro package: (dpkg -L php5-apc , dpkg -l | grep apc ) ...
>
> Wanna review my change and take a look at it from here and maybe make it
> customizable?

My commit should have fixed the MWMultiVersion error. The APC error seems to
still be present:
<catrope at spence:~$ php -v
PHP Deprecated: Comments starting with '#' are deprecated in
/etc/php5/cli/conf> cannot open
shared object file: No such file or directory in Unknown on line 0
I'll see if I can fix that.

On Fri Nov 25 11:15:34 2011, catrope wrote:

> The APC error seems to still be present:

<catrope at spence:~$ php -v
PHP Deprecated: Comments starting with '#' are deprecated in
/etc/php5/cli/conf> so someone needs to

figure out which ones are needed and put them in puppet.

Tim commented on private-l, regarding the fenari upgrade, that php-apc is actually the
right package, not php5-apc, as Ubuntu seems to be dropping the "5" from
package names one by one.
Meanwhile, neither of them is installed on spence anymore.
Merging our Gerrit changes; let's see if that fixes it now.

* 11:52 mutante: spence still slow, restarted nagios (as opposed to just
reloading)
* 11:03 mutante: spence changed check_interval for check_job_queue from 1 to
15 (min) as it takes ~5 minutes to complete currently - reloaded nagios
* 10:52 mutante: check_job_queue Nagios check finally works again :) - but
why do we check for "3rd biggest" (fix me)
* 10:43 RoanKattouw: Added spence to
fenari:/etc/dsh/group/mediawiki-installation
* 10:41 mutante: spence - "chown -R mwdeploy:mwdeploy
/usr/local/apache/common-local" per Roan, sync-common again, finished
fine..after a while. (spence had really outdated copy of MW source tree)
* 10:31 mutante: ran sync-common from spence and got a ton of permission
errors
* 10:29 mutante: spence added root to sudoers to fix sync-common
* 10:25 mutante: spence created mwdeploy user to fix sync-common
* 10:17 mutante: spence - fix comment style in apc.ini (# -> //) to get rid
of 'PHP Deprecated' messages - install php-apc for check_job_queue
* 10:11 mutante: copied script files provided by wikimedia-task-appserver
from fenari to spence
<-- this was all related to getting check_job_queue to work again.
11:49 < mutante> JOBQUEUE OK - plwiki has the 3rd biggest job queue, 42 entries
:) yay
11:49 < RoanKattouw> whee
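
As a rough sketch of what a "3rd biggest" check could look like (Python, with hypothetical wiki names, sizes, and threshold; the log above does not say why the 3rd biggest queue was chosen, and this does not claim to reproduce the real script):

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 3 -> '3rd'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "%d%s" % (n, suffix)


def nth_biggest(queue_sizes, n=3):
    """Return (wiki, size) of the n-th largest queue, or None if fewer than n wikis."""
    ranked = sorted(queue_sizes.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) < n:
        return None
    return ranked[n - 1]


def status_line(queue_sizes, n=3, crit=10000):
    """Build a Nagios-style (exit_code, message) pair; crit is an assumed threshold."""
    hit = nth_biggest(queue_sizes, n)
    if hit is None:
        return 3, "JOBQUEUE UNKNOWN - fewer than %d job queues found" % n
    wiki, size = hit
    state, code = ("CRITICAL", 2) if size >= crit else ("OK", 0)
    return code, "JOBQUEUE %s - %s has the %s biggest job queue, %d entries" % (
        state, wiki, ordinal(n), size)


# Hypothetical per-wiki job counts, for illustration only.
sample = {"enwiki": 120, "dewiki": 77, "plwiki": 42, "frwiki": 5}
print(status_line(sample)[1])
```

One possible reading of the design is that ranking by the n-th largest queue tolerates one or two outlier wikis, but that is a guess.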
And this should be part of the long-term fix:
https://gerrit.wikimedia.org/r/#change,1472
https://gerrit.wikimedia.org/r/#change,1473

- puppetized it, fixed path, fixed duplicate definition, got rid of
non-puppetized definition, made check interval configurable, etc.
- Roan completely rewrote the check itself
and now it works for real:
http://nagios.wikimedia.org/nagios/cgi-bin/extinfo.cgi?type=2&host=spence&service=check_job_queue
:)

Status changed from 'open' to 'resolved' by dzahn

Status changed from 'resolved' to 'open' by dzahn

Need to reopen.
It has been broken since we switched from spence to neon,
because it relies on having /h/w/ and MWScript.php, which neon does not have.
It needs to run elsewhere, like on the eqiad versions of hume or fenari.
See bug 36835: https://bugzilla.wikimedia.org/show_bug.cgi?id=36835
<root at neon:/usr/lib/nagios/plugins# >
JOBQUEUE OK - all job queues below 10,000
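
A minimal sketch (Python, not the deployed plugin) of a threshold check that produces a status line like the one above. The function name and the handling of an empty result are assumptions for illustration: if no queue sizes could be collected at all, for example because MediaWiki is not installed on the host, the sketch reports UNKNOWN instead of OK.

```python
def check_all_queues(queue_sizes, limit=10000):
    """Return a Nagios-style (exit_code, message) for a dict of wiki -> job count."""
    if not queue_sizes:
        # No data is not the same as "all queues are small".
        return 3, "JOBQUEUE UNKNOWN - no job queue sizes could be read"
    over = {wiki: size for wiki, size in queue_sizes.items() if size >= limit}
    if not over:
        return 0, "JOBQUEUE OK - all job queues below {:,}".format(limit)
    details = ", ".join(
        "{} ({:,})".format(wiki, size) for wiki, size in sorted(over.items())
    )
    return 2, "JOBQUEUE CRITICAL - job queues at or above {:,}: {}".format(limit, details)
```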

Subject changed from 'check_job_queue Nagios check on spence throws PHP fatal and interprets it as OK' to 'fix and move check_job_queue Icinga check to eqiad (relied on NFS on spence)' by dzahn

Argh, we need to bump this up.
https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=job_queue
But do we really want to make neon a MediaWiki install? I'd rather not, and instead run
this command on a random appserver via ssh.
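
A sketch of the ssh idea, in Python, under stated assumptions: the host name and remote command passed in are placeholders, the 600-second timeout is a guess based on the roughly 5 minute runtime mentioned earlier, and `runner` is only there so the function can be exercised without a real connection. ssh itself exits with 255 on connection errors, which the sketch maps to UNKNOWN.

```python
import subprocess


def build_ssh_command(host, remote_command):
    """Build the ssh argument list; BatchMode=yes fails instead of prompting."""
    return ["ssh", "-o", "BatchMode=yes", host, remote_command]


def run_remote_check(host, remote_command, runner=subprocess.run, timeout=600):
    """Run a check plugin on another host and pass its status through.

    Returns a Nagios-style (exit_code, message) pair.
    """
    try:
        result = runner(
            build_ssh_command(host, remote_command),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return 3, "JOBQUEUE UNKNOWN - remote check timed out"
    output = result.stdout.strip()
    # Valid plugin exit codes are 0-3; anything else (such as ssh's 255) or an
    # empty status line means the check itself did not run properly.
    if result.returncode not in (0, 1, 2, 3) or not output:
        return 3, "JOBQUEUE UNKNOWN - remote check failed (exit %d)" % result.returncode
    return result.returncode, output.splitlines()[0]
```

This keeps the monitoring host free of a MediaWiki install: it only needs ssh access to a host that already has one.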

Terbium is the new hume. How about there? It's set up as an app server already.

py wrote:

jobqueue check now on terbium:
https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=terbium&nostatusheader
should also be on hume.

Status changed from 'open' to 'resolved' by py

Dzahn changed the visibility from "WMF-NDA (Project)" to "Public (No Login Required)". Apr 19 2024, 7:21 PM
Dzahn changed the edit policy from "WMF-NDA (Project)" to "All Users".