sge-status
Closed, Resolved (Public)

Description

Error
Ouch. Something went horribly wrong. Hopefully there is a good explanation in the error logs.

Error ID: o822989q-841c77fe
tools.sge-status@tools-sgebastion-07:~$ webservice status
Your webservice of type lighttpd is running
tools.sge-status@tools-sgebastion-07:~$ webservice restart
Restarting webservice....................
tools.sge-status@tools-sgebastion-07:~$ webservice status
Your webservice of type lighttpd is running
tools.sge-status@tools-sgebastion-07:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
 348755 0.25000 lighttpd-s tools.sge-st r     02/25/2019 11:57:28 webgrid-lighttpd@tools-sgewebg     1
{"@timestamp":"2019-02-25T11:57:53.817214+00:00","@version":1,"host":"tools-sgewebgrid-lighttpd-0902","message":"A non well formed numeric value encountered","type":"app","channel":"app","level":"CRITICAL","url":"/sge-status/","ip":"172.16.6.39","http_method":"GET","server":"tools.wmflabs.org","referrer":null,"uid":"71082b7","process_id":3575,"exception":{"class":"ErrorException","message":"A non well formed numeric value encountered","code":8,"file":"/mnt/nfs/labstore-secondary-tools-project/sge-status/tool-gridengine-status/src/Qstat.php:180","trace":["/mnt/nfs/labstore-secondary-tools-project/sge-status/tool-gridengine-status/src/Qstat.php:180","/mnt/nfs/labstore-secondary-tools-project/sge-status/tool-gridengine-status/src/Qstat.php:101","/mnt/nfs/labstore-secondary-tools-project/sge-status/tool-gridengine-status/src/StatusPage.php:62","{\"function\":\"handleGet\",\"class\":\"Tools\\\\GridEngineStatus\\\\StatusPage\",\"type\":\"->\",\"args\":[]}","/mnt/nfs/labstore-secondary-tools-project/sge-status/tool-gridengine-status/vendor/wikimedia/slimapp/src/Controller.php:122","/mnt/nfs/labstore-secondary-tools-project/sge-status/tool-gridengine-status/src/App.php:124","{\"function\":\"Tools\\\\GridEngineStatus\\\\{closure}\",\"class\":\"Tools\\\\GridEngineStatus\\\\App\",\"type\":\"->\",\"args\":[]}","/mnt/nfs/labstore-secondary-tools-project/sge-status/tool-gridengine-status/vendor/slim/slim/Slim/Route.php:468","/mnt/nfs/labstore-secondary-tools-project/sge-status/tool-gridengine-status/vendor/slim/slim/Slim/Slim.php:1355","/mnt/nfs/labstore-secondary-tools-project/sge-status/tool-gridengine-status/vendor/slim/slim/Slim/Middleware/Flash.php:85","/mnt/nfs/labstore-secondary-tools-project/sge-status/tool-gridengine-status/vendor/slim/slim/Slim/Middleware/MethodOverride.php:92","/mnt/nfs/labstore-secondary-tools-project/sge-status/tool-gridengine-status/vendor/wikimedia/slimapp/src/HeaderMiddleware.php:67","/mnt/nfs/labstore-secondary-tools-project/sge-status/tool-gridengine-status/vendor/wikimedia/slimapp/src/CsrfMiddleware.php:69","/mnt/nfs/labstore-secondary-tools-project/sge-status/tool-gridengine-status/vendor/slim/slim/Slim/Slim.php:1300","/mnt/nfs/labstore-secondary-tools-project/sge-status/tool-gridengine-status/vendor/wikimedia/slimapp/src/AbstractApp.php:171","/mnt/nfs/labstore-secondary-tools-project/sge-status/tool-gridengine-status/public/index.php:45"]},"errorId":"o822989q-841c77fe"}

Event Timeline

sge-jobs can't start due to a missing .so file:

tools.sge-jobs@tools-sgebastion-07:~$ kubectl logs sge-jobs-1683234189-mzlqw
open("/usr/lib/uwsgi/plugins/python_plugin.so"): No such file or directory [core/utils.c line 3664]
!!! UNABLE to load uWSGI plugin: /usr/lib/uwsgi/plugins/python_plugin.so: cannot open shared object file: No such file or directory !!!
[uWSGI] getting INI configuration from /data/project/sge-jobs/www/python/uwsgi.ini

Hmmm... cat disease from T161459 is spreading quickly :D

bd808 added a project: Tools.
bd808 subscribed.

The missing uWSGI plugin is a false positive. We use the same uWSGI config for python2 and python3 on Kubernetes. One or the other plugin will fail to load on each Docker image, but that does not cause problems for uWSGI itself.

The same for sge-jobs

sge-jobs fetches data from sge-status via HTTPS, so it is sort of expected to have problems with sge-jobs when sge-status is broken.

Fixed by R1921:debe524a5e73: Cast strings to int before doing math, which works around a PHP 7.1+ behavior change.

My guess as to why this started showing up as a problem: the https://tools.wmflabs.org/sge-status/ webservice Stretch grid job was started before we rolled out the PHP 7.2 runtime, so it was running PHP 7.0. It was then restarted, probably because exec nodes were being drained, and came up running PHP 7.2. The message that the app was converting to an ErrorException was introduced in PHP 7.1.
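
For anyone hitting the same thing elsewhere, here is a minimal sketch of the behavior change and the cast-first fix. It is not the code from Qstat.php (which is not shown in this task): the error handler, the `$raw` value '4G', and the variable names are all assumptions made up for illustration. Arithmetic on a string such as '4G' is silent on PHP 7.0 but raises a diagnostic on 7.1 and later, and an app that converts diagnostics into ErrorException then fails the whole request.

```php
<?php
// Hypothetical sketch; not the actual code from Qstat.php.

// Turn every PHP notice/warning into an ErrorException, similar in effect
// to what the log entry above shows the app doing.
set_error_handler(function (int $severity, string $message, string $file, int $line): bool {
    throw new ErrorException($message, 0, $severity, $file, $line);
});

// Assumed qstat-style value: leading digits followed by a unit suffix.
$raw = '4G';

// Arithmetic directly on the string: no diagnostic on PHP 7.0,
// a diagnostic (and so an exception here) on PHP 7.1 and later.
try {
    $product = $raw * 1024;
    $implicit = 'no diagnostic, result ' . $product;
} catch (ErrorException $e) {
    $implicit = 'ErrorException: ' . $e->getMessage();
}

// Cast first: (int) '4G' is 4 and raises no diagnostic.
$explicit = (int) $raw * 1024;

echo $implicit, PHP_EOL;
echo $explicit, PHP_EOL; // 4096
```

The explicit cast keeps the PHP 7.0 result (the leading digits are used, the suffix is dropped) while avoiding the diagnostic, so it only makes sense where discarding the suffix is the intended behavior.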