in Cloud-Services:
16:38 <Niharika> sitic: Crosswatch gives a 404. What's up?
webservice is not running according to webservice status:
tools.crosswatch@tools-bastion-01:~$ webservice status
Your webservice is not running
yet there is nothing indicating that in error.log:
tools.crosswatch@tools-bastion-01:~$ tail error.log
08-30 09:27 tornado.access INFO 304 GET /crosswatch/i18n/en-f4bee95c.json (10.68.17.145) 2.18ms
08-30 09:27 tornado.access INFO 304 GET /crosswatch/i18n/de-f4bee95c.json (10.68.17.145) 1.54ms
08-30 09:27 tornado.access INFO 200 GET /crosswatch/sockjs/info?t=1440926831305 (10.68.17.145) 1.14ms
08-30 09:37 tornado.access INFO 200 HEAD /crosswatch/ (10.68.17.145) 542969.23ms
08-30 09:38 tornado.access INFO 200 HEAD /crosswatch/ (10.68.17.145) 0.98ms
08-30 09:42 tornado.access INFO 200 HEAD /crosswatch/ (10.68.17.145) 18802.51ms
08-30 09:50 tornado.access INFO 200 HEAD /crosswatch/ (10.68.17.145) 128102.72ms
08-30 09:50 tornado.access INFO 200 GET /crosswatch/ (10.68.17.145) 363.52ms
08-30 09:50 tornado.access INFO 200 HEAD /crosswatch/ (10.68.17.145) 1.11ms
08-31 19:33 root INFO Starting tornado server on port 52883.
I tried starting the server with crosswatch/scripts/start_webserver.sh, and then realized qstat actually indicated there was already a server running with a different name:
tools.crosswatch@tools-bastion-01:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  50481 0.47255 celery1    tools.crossw Rr    10/09/2015 18:28:41 continuous@tools-exec-1405.eqi     1
  50483 0.47255 celery2    tools.crossw Rr    10/09/2015 18:28:58 continuous@tools-exec-1402.eqi     1
  50485 0.47255 celery3    tools.crossw Rr    10/09/2015 18:28:58 continuous@tools-exec-1405.eqi     1
  50487 0.47255 celery4    tools.crossw Rr    10/09/2015 18:30:46 continuous@tools-exec-1401.eqi     1
 507233 0.48018 tornado-cr tools.crossw Rr    10/09/2015 19:12:25 webgrid-generic@tools-webgrid-     1
so I ran qmod -Rj 507233, but that didn't seem to have any effect (neither on error.log nor on https://tools.wmflabs.org/crosswatch/)
Oh, that's because it logs to /data/project/crosswatch/tornado-crosswatch.err instead... which says
`
/var/spool/gridengine/execd/tools-webgrid-generic-1403/job_scripts/507233: line 2: /data/project/crosswatch/crosswatch/scripts/_sge_webserver.sh: No such file or directory
`
so... I don't know? The setup is so non-standard I'm not going to dig into this further.
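For anyone hitting the same confusion about which log file a grid job actually writes to, here is a minimal sketch. It assumes that the `qstat -j` output contains `stdout_path_list:` and `stderr_path_list:` fields (my recollection of Grid Engine's output, not something shown in this thread), and `job_log_paths` is a made-up helper name.

```shell
# Filter `qstat -j <job-id>` output (read from stdin) down to the lines
# that name the job's stdout/stderr files.
# Assumption: the fields are called stdout_path_list / stderr_path_list.
job_log_paths() {
    grep -E '^(stdout|stderr)_path_list:'
}

# Usage on the bastion (not executed here):
#   qstat -j 507233 | job_log_paths
```

If those fields are present, this would have pointed at tornado-crosswatch.err rather than error.log straight away.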
tools.crosswatch@tools-bastion-01:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
(...)
1899137 0.36197 generic-cr tools.crossw Eqw   11/24/2015 01:03:08                                    1
tools.crosswatch@tools-bastion-01:~$ qstat -j 1899137
(...)
error reason 1: 12/16/2015 08:28:00 [52519:1140]: can't stat() "/data/project/crosswatch/error.log" as stdout_path:
tools.crosswatch@tools-bastion-01:~$ qmod -cj 1899137
tools.crosswatch@tools-bastion-01.eqiad.wmflabs cleared error state of job 1899137
and it's back online now.
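In case this happens again, here is a rough sketch of how to spot jobs stuck in the Eqw state before clearing them by hand. It assumes the state is the fifth whitespace-separated column of the qstat listing, as in the output above; `eqw_jobs` is a made-up helper name, not an existing tool.

```shell
# Print the IDs of all jobs whose state is Eqw.
# Reads `qstat` output on stdin; assumes job-ID is column 1 and
# state is column 5, as in the listing above.
eqw_jobs() {
    awk '$5 == "Eqw" { print $1 }'
}

# Usage on the bastion (not executed here):
#   qstat | eqw_jobs        # list the stuck job IDs
#   qstat -j <job-id>       # read the "error reason" line first
#   qmod -cj <job-id>       # then clear the error state
```

It is probably worth reading the error reason before clearing anything, since clearing the state does not fix whatever caused it.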
Now I often get: "No watchlist could be retrieved. This is an internal server error, please open a bug report if the problem persists." Is anybody else experiencing this too? If so, it may be worth opening a separate ticket.
We are back to this error again:
/var/spool/gridengine/execd/tools-webgrid-generic-1403/job_scripts/507233: line 2: /data/project/crosswatch/crosswatch/scripts/_sge_webserver.sh: No such file or directory
And as before, there's not much I can do. @Sitic, please remember that you, as tool owner, are responsible for keeping this tool running. The admins try to help out where possible (especially after large disturbances such as yesterday's), but with such a non-standard setup, this is very difficult. If you could document the setup, this would help a lot. In the longer term, switching to Docker/Kubernetes might be a good idea.
OK, so after another 45 minutes of puzzling, that error was a red herring (old log file). The actual issue was with the different kinds of webservice commands, and webservicewatcher (which restarts jobs) doesn't understand this.
How to restart crosswatch: run
$ webservice-new restart
as tools.crosswatch. I've also documented this in /data/project/crosswatch/README.md.