Crosswatch 404s / webservice down
Closed, ResolvedPublic

Description

in Cloud-Services:

16:38 <Niharika> sitic: Crosswatch gives a 404. What's up?

Event Timeline

valhallasw raised the priority of this task to Needs Triage.
valhallasw updated the task description.
valhallasw added a project: crosswatch.
valhallasw added subscribers: valhallasw, Niharika.

webservice is not running according to webservice status:

tools.crosswatch@tools-bastion-01:~$ webservice status
Your webservice is not running

yet there is nothing indicating that in error.log:

tools.crosswatch@tools-bastion-01:~$ tail error.log
08-30 09:27 tornado.access INFO     304 GET /crosswatch/i18n/en-f4bee95c.json (10.68.17.145) 2.18ms
08-30 09:27 tornado.access INFO     304 GET /crosswatch/i18n/de-f4bee95c.json (10.68.17.145) 1.54ms
08-30 09:27 tornado.access INFO     200 GET /crosswatch/sockjs/info?t=1440926831305 (10.68.17.145) 1.14ms
08-30 09:37 tornado.access INFO     200 HEAD /crosswatch/ (10.68.17.145) 542969.23ms
08-30 09:38 tornado.access INFO     200 HEAD /crosswatch/ (10.68.17.145) 0.98ms
08-30 09:42 tornado.access INFO     200 HEAD /crosswatch/ (10.68.17.145) 18802.51ms
08-30 09:50 tornado.access INFO     200 HEAD /crosswatch/ (10.68.17.145) 128102.72ms
08-30 09:50 tornado.access INFO     200 GET /crosswatch/ (10.68.17.145) 363.52ms
08-30 09:50 tornado.access INFO     200 HEAD /crosswatch/ (10.68.17.145) 1.11ms
08-31 19:33 root         INFO     Starting tornado server on port 52883.

I tried starting the server with crosswatch/scripts/start_webserver.sh, and then realized qstat actually indicated there was already a server running with a different name:

tools.crosswatch@tools-bastion-01:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  50481 0.47255 celery1    tools.crossw Rr    10/09/2015 18:28:41 continuous@tools-exec-1405.eqi     1
  50483 0.47255 celery2    tools.crossw Rr    10/09/2015 18:28:58 continuous@tools-exec-1402.eqi     1
  50485 0.47255 celery3    tools.crossw Rr    10/09/2015 18:28:58 continuous@tools-exec-1405.eqi     1
  50487 0.47255 celery4    tools.crossw Rr    10/09/2015 18:30:46 continuous@tools-exec-1401.eqi     1
 507233 0.48018 tornado-cr tools.crossw Rr    10/09/2015 19:12:25 webgrid-generic@tools-webgrid-     1

so I ran qmod -Rj 507233, but that didn't seem to have any effect (neither on error.log nor on https://tools.wmflabs.org/crosswatch/)

Oh, that's because it logs to /data/project/crosswatch/tornado-crosswatch.err instead... which says

/var/spool/gridengine/execd/tools-webgrid-generic-1403/job_scripts/507233: line 2: /data/project/crosswatch/crosswatch/scripts/_sge_webserver.sh: No such file or directory

so... I don't know? The setup is so non-standard I'm not going to dig into this further.

Yeah, seems to be back up now. Thanks @valhallasw for looking into this!

FriedhelmW subscribed.

Four hundred and four again!

Works again, mysteriously.

tools.crosswatch@tools-bastion-01:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
(...)
1899137 0.36197 generic-cr tools.crossw Eqw   11/24/2015 01:03:08                                    1

tools.crosswatch@tools-bastion-01:~$ qstat -j 1899137 
(...)
error reason    1:          12/16/2015 08:28:00 [52519:1140]: can't stat() "/data/project/crosswatch/error.log" as stdout_path:

tools.crosswatch@tools-bastion-01:~$ qmod -cj 1899137
tools.crosswatch@tools-bastion-01.eqiad.wmflabs cleared error state of job 1899137

and it's back online now.
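For anyone hitting the same Eqw state later, the recovery above can be sketched as a small helper. This is only a sketch under assumptions: `eqw_job_ids` is a hypothetical name (not part of gridengine or the Tool Labs tooling), and it assumes the job state is the fifth whitespace-separated column, as in the qstat output shown above.

```shell
# Print the IDs of grid jobs stuck in the error state (Eqw).
# Reads `qstat` output on stdin; assumes the state is the fifth column
# and that job names contain no spaces.
eqw_job_ids() {
    awk '$5 == "Eqw" { print $1 }'
}

# Hypothetical usage on a host where qstat and qmod are available:
#   qstat | eqw_job_ids | while read -r id; do
#       qstat -j "$id" | grep 'error reason'   # look at the cause first
#       qmod -cj "$id"                         # then clear the error state
#   done
```

Checking the "error reason" line before clearing is deliberate: if the cause (here, the unreachable stdout path) has not gone away, the job will likely end up in Eqw again.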

Now I often get: "No watchlist could be retrieved. This is an internal server error, please open a bug report if the problem persists." Is anybody else experiencing this too? If so, it may be worth opening a separate ticket.

Yes, please create a separate bug for it.

We are back to this error again:

/var/spool/gridengine/execd/tools-webgrid-generic-1403/job_scripts/507233: line 2: /data/project/crosswatch/crosswatch/scripts/_sge_webserver.sh: No such file or directory

And as before, there's not much I can do. @Sitic, please remember that you, as tool owner, are responsible for keeping this tool running. The admins try to help out where possible (especially after large disturbances such as yesterday's), but with such a non-standard setup, this is very difficult. If you could document the setup, that would help a lot. In the longer term, switching to Docker/Kubernetes might be a good idea.

OK, so after another 45 minutes of puzzling: that error was a red herring (it came from an old log file). The actual issue is that there are different kinds of webservice commands, and webservicewatcher (which restarts jobs) doesn't understand this.

How to restart crosswatch: run

$ webservice-new restart

as tools.crosswatch. I've also documented this in /data/project/crosswatch/README.md.
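Since the stale log file cost some time here, a quick age check before trusting a log might help next time. A minimal sketch, assuming a find(1) with -mmin and -maxdepth support; `log_is_stale` is a hypothetical helper name, and the 60-minute threshold in the usage comment is an arbitrary example value.

```shell
# Succeed (exit 0) if FILE was last modified more than MINUTES minutes ago.
# A missing file is reported as "not stale" because find prints nothing.
log_is_stale() {
    file=$1
    minutes=$2
    [ -n "$(find "$file" -maxdepth 0 -mmin "+$minutes" 2>/dev/null)" ]
}

# Hypothetical usage with the log path mentioned earlier in this task:
#   if log_is_stale /data/project/crosswatch/tornado-crosswatch.err 60; then
#       echo "log has not been written to in the last hour; probably old"
#   fi
```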

This comment was removed by Aschroet.