Restart webservice for /magnustools/
Closed, Resolved, Public

Description

Many of Magnus' tools are failing because /magnustools/ is offline, e.g.

https://tools.wmflabs.org/reasonator/?q=Q5248433

fails because all included .js files are stored under https://tools.wmflabs.org/magnustools/

  • 22 feb 15 - original issue
  • 29 mar 15 - reopened; down again; apparently bigbrother did not restart it again?

Event Timeline

valhallasw set the priority of this task to Needs Triage.
valhallasw updated the task description.
valhallasw added projects: Toolforge, Tools.
valhallasw added subscribers: valhallasw, Magnus, coren, GerardM.
scfc claimed this task.
scfc subscribed.

Someone or something seems to have restarted the webservice at 14:12Z:

tools.magnustools@tools-dev:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
8351418 0.30018 lighttpd-m tools.magnus r     02/22/2015 14:12:31 webgrid-lighttpd@tools-webgrid     1        
tools.magnustools@tools-dev:~$

There is a ~/.bigbrotherrc that predates that:

tools.magnustools@tools-dev:~$ ls -l .bigbrotherrc; cat .bigbrotherrc 
-rw-r--r-- 1 tools.magnustools tools.magnustools 28 Jan 28 02:19 .bigbrotherrc
webservice --release trusty
tools.magnustools@tools-dev:~$

but bigbrother last did something for this tool last Tuesday:

tools.magnustools@tools-dev:~$ ls -l bigbrother.log; tail -5 bigbrother.log 
-rw-r--r-- 1 root tools.magnustools 1423 Feb 17 11:18 bigbrother.log
2015-02-17 11:16:29 info: Restarting job 'lighttpd-magnustools'
Your webservice is already running
2015-02-17 11:18:09 warn: job 'lighttpd-magnustools' failed to start
2015-02-17 11:18:14 info: Restarting job 'lighttpd-magnustools'
Starting web service
tools.magnustools@tools-dev:~$
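
To check when bigbrother last acted for a tool without reading the log by eye, the timestamped lines can be parsed. This is a minimal sketch, assuming every relevant line has the `YYYY-MM-DD HH:MM:SS level: message` shape seen in the excerpt above; `parse_log` and `last_activity` are hypothetical helper names, not part of bigbrother.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the bigbrother.log excerpt:
#   2015-02-17 11:16:29 info: Restarting job 'lighttpd-magnustools'
LOG_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+): (.*)$")


def parse_log(text):
    """Return (timestamp, level, message) tuples; lines without a timestamp are skipped."""
    entries = []
    for line in text.splitlines():
        match = LOG_LINE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            entries.append((stamp, match.group(2), match.group(3)))
    return entries


def last_activity(text):
    """Return the newest timestamp in the log, or None if no line matched."""
    entries = parse_log(text)
    return max(entry[0] for entry in entries) if entries else None


SAMPLE = """\
2015-02-17 11:16:29 info: Restarting job 'lighttpd-magnustools'
Your webservice is already running
2015-02-17 11:18:09 warn: job 'lighttpd-magnustools' failed to start
2015-02-17 11:18:14 info: Restarting job 'lighttpd-magnustools'
Starting web service
"""

if __name__ == "__main__":
    print(last_activity(SAMPLE))
```

Comparing that timestamp with the `end_time` of the failed job shows whether bigbrother reacted to the outage at all.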

Looking at job #8293093, the webservice stopped at:

[…]
end_time     Sat Feb 21 17:47:51 2015
[…]
exit_status  0                   
[…]
maxvmem      3.762G
[…]

A cursory look at bigbrother's code does not show any obvious reason why it did not restart the webservice. I'll dig a bit deeper later.

There is a difference between the webservice being up and the functionality being up. The webservice was up, but that is only half the story.
Thanks,

GerardM

The bigbrother process was (re-)started after it last restarted magnustools:

root@tools-submit:~# ls -dl /proc/$(pidof perl bigbrother)
dr-xr-xr-x 8 root root 0 Feb 17 18:45 /proc/851
root@tools-submit:~#

I just stopped the webservice of one of my tools, and bigbrother did not restart it. I'll restart bigbrother and try again.

And now bigbrother restarted my tool's webservice. So a) now everything is fine and b) strange.

"Someone or something seems to have restarted the webservice at 14:12Z"

This was @yuvipanda:

15:12 <Yuvi|Vacation> !log tools.magnustools restarted webservice, had died many many times

As for why bigbrother wasn't operating: 17 feb was the labs downtime last week. Maybe there's a situation where bigbrother fails to correctly read .bigbrotherrc (say, because NFS isn't mounted yet) and then doesn't refresh the config because the config file is older than the 'last check time'?
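
The suspected failure mode can be illustrated with a small sketch. This is not bigbrother's actual code (bigbrother is a Perl script and its logic was not inspected here); `ConfigCache`, `refresh_fragile` and `refresh_robust` are hypothetical names that only model the guess above: a failed first read still advances the "last check" time, so an older rc file is never picked up afterwards.

```python
import os


class ConfigCache:
    """Hypothetical model of how a watchdog might cache one rc file."""

    def __init__(self, path):
        self.path = path
        self.last_check = 0.0     # time of the last refresh attempt
        self.loaded_mtime = None  # mtime of the file version actually loaded
        self.commands = []        # non-empty lines of the rc file

    def _load(self):
        with open(self.path) as handle:
            self.commands = [line.strip() for line in handle if line.strip()]

    def refresh_fragile(self, now):
        """Suspected flaw: last_check advances even when the file was unreadable."""
        try:
            mtime = os.path.getmtime(self.path)
        except OSError:
            # File not visible (for example, NFS not mounted yet).
            self.last_check = now
            return
        if mtime > self.last_check:
            self._load()
        self.last_check = now

    def refresh_robust(self, now):
        """Reload whenever the file differs from what was last loaded successfully."""
        try:
            mtime = os.path.getmtime(self.path)
        except OSError:
            return  # nothing recorded, so the next pass retries
        if self.loaded_mtime is None or mtime != self.loaded_mtime:
            self._load()
            self.loaded_mtime = mtime
        self.last_check = now
```

In the fragile variant, a file last modified on Jan 28 is older than a failed check on Feb 17, so it would be skipped on every later pass until the process is restarted or the file is touched.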

Looking at syslog on tools-submit, the machine shut down and rebooted (?):

Feb 17 16:51:01 tools-submit CRON[14469]: (tools.geocommons) CMD (jsub -N update -once -mem 500m -quiet php geocommons-update.php)
Feb 17 18:45:25 tools-submit kernel: imklog 5.8.6, log source = /proc/kmsg started.

Immediately afterwards, there are lots of cron warnings that it cannot get the passwd entry for various existing user names. This leads me to suspect that bigbrother had the same (temporary) problems resolving user names, perhaps defaulted to "" (nothing) for users' home directories, did not find any .bigbrotherrc files there and happily did nothing. I'll file another bug to test and fix this.
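
A sketch of that suspicion, again as an assumption rather than a reading of bigbrother's source: if a failed passwd lookup silently yields an empty home directory, the rc path degrades to a relative `.bigbrotherrc` that does not exist, which is indistinguishable from "this tool has no rc file". `rc_path_fragile` and `rc_path_strict` are hypothetical names; the `lookup` parameter exists only so the failure can be simulated.

```python
import os
import pwd


def rc_path_fragile(user, lookup=pwd.getpwnam):
    """Suspected behaviour: an unresolvable user silently gets home ''."""
    try:
        home = lookup(user).pw_dir
    except KeyError:
        home = ""
    return os.path.join(home, ".bigbrotherrc")


def rc_path_strict(user, lookup=pwd.getpwnam):
    """Return None when the user cannot be resolved, so the caller can retry later."""
    try:
        home = lookup(user).pw_dir
    except KeyError:
        return None
    return os.path.join(home, ".bigbrotherrc")
```

With the strict variant, a caller can tell a temporary name-resolution failure apart from a missing rc file and schedule another attempt instead of dropping the tool from its watch list.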

valhallasw updated the task description.
valhallasw set Security to None.

There is no new entry in the tool's bigbrother.log, but no web service either. tools-submit was rebooted six days ago, and maybe it caused the same condition with bigbrother again? I'll restart the bigbrother service on tools-submit, then start the web service for magnustools, wait a minute, stop it and see if bigbrother restarts it now.

This comment was removed by scfc.

Actually, remembering caused me to run jsub sleep 30, which was enough to make bigbrother aware of magnustools again, and it has started the web service correctly.