Many grid engine backend webservices not registered at tools-proxy redis following depool restarts
Closed, Resolved (Public)

Description

@Steinsplitter reported to me that a few webservices mysteriously started returning 503 No webservice even though nothing had changed. I had thought that a webservice that exits would be restarted automatically. He pointed me to the tool commons-delinquent, so I looked:

tools.commons-delinquent@tools-sgebastion-08:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
  60038 0.26217 lighttpd-c tools.common Rr    01/06/2020 22:59:07 webgrid-lighttpd@tools-sgewebg     1        
 931131 0.40062 demon      tools.common r     11/04/2019 19:25:57 continuous@tools-sgeexec-0919.     1        
2953225 0.31705 demon      tools.common Rr    12/22/2019 09:41:50 continuous@tools-sgeexec-0924.     1

But it is not registered in the tools-proxy-05 redis.

So I checked how many tools are active on the grid but missing from redis:

webgrid-lighttpd:

12:15:08 0 ✓ zhuyifei1999@tools-sgebastion-08: ~$ comm -23 <(qstat -u \* -q webgrid-lighttpd -xml | grep JB_owner | grep -oP '(?<=<JB_owner>tools\.).+(?=</JB_owner>)' | sort) <(curl -s tools-proxy-05:8081/list | jq . | grep -oP '(?<=").+(?=": {)' | sort)
ato
blockyquery
botriconferme
catgraph
cgstat
cluebotng
commons-delinquent
convert
deadlinks
derivative
dewikinews-rss
dispenser
dow
fountain-test
freddy2001
gerakitools
germancontributioncounts
grantmetrics
gyan
igloo
inactiveadmins
ip-range-calc
jimmy
khanomalumat
linedwell
mediaviews
metaviews
mostlinkedmissing
mrmetadata
musikanimal
osmlint
patrolstats
periodibot
poiimport
portal
ptwikis
quarry
render-tests
rotbot
russbot
searchsbl
shrinitools
shuaib
shuaib-bot
sign-language-browser
slumpartikel
soweego
stockholm-mania
svgtranslate
tessdata
text2hash
timerelengteam
title-search
toolhub
toolschecker-ge-ws
tulsibot
urbanecmbot
validator
vvoters
wahldiagramm
wdmap
wikidata-timeline
wikiedudashboard-test
wikilinkbot
wptestblog2
wscontest
yemen
zhdeletionpedia
zhwiki-qualifications-check

webgrid-generic:

12:15:48 0 ✓ zhuyifei1999@tools-sgebastion-08: ~$ comm -23 <(qstat -u \* -q webgrid-generic -xml | grep JB_owner | grep -oP '(?<=<JB_owner>tools\.).+(?=</JB_owner>)' | sort) <(curl -s tools-proxy-05:8081/list | jq . | grep -oP '(?<=").+(?=": {)' | sort)
montage-dev
russbot
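
For readers who want to try the comparison without grid access, here is a minimal sketch of the same idea: extract the tool names from the `JB_owner` elements, then use `comm -23` to keep only names that are on the grid but not known to the proxy. The function names `grid_owners` and `only_in_first` and all sample data are made up for illustration; the sed pattern is a portable stand-in for the `grep -oP` lookarounds above, not what was actually run. Note that `comm` expects both inputs sorted under the same locale, hence the explicit `LC_ALL=C`.

```shell
#!/bin/sh
# Extract tool names from `qstat -xml` style output read on stdin.
# Portable sed equivalent of the grep -oP lookaround pattern (assumption:
# each JB_owner element sits on its own line, as the grep above implies).
grid_owners() {
    sed -n 's:.*<JB_owner>tools\.\(.*\)</JB_owner>.*:\1:p'
}

# Print names listed in file $1 but not in file $2 (one name per line).
only_in_first() {
    oif_a=$(mktemp)
    oif_b=$(mktemp)
    LC_ALL=C sort -u "$1" > "$oif_a"
    LC_ALL=C sort -u "$2" > "$oif_b"
    LC_ALL=C comm -23 "$oif_a" "$oif_b"
    rm -f "$oif_a" "$oif_b"
}

# Made-up sample data standing in for the qstat and proxy listings.
xml=$(mktemp)
owners=$(mktemp)
proxy=$(mktemp)
cat > "$xml" <<'EOF'
<job_list state="running">
  <JB_owner>tools.russbot</JB_owner>
</job_list>
<job_list state="running">
  <JB_owner>tools.convert</JB_owner>
</job_list>
<job_list state="running">
  <JB_owner>tools.montage-dev</JB_owner>
</job_list>
EOF
printf '%s\n' convert quarry > "$proxy"

grid_owners < "$xml" > "$owners"
only_in_first "$owners" "$proxy"   # prints montage-dev and russbot
rm -f "$xml" "$owners" "$proxy"
```

In this sample, `convert` is dropped because the proxy knows about it, and `quarry` is ignored because it is not running on the grid.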

Many tools seem to be affected, and T242166 is probably related. I am not sure what happened. Shall I mass restart?

Event Timeline

bd808 renamed this task from Many webservice not registered at tools-proxy redis to Many grid engine backend webservices not registered at tools-proxy redis following depool restarts. Jan 13 2020, 6:41 PM
bd808 edited projects, added cloud-services-team (Kanban); removed cloud-services-team.
bd808 subscribed.

Is it possible this is related to T242397: Make webservice grid jobs "non-rerunable"?

Might be.

I am 99.999% sure that the depool-triggered restarts from the work I did to shrink the grid engine webservice queue are the cause of this.

Shall I mass restart?

I will do that. I made the mess.

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T20:29:50Z] <bd808> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T20:40:20Z] <bd808> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T20:47:17Z] <bd808> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T20:52:49Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T20:53:19Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T20:53:49Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T20:54:20Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T20:54:51Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T20:55:19Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T20:55:51Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T20:56:19Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T20:56:49Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T20:57:23Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T20:57:48Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T20:58:04Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T20:58:34Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T20:59:04Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T20:59:34Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:00:04Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:00:35Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:01:04Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:01:34Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:02:04Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:02:38Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:03:04Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:03:34Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:04:04Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:04:34Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:04:37Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:05:04Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:05:34Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:06:04Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

bd808 triaged this task as Unbreak Now! priority. Jan 13 2020, 9:06 PM

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:06:34Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:07:04Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:07:33Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:08:08Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:08:35Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:09:03Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:09:33Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:09:49Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:10:19Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:10:49Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:11:19Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:11:51Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:12:20Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:12:50Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:13:23Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:13:49Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:14:19Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:14:49Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:15:19Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:15:51Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:16:20Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:16:49Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:17:19Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:17:49Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:18:20Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:18:53Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:19:19Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:19:51Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:20:19Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:20:49Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

Mentioned in SAL (#wikimedia-cloud) [2020-01-13T21:21:19Z] <wm-bot> <root> Restarted webservice to fix broken registration with the front proxy (T242538)

I ended up using this very magic command to perform the restarts and logging for most of the tools:

$ for tool in $(comm -23 <(qstat -u \* -q webgrid-lighttpd -xml | grep JB_owner | grep -oP '(?<=<JB_owner>tools\.).+(?=</JB_owner>)' | sort) <(curl -s tools-proxy-05:8081/list | jq . | grep -oP '(?<=").+(?=": {)' | sort)); do
    echo "$tool"
    sudo become "$tool" -- webservice restart
    sudo become "$tool" -- dologmsg 'Restarted webservice to fix broken registration with the front proxy (T242538)'
done
catgraph
...........Restarting webservice............
cgstat
...........Restarting webservice.............
[...snip...]
zhdeletionpedia
..........Restarting webservice..............
zhwiki-qualifications-check
..........Restarting webservice..............
$
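
A loop like the one above is easier to review before a mass restart if it can be run in a dry-run mode first. The following is only a sketch of that idea, not what was run here: the `DRY_RUN` switch and the `run` and `restart_tools` helpers are hypothetical, while the `sudo become ... -- webservice restart` and `dologmsg` invocations are copied from the loop above. Real commands get `</dev/null` so they cannot swallow the tool list being read from stdin.

```shell
#!/bin/sh
# Log message copied from the restart loop above.
MSG='Restarted webservice to fix broken registration with the front proxy (T242538)'

# Run a command, or only print it when DRY_RUN=1 (hypothetical switch).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@" </dev/null
    fi
}

# Read tool names from stdin, one per line, and restart each webservice.
restart_tools() {
    while IFS= read -r tool; do
        [ -n "$tool" ] || continue   # skip blank lines
        run sudo become "$tool" -- webservice restart
        run sudo become "$tool" -- dologmsg "$MSG"
    done
}

# Demo: dry run over two tool names taken from the transcript above.
printf '%s\n' catgraph cgstat | DRY_RUN=1 restart_tools
```

Feeding the output of the comparison command into `restart_tools` without `DRY_RUN=1` would perform the actual restarts.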

Thanks for that awesome live status comparison code @zhuyifei1999

After the scripted restarts, these 4 tools still had some issues:

  • convert
    • Job running on grid; no obvious errors in $HOME/error.log; an additional manual restart seems to have helped
  • dewikinews-rss
    • Job running on grid; $HOME/error.log so big that using it to debug is difficult (reported as T242680); manual stop/start seemed to bring service back up
  • omarghridabot
    • No sign that this tool was ever designed to run a webservice. The tool shows up on the list because it has a job named "replace.py" stuck in the qw state with no queue name assigned. Possibly caused by a manual qalter replace.py command found in the tool's $HOME/.bash_history?
  • sign-language-browser
    • Tool appears to be running as expected, but it is not listed in the curl -s tools-proxy-05:8081/list output. Another manual stop/start cycle fixed the proxy's tracking of the state.
$ comm -23 <(qstat -u \* -q webgrid-lighttpd -xml | grep JB_owner | grep -oP '(?<=<JB_owner>tools\.).+(?=</JB_owner>)' | sort) <(curl -s tools-proxy-05:8081/list | jq . | grep -oP '(?<=").+(?=": {)' | sort)
$