There's also currently only one Tomcat node, so when it goes down, all Java webservices are dead.
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | None | T90534 Make toolforge reliable enough (tracking)
Declined | | None | T91068 Set up a schedule for doing failover exercises for toollabs
Resolved | | Andrew | T90542 Make sure that toollabs can function fully even with one virt* host fully down
Resolved | | yuvipanda | T91066 Retire 'tomcat' node, make Java apps run on the generic webgrid
Event Timeline
Change 193390 had a related patch set uploaded (by Yuvipanda):
tools: Move tomcat tools to generic node
Change 193394 had a related patch set uploaded (by Yuvipanda):
tools: Add tomcat starter & required packages to generic nodes
Change 193394 merged by Yuvipanda:
tools: Add tomcat starter & required packages to generic nodes
How much memory are we saving by having separate nodes for lighttpd-based tasks and overprovisioning them? (If that is still true; modules/toollabs/manifests/node/web/lighttpd.pp doesn't mention anything specific.)
Otherwise, why not only have "generic execution nodes" (or "Precise exec node" and "Trusty exec node"), so we don't have to have two of each type? Fewer instances to worry about.
@scfc I agree. Am going to kill tomcat and uwsgi nodes, and merge them all into 'generic'.
Change 193559 had a related patch set uploaded (by Yuvipanda):
Point users to webservice2 for tomcat
Moved them all off, and they all seem to work! yay!
Just need to merge https://gerrit.wikimedia.org/r/#/c/193559/ and then I can kill the node + related puppet code.
Change 193561 had a related patch set uploaded (by Yuvipanda):
tools: Remove tomcat node definitions from puppet
No, I didn't mean "generic web node", but "generic execution node". All jobs on all nodes, no nodes that only run jobs in a subset of queues. (Maybe prioritize web queues over others so that the start-up time of a webservice is in the interactive range.) Only one type of execution node = only one type of node to spread over the virtual servers.
Hmm, so there are still other tools that treat 'webgrid-tomcat' as they should treat webgrid-generic. I guess I should hunt them down one by one and change it.
@yuvipanda: My tool find-and-replace uses the portgrabber to start its tornado webserver. I use jstart -q webgrid-generic to start the script, but it still only lands on the old tools-webgrid-tomcat node and not on the generic ones. Is this expected?
@Sitic: Can you also add -l release=trusty? That should transition it immediately, since otherwise jstart defaults to precise, and the newer hosts are all trusty.
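For reference, the suggestion above amounts to an invocation like the one below. This is only a minimal sketch: the queue name and the release=trusty request come from this thread, while ./run_server.sh is a hypothetical placeholder for the tool's start script. The function just prints the command instead of submitting a job, so it can be checked before running it on a bastion.

```shell
#!/bin/sh
# Sketch only: prints the jstart command rather than submitting anything.
# ./run_server.sh is a hypothetical placeholder, not a real file from the tool.
webgrid_cmd() {
    script="$1"
    # -q webgrid-generic : the queue named in this thread
    # -l release=trusty  : the resource request suggested above; per the
    #                      comment, jstart otherwise defaults to precise
    printf 'jstart -q webgrid-generic -l release=trusty %s\n' "$script"
}

webgrid_cmd ./run_server.sh
```

Copy the printed line into the shell once it looks right.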
It looks like the only tools still running on the -tomcat node are ones that were started with qsub and hence have no filters restricting them to the exec nodes. I'll just get rid of the node in a couple of hours when nothing is running on it.
Hmm, so
yuvipanda@tools-bastion-01:~$ qconf -de tools-webgrid-tomcat.eqiad.wmflabs
Host object "tools-webgrid-tomcat.eqiad.wmflabs" is still referenced in cluster queue "webgrid-generic".
Which is strange because I don't see tools-webgrid-tomcat in the webgrid-generic queue.
qhost -j (NB: qhost, no hostname) shows:
[…]
tools-webgrid-tomcat.eqiad.wmflabs lx26-amd64 8 - 15.7G - 1.9G -
    9804789 0.30012 opentask tools.sugges r 04/10/2015 05:45:04 webgrid-ge MASTER
Did you delete the instance with the job still running?
scfc@tools-bastion-01:~$ qconf -de tools-webgrid-tomcat.eqiad.wmflabs
scfc@tools-bastion-01.eqiad.wmflabs removed "tools-webgrid-tomcat.eqiad.wmflabs" from execution host list
scfc@tools-bastion-01:~$ qconf -ds tools-webgrid-tomcat.eqiad.wmflabs
scfc@tools-bastion-01.eqiad.wmflabs removed "tools-webgrid-tomcat.eqiad.wmflabs" from submit host list
scfc@tools-bastion-01:~$
Did rm -f /data/project/.system/store/*-tools-webgrid-tomcat.eqiad.wmflabs. Anything else to do?
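To summarize, the cleanup steps taken in this thread can be sketched as a dry-run script. It only prints the commands in the order they were used here (job check, execution host removal, submit host removal, store file cleanup) and runs nothing. The function name decommission_steps is made up for illustration, and whether further steps are needed is still the open question above.

```shell
#!/bin/sh
# Dry-run sketch of the cleanup steps from this thread. It only prints
# commands; nothing is executed against the grid.
HOST="tools-webgrid-tomcat.eqiad.wmflabs"

# decommission_steps is a hypothetical helper name, not an existing tool.
decommission_steps() {
    host="$1"
    # 1. Look for jobs still listed under the host (qhost takes no hostname here).
    echo "qhost -j"
    # 2. Remove the host from the execution host list.
    echo "qconf -de $host"
    # 3. Remove the host from the submit host list.
    echo "qconf -ds $host"
    # 4. Remove the leftover store files for the host (glob is printed, not expanded).
    echo "rm -f /data/project/.system/store/*-$host"
}

decommission_steps "$HOST"
```

As seen earlier in the thread, qconf -de refuses to remove a host that is still referenced in a cluster queue, so the job check in step 1 is worth doing first.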