
Retire 'tomcat' node, make Java apps run on the generic webgrid
Closed, Resolved (Public)

Description

There's also currently only one tomcat node, so when it goes down, all Java webservices are dead.

Event Timeline

yuvipanda raised the priority of this task from to Needs Triage.
yuvipanda updated the task description.
yuvipanda added subscribers: Ricordisamoa, Andrew, scfc and 3 others.

Change 193390 had a related patch set uploaded (by Yuvipanda):
tools: Move tomcat tools to generic node

https://gerrit.wikimedia.org/r/193390

Change 193390 merged by Yuvipanda:
tools: Move tomcat tools to generic node

https://gerrit.wikimedia.org/r/193390

Change 193394 had a related patch set uploaded (by Yuvipanda):
tools: Add tomcat starter & required packages to generic nodes

https://gerrit.wikimedia.org/r/193394

Change 193394 merged by Yuvipanda:
tools: Add tomcat starter & required packages to generic nodes

https://gerrit.wikimedia.org/r/193394

How much memory are we saving by having separate nodes for lighttpd-based tasks and overprovisioning them? (If that is still true; modules/toollabs/manifests/node/web/lighttpd.pp doesn't mention anything specific.)

Otherwise, why not only have "generic execution nodes" (or "Precise exec node" and "Trusty exec node"), so we don't have to have two of each type? Fewer instances to worry about.

@scfc I agree. Am going to kill tomcat and uwsgi nodes, and merge them all into 'generic'.

Hand-moving the tomcat jobs onto the generic node now.

Change 193559 had a related patch set uploaded (by Yuvipanda):
Point users to webservice2 for tomcat

https://gerrit.wikimedia.org/r/193559

Moved them all off, and they all seem to work! yay!

Just need to merge https://gerrit.wikimedia.org/r/#/c/193559/ and then I can kill the node + related puppet code.

Change 193561 had a related patch set uploaded (by Yuvipanda):
tools: Remove tomcat node definitions from puppet

https://gerrit.wikimedia.org/r/193561

No, I didn't mean "generic web node", but "generic execution node". All jobs on all nodes, no nodes that only run jobs in a subset of queues. (Maybe prioritize web queues over others so that the start-up time of a webservice is in the interactive range.) Only one type of execution node = only one type of node to spread over the virtual servers.

I'm going to agree as well, but one step at a time :)

Change 193559 merged by Yuvipanda:
Point users to webservice2 for tomcat

https://gerrit.wikimedia.org/r/193559

Hmm, so there are still other tools that treat 'webgrid-tomcat' as they should treat webgrid-generic. I guess I should hunt them down one by one and change them.

@yuvipanda my tool find-and-replace uses the portgrabber to start its tornado webserver. I use jstart -q webgrid-generic to start the script, but it still only hits the old tools-webgrid-tomcat node and not the generic ones. Is this expected?

@Sitic: Can you also add -l release=trusty? That should transition it immediately, since otherwise jstart defaults to precise, and the newer hosts are all trusty.
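
For reference, a minimal sketch of the combined invocation suggested above, written as a small Python helper that only assembles the argument list. The helper name jstart_command and the script path ./start-server.sh are hypothetical; the queue name and the release=trusty resource request are the ones mentioned in this thread.

```python
import shlex


def jstart_command(script, queue="webgrid-generic", release="trusty"):
    """Build a jstart argument list that pins both the queue and the release.

    Hypothetical helper: it only assembles arguments and does not submit anything.
    """
    return ["jstart", "-q", queue, "-l", f"release={release}", script]


# "./start-server.sh" is a placeholder script path, not a real tool script.
print(shlex.join(jstart_command("./start-server.sh")))
```

Leaving out the release request would, per the comment above, fall back to the precise default and keep the job off the newer trusty hosts.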

Ok, so @Sitic's tool has been transitioned away!

scfc triaged this task as Low priority. Apr 6 2015, 11:08 AM

It looks like the only tools running on the -tomcat node now are ones that were started with qsub and hence have no filters restricting them to the exec nodes. I'll just get rid of the node in a couple of hours when nothing is running on it.

The node is gone, and the queue is gone too :)

Hmm, so

yuvipanda@tools-bastion-01:~$ qconf -de tools-webgrid-tomcat.eqiad.wmflabs
Host object "tools-webgrid-tomcat.eqiad.wmflabs" is still referenced in cluster queue "webgrid-generic".

Which is strange because I don't see tools-webgrid-tomcat in the webgrid-generic queue.

qhost -j (NB: qhost, no hostname) shows:

[…]
tools-webgrid-tomcat.eqiad.wmflabs lx26-amd64      8     -   15.7G       -    1.9G       -
   9804789 0.30012 opentask   tools.sugges r     04/10/2015 05:45:04 webgrid-ge MASTER

Did you delete the instance with the job still running?

Rescheduled the job, the host job list is now empty.
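
A rough sketch of how the "is anything still running on this host?" check could be scripted from qhost -j output. It assumes the layout shown in the paste above: host rows start in the first column, and job rows are indented and begin with a numeric job ID. The function name jobs_by_host is hypothetical, and the sample text is the excerpt from this task without the elided part.

```python
def jobs_by_host(qhost_output):
    """Map each host name in `qhost -j` output to the job IDs listed under it."""
    jobs = {}
    current = None
    for line in qhost_output.splitlines():
        if not line.strip():
            continue
        fields = line.split()
        if line[0].isspace():
            # Indented row: count it as a job only if it starts with a numeric ID,
            # which skips any indented column headers or separators.
            if current is not None and fields[0].isdigit():
                jobs[current].append(fields[0])
        else:
            # Unindented row: a host, unless it is the header or a dashed separator.
            if fields[0] == "HOSTNAME" or set(fields[0]) == {"-"}:
                current = None
                continue
            current = fields[0]
            jobs.setdefault(current, [])
    return jobs


SAMPLE = (
    "tools-webgrid-tomcat.eqiad.wmflabs lx26-amd64      8     -   15.7G       -    1.9G       -\n"
    "   9804789 0.30012 opentask   tools.sugges r     04/10/2015 05:45:04 webgrid-ge MASTER\n"
)

leftover = jobs_by_host(SAMPLE)["tools-webgrid-tomcat.eqiad.wmflabs"]
print(leftover)  # ['9804789']
```

An empty list for the host would suggest it is safe to retry qconf -de, as happened here once the job was rescheduled.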

scfc@tools-bastion-01:~$ qconf -de tools-webgrid-tomcat.eqiad.wmflabs
scfc@tools-bastion-01.eqiad.wmflabs removed "tools-webgrid-tomcat.eqiad.wmflabs" from execution host list
scfc@tools-bastion-01:~$
scfc@tools-bastion-01:~$ qconf -ds tools-webgrid-tomcat.eqiad.wmflabs
scfc@tools-bastion-01.eqiad.wmflabs removed "tools-webgrid-tomcat.eqiad.wmflabs" from submit host list
scfc@tools-bastion-01:~$

scfc claimed this task.

Did rm -f /data/project/.system/store/*-tools-webgrid-tomcat.eqiad.wmflabs. Anything else to do?

scfc set Security to None.

Change 193561 merged by Yuvipanda:
tools: Remove tomcat node definitions from puppet

https://gerrit.wikimedia.org/r/193561