
Rebuild a bunch of tools instances
Closed, Resolved (Public)

Description

Live migration of instances causes the full (non-copy-on-write) size of the instance to be allocated on the new host. So now the labvirt nodes are full of zeroed space and there isn't room to migrate any more hosts there.

We need to rebuild any tools instances that can be rebuilt non-disruptively. That'll shrink things down and give us the space we need. Exec nodes are a good place to start, since they are xlarge size.

There's no rush on this, so we should be as gentle as possible. Create new nodes, depool old nodes, wait a day, etc.

Event Timeline

Andrew created this task. (Apr 28 2015, 5:15 PM)
Andrew assigned this task to yuvipanda.
Andrew raised the priority of this task from to Needs Triage.
Andrew updated the task description.
Andrew added projects: Cloud-Services, Toolforge.
Andrew added a subscriber: Andrew.
Restricted Application added a subscriber: Aklapper. (Apr 28 2015, 5:15 PM)
valhallasw triaged this task as High priority. (Apr 28 2015, 7:40 PM)
valhallasw added a subscriber: valhallasw.

I'll still mark this as 'High', as this should happen sooner than most bugs that are 'Normal' priority.

valhallasw moved this task from Triage to In Progress on the Toolforge board. (Apr 28 2015, 7:40 PM)

Yup. Let's just fix the disk layout as we go as well.

I propose we make them all xlarge and have 10 for precise and 5 for trusty (exec nodes). Webgrid can come later.

coren added a subscriber: coren. (Apr 28 2015, 7:48 PM)

It's not clear to me that fewer large instances are better than more small instances. Because of virtualization we don't "pay" for idle cycles, and small instances are easier to balance, faster to drain, and have less impact when they go down. Gridengine is not affected by the number of nodes (or at least, not at the kind of numbers we are talking about), and turnaround for jobs tends to be better if there are many instances available to the scheduler.
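The "less impact when they go down" point can be put in rough numbers. The Python sketch below is only an illustration: the function name is hypothetical, the nodes are assumed to be equally sized, and the node counts are made up for the comparison (they assume one xlarge is worth about two larges, which this task does not state).

```python
def capacity_lost_fraction(node_count):
    """Fraction of total capacity lost when one of node_count equal nodes fails."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return 1 / node_count


if __name__ == "__main__":
    # Illustrative pool sizes only (assumption: same total capacity either way).
    for label, count in (("fewer, larger nodes", 10), ("more, smaller nodes", 20)):
        print(f"{label}: {capacity_lost_fraction(count):.0%} of capacity per failed node")
```

With the same total capacity, doubling the node count halves the share of jobs hit by a single node failure, and also halves the amount of work that has to drain before one node can be removed.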

After discussion on IRC, we're going to settle on making them all large instances. I'm going to create 15 large precise and 5 large trusty instances and then we see how it goes.

Naming scheme is:

precise - tools-exec-12xx
trusty - tools-exec-14xx
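As a rough illustration of the naming scheme, here is a small Python sketch that builds host names from a release prefix and a two-digit index, matching names such as tools-exec-1201 created later in this task. The helper `exec_hostnames` and the `RELEASE_PREFIX` table are hypothetical and not part of any Toolforge tooling; treating 12 and 14 as the precise and trusty release numbers is an assumption.

```python
# Hypothetical helper for illustration; not part of any Toolforge tooling.
RELEASE_PREFIX = {"precise": "12", "trusty": "14"}  # assumed mapping


def exec_hostnames(release, first, last):
    """Return exec host names for indexes first..last (inclusive)."""
    prefix = RELEASE_PREFIX[release]
    return [f"tools-exec-{prefix}{i:02d}" for i in range(first, last + 1)]


if __name__ == "__main__":
    print(exec_hostnames("precise", 1, 3))
    print(exec_hostnames("trusty", 1, 3))
```

Keeping the release in the host name means the release of a node can be read straight from the name when pooling or depooling.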

Using this as an opportunity to fix T97445 and T95979

Alright, so I've created tools-exec-12{01-10} and tools-exec-14{01-10}. I've also pooled in tools-exec-14{01-05} and depooled almost all the old trusty nodes (except tools-exec-20, which has one 'task' still executing). Going to pool some precise nodes now.

Ok, so tools-exec-14{01-10} are pooled now, and so are tools-exec-12{01-10} :D All old trusty instances except tools-exec-20 are deleted as well.

So everything in tools-exec-{01-10} has been disabled and drained of continuous jobs.

We're going to need more nodes, I think. I'm going to add 10 more precise larges and 5 more trusty larges. Some of the nodes being decommed are xlarges too, while all of the new ones are larges.

Created tools-exec-121{1-9}, and just ran out of quota.

Created tools-exec-121{1-9} and pooled them :) Also drained tools-exec-1{1-5} of continuous jobs.

Things left to do:

  1. Wait for tools-exec-xx (anything with two digits) to have no running tasks, depool and delete them
  2. Add about 5 more trusty nodes.
  3. Decide what to do about the dedicated nodes.
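Step 1 of the list above amounts to a simple check per old host: delete it once nothing is running on it, otherwise keep waiting. A minimal Python sketch of that decision follows. It does not talk to gridengine; the function name is hypothetical, and the per-host task counts are made-up inputs that would have to come from the scheduler.

```python
def plan_decommission(running_tasks):
    """Split old hosts into those safe to delete and those still busy.

    running_tasks maps a host name to the number of tasks still running on it.
    """
    ready = sorted(host for host, count in running_tasks.items() if count == 0)
    waiting = sorted(host for host, count in running_tasks.items() if count > 0)
    return ready, waiting


if __name__ == "__main__":
    # Made-up task counts, for illustration only.
    snapshot = {"tools-exec-01": 0, "tools-exec-02": 3, "tools-exec-03": 0}
    ready, waiting = plan_decommission(snapshot)
    print("depool and delete:", ready)
    print("still waiting on:", waiting)
```

Running this check repeatedly (for example once a day) fits the "as gentle as possible" approach from the description: hosts are only removed after their tasks have finished on their own.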

I forgot to give the new instances public IPs, which was causing a bunch of failures for IRC bots. That has been remedied now with a lot of clicking.

When this is all done I'm going to write this up on the Admin docs.

Depooled and deleted tools-exec-20 :) So trusty is fully on the new hosts now.

I wonder how long we should give the currently running tasks before moving them.

Trusty webgrid has been expanded and replaced with new hosts.

I built new precise and generic webgrid hosts too, but they seem to be dead on arrival and are not running puppet at all for some strange reason (can't log in with the root key either?). Will investigate later.

Generic hosts migrated now. Just precise ones left :)

Done for all webgrid instances too \o/

Now we just figure out when to kill the rest of the running non-restartable task jobs.

tools-exec-02, -07, -08, -13, -14 and -15 are still in the @general host group; is that intended?

On tools-webgrid-03 and -08, I have killed a few php-cgi processes that were stuck in some gettimeofday loop; strace showed just

gettimeofday({1430577841, 593500}, NULL) = 0
gettimeofday({1430577841, 593544}, NULL) = 0
gettimeofday({1430577841, 593576}, NULL) = 0
gettimeofday({1430577841, 593620}, NULL) = 0

ad infinitum. I have deleted tools-webgrid-02, -03 and -08.

Yes, the remaining tools-exec hosts are still in the host group, but they have been disabled, so that is OK.

Everything except -07 and -08 is now gone.

This is done now, right?

yuvipanda closed this task as Resolved. (May 23 2015, 1:19 PM)

Yes :)

tools-mail still has to go, but this is taking longer due to incomplete puppetization (T97574: Provision and test tools-mailrelay-02). If you feel the virt hosts need the space taken by that host, we can speed it up, but it's only 20GB iirc.