
document the need and usage patterns for special exec hosts
Closed, Declined · Public

Description

We currently have a handful of special-purpose exec hosts (the thread below mentions tools-exec-wmt, tools-exec-catscan, tools-exec-cyberbot, and tools-exec-gift).

We need to document what the 'terms of service' for these hosts are. Can we reboot them at will? Re-create them? Why were they provisioned originally?

Event Timeline

valhallasw raised the priority of this task from to Medium.
valhallasw updated the task description. (Show Details)
valhallasw added a project: Toolforge.
valhallasw added a subscriber: valhallasw.
Restricted Application added a subscriber: Aklapper. · View Herald Transcript · May 14 2015, 10:27 AM
valhallasw updated the task description. (Show Details) · May 14 2015, 10:29 AM
valhallasw set Security to None.

The problems I see with special hosts in general are:

  1. No redundancy. If that host goes down, its jobs don't get automatically rescheduled elsewhere.
  2. Special cases make administration harder in general (as seen with the current reshuffling of exec nodes): you can't just drain a node, rebuild it, and repool it.
  3. They decrease utilization without increasing capacity. exec-catscan hardly ever has jobs running on it.

I suggest we eventually talk to the maintainers of the projects, find out their needs, and figure out how exactly we can support them without special nodes. If a project does need them, then fine - but we need to know the exact reasons and only keep special nodes in cases where the jobs can't run on the regular grid.

For the -wmt case, using advance reservations might also be an option: http://manpages.ubuntu.com/manpages/saucy/man1/qrsub.1.html. First reserve the total amount of memory required, then submit all tasks as part of that reservation. I'm not convinced that's an easier solution than the current setup of just having a dedicated queue, though.
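A rough sketch of that reservation workflow, based on the qrsub manpage linked above. The duration, slot count, memory value, parallel environment name `smp`, reservation id 17, and the task script path are all made-up illustration values, not actual Toolforge configuration:

```shell
# Reserve capacity up front for 12 hours: 16 slots with 512M each
# (hypothetical values; 'smp' is an assumed parallel environment name).
qrsub -d 12:00:00 -pe smp 16 -l h_vmem=512M -N wmt-reservation
# qrsub prints the granted advance-reservation id, e.g.:
#   "Your advance reservation 17 has been granted"

# Submit each task into that reservation instead of the general queues:
qsub -ar 17 -b y /path/to/task.sh

# Release the reservation early once all tasks have finished:
qrdel 17
```

The trade-off is the same one discussed above: a reservation pins capacity whether or not it is used, so it improves predictability for the tool at the cost of overall grid utilization.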

coren added a subscriber: coren. · May 14 2015, 12:59 PM

The reason most of those dedicated queues exist is that their use of resources didn't match the general allocation model: either the tool has a large number (up to 120) of very small jobs sharing code, which makes running them on a single node the only sane solution, or its usage pattern renders worst-case allocation very suboptimal.

In all cases, the tool maintainers were made aware that the increased flexibility/lowered limits meant that they lost redundancy and/or would have to deal with queue issues themselves.

@Magnus/@Cyberpower678/@Giftpflanze: could you add some context for the dedicated exec hosts in the task description? Thanks!

Giftpflanze updated the task description. (Show Details) · May 14 2015, 2:43 PM
valhallasw moved this task from Triage to In Progress on the Toolforge board. · May 14 2015, 6:43 PM
scfc added a subscriber: scfc. · Aug 21 2015, 5:19 PM

Besides the points @yuvipanda made, I'd like to add that the current setup seems unstable to me. Those hosts typically idle for most of the month and then burst into action, causing Puppet failures due to OOM - and if Puppet OOMs, I can't imagine that the jobs running there are living happily.

tools-exec-wmt is gone now (cf. T104919). As for tools-exec-catscan, I don't know which of the various tools with "catscan" in their names is supposed to use it, but it hasn't been used at all since /var/lib/gridengine/default/common/accounting was last purged. I'll file a subtask for decommissioning it.

Restricted Application added a project: Cloud-Services. · View Herald Transcript · Aug 21 2015, 5:19 PM

The last purge of the accounting log was

valhallasw@tools-precise-dev:~$ head -n 1 /var/lib/gridengine/default/common/accounting | cut -d: -f11
1394740992
valhallasw@tools-precise-dev:~$ date --date="@1394740992"
Thu Mar 13 20:03:12 UTC 2014

so it can probably be safely decommissioned.
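For reference, the two steps above can be combined into a single pipeline. The sketch below runs on a made-up colon-separated record rather than the real accounting file; only field 11, which the commands above read as a Unix epoch, matters here:

```shell
# A hypothetical accounting record: gridengine accounting lines are
# colon-separated, and field 11 here holds the timestamp from the thread.
record='q:host:grp:usr:job:1:acct:0:0:0:1394740992:x'

# Extract field 11 and render it as a UTC date (GNU date syntax):
epoch=$(printf '%s\n' "$record" | cut -d: -f11)
date -u --date="@${epoch}" +'%a %b %d %T UTC %Y'   # Thu Mar 13 20:03:12 UTC 2014
```

Against the real file, the `record=` line would be replaced by `head -n 1 /var/lib/gridengine/default/common/accounting`, exactly as in the transcript above.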

scfc updated the task description. (Show Details) · Dec 4 2016, 4:49 PM

(tools-exec-cyberbot is gone.)

scfc closed this task as Declined. · Feb 18 2017, 5:25 PM

Only tools-exec-gift is left (and will be replaced by a Trusty instance).

Regarding the initial questions: "We need to document what the 'terms of service' for these hosts are. Can we reboot them at will? Re-create them? Why were they provisioned originally?" I don't see any reason why the logic for other execution nodes would not apply here: reboot when necessary, recreate when necessary; it was originally provisioned per the reasoning in T60949. There is nothing special about tools-exec-gift, and we never treated it any differently.

So rather than guessing how to document this non-thing, I'm closing this task as declined :-).