Evaluate gridengine's use of NFS and (possibly) move it to a different volume
Closed, Declined · Public

Description

Right now, gridengine appears to be an unreasonably high hit on the tools volume. We need to

  • Make certain gridengine's use of NFS is as limited as possible
  • Move it off the tools volume if required and useful

Event Timeline

coren raised the priority of this task from to High.
coren updated the task description.
coren subscribed.
coren renamed this task from "Evaluate gridengine's use of NFS and (possibly) move it to a different module" to "Evaluate gridengine's use of NFS and (possibly) move it to a different volume". (Sep 8 2015, 3:46 PM)
coren set Security to None.

Right now, the only shared directory between gridengine nodes is /var/lib/gridengine which contains (a) the job spools, (b) the shared configuration and (c) log and journals.

Technically, this is the minimal shared filesystem for proper shadow master (failover) operation, and while logs and journals could be separated out, they are neither particularly large nor especially high volume so the benefit of the added complexity would be unclear.

It looks like, at this point, the next step should be to spin the gridengine filesystem away from the general tools volume onto its own and export that. If nothing else, this will insulate traffic and allow us to more precisely determine what the actual respective loads are.

@coren: Moving off spool directories would be a significant reduction in traffic I think. http://gridscheduler.sourceforge.net/howto/nfsreduce.html supports that, and shows very little drawback. What am I missing?

@mark: The biggest disadvantage is the loss of history - right now, we can add and remove nodes on the grid freely without losing logs and history. (As a side effect, we also lose easy access to logs from the bastions, but that is a lesser issue).

It's perfectly reasonable to decide that this is a price worth paying - or to put in place a process or mechanism to preserve this data when a grid node is decommissioned - but I would think that being able to actually measure the actual traffic this causes in practice should be our first step. (And there are considerable advantages to splitting off the gridengine filesystem anyways, for performance and reliability).

@coren: I did measure that traffic last week, and it was a relatively high share then. I'm not sure why you're saying it's not particularly high volume if you didn't measure it?

(Just to clarify things, when I stated "neither particularly large nor especially high volume" above that applied to the logs and journals only, not the job spools which I expect are where most of the actual load is)

Oh, wait, I just noticed a possible cause of miscommunication: I am speaking about the spool DB, and the document you refer to is talking about the spool directories. The latter are already not on NFS; they live where execd_spool_dir points, which is /var/spool/gridengine/execd - local to the instances.

We are already using the "local executable files" setup from that document (which is a superset of "local spool directories").
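
For reference, here is a minimal sketch of how one might confirm that from any submit or admin host, assuming `qconf` is on the PATH and the usual SGE environment (SGE_ROOT etc.) is set up; nothing in it is specific to this cluster:

```python
#!/usr/bin/env python
# Minimal sketch: print where execd spooling actually happens, assuming
# `qconf` is on PATH and the SGE environment (SGE_ROOT etc.) is configured.
import subprocess

def get_global_conf():
    """Return the global gridengine configuration as a dict."""
    out = subprocess.check_output(['qconf', '-sconf'])
    conf = {}
    for line in out.decode().splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            conf[parts[0]] = parts[1].strip()
    return conf

if __name__ == '__main__':
    conf = get_global_conf()
    # If this points at local disk (e.g. /var/spool/gridengine/execd),
    # the exec spool directories are already off NFS.
    print('execd_spool_dir =', conf.get('execd_spool_dir'))
```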

coren moved this task from To do to Doing on the Labs-Sprint-114 board.
coren moved this task from Doing to To do on the Labs-Sprint-114 board.
coren moved this task from To do to Doing on the Labs-Sprint-114 board.

Currently, /var/lib/gridengine is a bind mount of /data/project/.system/gridengine on all nodes, which itself is just a subdirectory of the /data/project NFS mount.
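
As a small aside, the current arrangement can be verified on any node by parsing /proc/self/mountinfo, which (unlike /proc/mounts) exposes the bind-mounted subtree; a sketch:

```python
#!/usr/bin/env python
# Sketch: show what is actually mounted at /var/lib/gridengine by parsing
# /proc/self/mountinfo (its "root" field exposes the bind-mounted subtree).
MOUNTPOINT = '/var/lib/gridengine'

with open('/proc/self/mountinfo') as f:
    for line in f:
        fields = line.split()
        sep = fields.index('-')            # separates mandatory and fs-specific fields
        root, mountpoint = fields[3], fields[4]
        fstype, source = fields[sep + 1], fields[sep + 2]
        if mountpoint == MOUNTPOINT:
            print('%s is %s from %s, subtree %s' % (mountpoint, fstype, source, root))
```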

After some local testing, the following is known about switching the filesystem from under the mountpoint:

  • Exec nodes do not keep open files on the filesystem; they access it only at task start and completion to update the journal and accounting files
  • Administrative and submit nodes (various, mostly bastions) only fetch configuration from the mount point at invocation of the command-line tools and run no daemon
  • The shadow master polls the configuration and heartbeat files at regular intervals, but keeps the files unopened (unless/until it switches to the active master role, in which case see below)
  • The active master keeps the spool DB open while it runs, and performs very frequent I/O to logs and journals.

In practice, this means that any switch of the filesystem that does not leave the mountpoint empty will work for all but the active master, with the caveat that any accounting/journal entries written by exec nodes between the time of the copy and the mount of the new filesystem will have gone to the older copy and will be lost. sge_execd daemons need not be restarted for the switch to occur[1].

Submit and administrative nodes only read from the configuration, so provided the old and new filesystems are consistent, the switch will have no effect on them.

The gridengine master, on the other hand, must be stopped entirely during the switch as it holds open filehandles to a read/write BDB.
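
Before the switch it is easy to double-check which daemons actually hold files open under /var/lib/gridengine on a given node; a rough sketch using nothing but /proc (run as root):

```python
#!/usr/bin/env python
# Sketch: list processes holding files open under /var/lib/gridengine, e.g.
# to confirm the shadow master has nothing open and only sge_qmaster on the
# active master needs stopping. Pure /proc walking, no external tools.
import os

TARGET = '/var/lib/gridengine'

for pid in filter(str.isdigit, os.listdir('/proc')):
    fd_dir = '/proc/%s/fd' % pid
    try:
        fds = os.listdir(fd_dir)
    except OSError:                         # process exited or access denied
        continue
    for fd in fds:
        try:
            path = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue
        if path.startswith(TARGET):
            with open('/proc/%s/comm' % pid) as f:
                comm = f.read().strip()
            print('%s (pid %s) has %s open' % (comm, pid, path))
```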

A workable switch plan would be as follows (a rough command-level sketch of steps 3-7 follows the list):

  1. create the new volume and filesystem (it can be very small: the filesystem uses less than 200M of files, plus however much space we want to keep available for logs)
  2. export filesystem
  3. make a preliminary copy of the contents of the old filesystem to the new one (will not be consistent)
  4. stop master
  5. resync the new filesystem - will be very quick (< 5m)
  6. umount the old filesystem and remount the new one on master
  7. restart master
  8. push puppet change of which filesystem to mount on /var/lib/gridengine (in case any instance reboots before the process is complete)
  9. (with salt?) mount the new filesystem on top of the old one on every other node
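
For illustration only, a rough sketch of steps 3-7 as they would run on the active master; the export name and the gridengine-master init script name are assumptions to be adjusted, not a finished runbook:

```python
#!/usr/bin/env python
# Rough sketch of steps 3-7 of the plan above, run on the active master.
# NEW_FS and the 'gridengine-master' service name are assumptions.
import subprocess

OLD = '/var/lib/gridengine'
NEW_FS = 'labstore.example:/gridengine'    # hypothetical export
STAGING = '/mnt/gridengine-new'

def run(*cmd):
    print('+', ' '.join(cmd))
    subprocess.check_call(cmd)

# step 3: preliminary (inconsistent) copy while the grid keeps running
run('mkdir', '-p', STAGING)
run('mount', '-t', 'nfs', NEW_FS, STAGING)
run('rsync', '-a', '--delete', OLD + '/', STAGING + '/')

# step 4: stop the master so the BDB spool is quiescent
run('service', 'gridengine-master', 'stop')

# step 5: final resync, now consistent
run('rsync', '-a', '--delete', OLD + '/', STAGING + '/')

# step 6: swap the mounts on the master
run('umount', STAGING)
run('umount', OLD)                          # drops the old bind mount
run('mount', '-t', 'nfs', NEW_FS, OLD)

# step 7: bring the master back
run('service', 'gridengine-master', 'start')
```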

Outages:

  • new jobs cannot be started and gridengine command-line tools will fail between the start of step 4 and the end of step 7
  • accounting information for jobs ending between the start of step 5 and the end of step 9 will be lost as they will have been written to the old filesystem.

Mounting the new filesystem atop the old one will hide the existing bind mount (making it unavailable until the next reboot), but this has no cost: the actual filesystem (/data/project) is already mounted.

[1] It might be possible to suspend the sge_execd on the exec nodes between steps 4 and 5 and resume them at the conclusion of step 9 to save possible log entries (they will reap the ended processes on resuming), but this is hard to test and probably not worthwhile.

I will admit to having absolutely zero confidence in GridEngine doing anything that involves NFS the way you'd expect it to (see T109362) and would prefer that we:

  1. Do this only if there's a reasonably high chance this will actually fix something that's an active bottleneck (IO on a shelf?)
  2. If we are going to do this, make sure we announce a broad window and do it in the least clever, most dumb way possible (if this involves draining jobs and restarting SGE, we should do that)

(If you think that T109362 is related to NFS, please explain there.)

I agree with both your points. This would be an expensive operation and there should be a clear reasoning to do that. In addition, mounting three file systems on top of each other (IIUC) may be technically totally sound, but IMHO is an unnecessary complication (I'm not sure if Puppet would play along, for example).

I think the better approach (if there is a problem to be solved) would be to set up a second grid engine master (with whatever configuration is optimal), test it extensively, and then, on day X, stop all jobs, point all hosts to the new master and have a configuration proven to work.

(I think the underlying problem is the flaky NFS service in Labs. It is one of the most common services in serverland, more tried and tested than many others, yet in Labs it is one of the weakest links. IMHO trying to decommission NFS in Labs isn't a solution; instead, the setup in Labs needs to be hardened, and it's nice to see some improvements in that area, like physically connecting only one server to the rack at any time.)

Given that gridengine's traffic accounts for only roughly 20% of the tools volume's load, and that the volume itself hovers at only around 15% utilization, it's unlikely that gridengine is the cause of the odd disk behaviour, and moving it to a different volume is unlikely to be of use.

Keeping this ticket open at @mark's request until investigation of the underlying issue is complete.

After gathering data twice a day for a couple of days, I am now convinced there is no issue to solve - at least at this time. The blktrace runs over those periods show no major discrepancy between the drives of the array (that is, between the mirrors in any pair of drives), with writes and reads balanced within roughly 10% of each other at each sample time; the iostat figures over the periods where blktrace was running show the same. In all cases, the traffic is driven mostly by nfsd, with a significant fraction (~15%) of writes coming from kworker threads - consistent with dirty blocks being flushed.

In other words, current operation is perfectly within expected bounds.
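
(For the record, the per-interval comparison needs nothing more than /proc/diskstats sampled twice; a minimal sketch, with placeholder device names standing in for the actual members of a mirror pair:)

```python
#!/usr/bin/env python
# Sketch: sample /proc/diskstats twice and report per-interval read/write
# volume for each half of a mirror pair. Device names are placeholders.
import time

DEVICES = ('sdc', 'sdd')          # placeholder mirror pair
INTERVAL = 600                    # seconds between samples
SECTOR = 512                      # /proc/diskstats counts 512-byte sectors

def snapshot():
    stats = {}
    with open('/proc/diskstats') as f:
        for line in f:
            fields = line.split()
            if fields[2] in DEVICES:
                # field 5: sectors read, field 9: sectors written
                stats[fields[2]] = (int(fields[5]), int(fields[9]))
    return stats

before = snapshot()
time.sleep(INTERVAL)
after = snapshot()

for dev in DEVICES:
    read_mb = (after[dev][0] - before[dev][0]) * SECTOR / 1e6
    written_mb = (after[dev][1] - before[dev][1]) * SECTOR / 1e6
    print('%s: %.1f MB read, %.1f MB written in %ds' % (dev, read_mb, written_mb, INTERVAL))
```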

There is a very large discrepancy in the since-boot stats between the two halves of the array - but the average difference between them has remained at roughly the same absolute value (2200G read) since I started gathering data, and while the difference in the other direction (writes to the drives that were read from less) is smaller in magnitude, it is also significant.

My conclusion is that the discrepancy is the result of a mirror being rebuilt. 2.2T is close to the 1.8T disk size, indicating a single read of each entire drive, and the slightly smaller volume of writes to the opposite drives is consistent with that read data then being written out, except for stripes which had since been touched (as one would expect if the drives were actively being written to while the array was being rebuilt).
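
(A quick way to check whether a rebuild/resync is, or was recently, running on the storage host, which would corroborate the above; the md device name is a placeholder:)

```python
#!/usr/bin/env python
# Sketch: check the md layer for an ongoing resync/recovery.
with open('/proc/mdstat') as f:
    print(f.read())

# Per-array sync state is also exposed under /sys; 'idle' means no
# resync/recovery is currently running. 'md0' is a placeholder name.
with open('/sys/block/md0/md/sync_action') as f:
    print('md0 sync_action:', f.read().strip())
```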