
Toolforge grid automation: consider creating a cookbook to heal the grid from D state procs
Closed, Declined (Public)

Description

@dcaro found https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html, which explains why a process in D state (uninterruptible sleep) increases the load average on Linux servers: Linux counts such tasks in the load average alongside runnable ones.

If an otherwise harmless NFS hiccup results in D state processes on the exec nodes, the load average goes up as a result. If the grid also schedules jobs based on grid load average (just a theory at this point), then the failure mode is clear:

Any NFS hiccup (otherwise harmless) can result in the Grid becoming unavailable and/or unreliable.

We may consider creating a cookbook that scans the grid for D state procs and reboots affected nodes as an automated healing mechanism.
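
As a rough illustration of the detection step only, here is a minimal Python sketch that lists D state processes on a single host by parsing `ps` output. The function names `find_d_state` and `scan_local_host` are hypothetical and not part of any existing cookbook. How the check would be run across exec nodes, and when a node would be rebooted, is left open.

```python
import subprocess


def find_d_state(ps_output: str) -> list[tuple[int, str]]:
    """Return (pid, command) for every process whose state starts with 'D'.

    Expects headerless `ps -eo pid=,stat=,comm=` output: one process per
    line, with whitespace-separated pid, state and command name.
    """
    stuck = []
    for line in ps_output.splitlines():
        # Split into at most 3 fields so command names with spaces survive.
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue  # blank or malformed line
        pid, stat, comm = fields
        # The first character of STAT is the process state;
        # 'D' means uninterruptible sleep.
        if pid.isdigit() and stat.startswith("D"):
            stuck.append((int(pid), comm.strip()))
    return stuck


def scan_local_host() -> list[tuple[int, str]]:
    """Run ps on the local host and return its D state processes."""
    result = subprocess.run(
        ["ps", "-eo", "pid=,stat=,comm="],
        capture_output=True,
        text=True,
        check=True,
    )
    return find_d_state(result.stdout)
```

Processes enter D state briefly during normal I/O, so a single sample is not enough to declare a node unhealthy. A real check would probably require the same PIDs to stay in D state across several samples before considering a reboot.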