Add monitoring for disabled grid nodes to the prometheus collector
Closed, Resolved · Public

Description

There's nothing that collects the state of grid queues (and thus nodes) in Toolforge. We've been surprised before to find several queues disabled because of errors on the nodes. A queue in this sense in SGE is a host within a specific queue context (such as task@tools-sgeexec-0901.tools.eqiad.wmflabs).

Collecting that info so it can be displayed on tools-basic-alerts or similar should help.

So far we've found that errors in the webservice script itself (since it is involved in job submission and such) are enough to drop a queue, and LDAP errors on job submission will do the same, putting the queue into the "e" state, which is just as useless as the "d" (depooled) state or the "au" (unreachable) state. If that happens repeatedly across various queue/host combinations, it can take the entire grid offline over time.

Remediation for error states is documented at: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin#Clearing_error_state
However, an error state can be cleared as simply as running qmod -c '*' if you aren't worried about troubleshooting first.
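As a non-interactive version of that reset, here's a minimal sketch assuming Python 3 on a grid host where qmod is on the PATH and the caller has manager rights; the function name is illustrative:

# Minimal sketch: clear the SGE error state on all queue instances, per the
# remediation doc above. Assumes qmod is on PATH and the caller has manager
# rights; the name clear_all_error_states is illustrative.
import subprocess

def clear_all_error_states():
    # 'qmod -c' clears the (E)rror state; '*' matches every queue instance.
    subprocess.run(["qmod", "-c", "*"], check=True)

if __name__ == "__main__":
    clear_all_error_states()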

We need at least an email or mention on tools-basic-alerts if the number of available node queues is declining (and if any are in a persistent "e" state).

Event Timeline

Bstorm triaged this task as High priority.
Bstorm created this task.

Apparently, LDAP timeouts under certain conditions can leave a queue in the "e" state, which is effectively disabled (T217280#5007467).

Bstorm moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.
Bstorm updated the task description.

Using qstat -explain aAcE -xml may be the best way to find these (a parsing sketch follows the option list below). Here's a sub-record showing a queue in the E state:

<Queue-List>
  <name>task@tools-sgeexec-0923.tools.eqiad.wmflabs</name>
  <qtype>BI</qtype>
  <slots_used>0</slots_used>
  <slots_resv>0</slots_resv>
  <slots_total>50</slots_total>
  <load_avg>0.93000</load_avg>
  <arch>lx-amd64</arch>
  <state>E</state>
  <message>queue task marked QERROR as result of job 622967&apos;s failure at host tools-sgeexec-0923.tools.eqiad.wmflabs</message>
</Queue-List>

Here's a normal state sub-record:

<Queue-List>
  <name>task@tools-sgeexec-0924.tools.eqiad.wmflabs</name>
  <qtype>BI</qtype>
  <slots_used>5</slots_used>
  <slots_resv>0</slots_resv>
  <slots_total>50</slots_total>
  <load_avg>2.10000</load_avg>
  <arch>lx-amd64</arch>
</Queue-List>

<state/> only shows up in the records of queue nodes that are in some type of distress. The -explain options break down as follows:

  • ‘c’ displays the reason for the c(onfiguration ambiguous) state of a queue instance.
  • ‘a’ displays the reason for the alarm state.
  • ‘A’ displays the reason for the suspend alarm state.
  • ‘E’ displays the reason for a queue instance error state.
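
Here's a minimal parsing sketch for that output, assuming Python 3 with only the standard library; the function name and report format are illustrative, not from the actual collector:

# Minimal sketch: run qstat as above and report queue instances whose
# sub-record carries a <state> element (i.e. queues in some distress).
# Assumes qstat is on PATH; names are illustrative.
import subprocess
import xml.etree.ElementTree as ET

def problem_queues():
    out = subprocess.run(
        ["qstat", "-explain", "aAcE", "-xml"],
        capture_output=True, text=True, check=True,
    ).stdout
    problems = {}
    for qlist in ET.fromstring(out).iter("Queue-List"):
        state = qlist.findtext("state")
        if state:  # <state> only appears for distressed queues
            problems[qlist.findtext("name")] = (state, qlist.findtext("message", ""))
    return problems

if __name__ == "__main__":
    for name, (state, message) in problem_queues().items():
        print(f"{name} [{state}]: {message}")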

Yeah. I was poking at it from the per-host end, but from what I can tell you cannot get the reason out of qhost, only out of qstat.

Change 495262 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] gridengine: Add prometheus monitor for queue/host health

https://gerrit.wikimedia.org/r/495262

Change 495262 merged by Bstorm:
[operations/puppet@production] gridengine: Add prometheus monitor for queue/host health

https://gerrit.wikimedia.org/r/495262

The Prometheus metrics for this are sge_queueproblems and sge_disabledqueues.
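
For reference, a hedged sketch of how gauges like these could be written out via the node-exporter textfile collector pattern; the output path and the exact metric semantics are assumptions here, and the real implementation is in the puppet change above:

# Hedged sketch: expose counts as gauges for the node-exporter textfile
# collector. The metric names come from the comment above; their exact
# semantics and the output path are assumptions, not from the merged patch.
from prometheus_client import CollectorRegistry, Gauge, write_to_textfile

registry = CollectorRegistry()
disabled = Gauge(
    "sge_disabledqueues",
    "Queue instances in a disabled/depooled state (assumed semantics)",
    registry=registry,
)
problems = Gauge(
    "sge_queueproblems",
    "Queue instances in an error or alarm state (assumed semantics)",
    registry=registry,
)

# In a real collector these values would come from parsing qstat output,
# as sketched earlier in this task.
disabled.set(2)
problems.set(1)

# Hypothetical textfile-collector path.
write_to_textfile("/var/lib/prometheus/node.d/sge.prom", registry)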