Add monitoring for disabled grid nodes to the prometheus collector
Closed, Resolved · Public

Description

There's nothing that collects the state of grid queues (and thus nodes) in Toolforge. We've been surprised before to find several queues disabled because of errors on the nodes. A queue in this sense in SGE is a host within a specific queue context (such as task@tools-sgeexec-0901.tools.eqiad.wmflabs).

Collecting that info so it can be displayed on tools-basic-alerts or similar should help.

So far we've found that errors in the webservice script itself (since it is involved in job submission and such) are enough to drop a queue, and LDAP errors on job submission will do the same, putting the queue into the "e" state, which is just as useless as the "d" (depooled) state or the "au" (unreachable) state. If that happens repeatedly across various queue/host combinations, it can take the entire grid offline over time.

Remediation for error states is documented at: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin#Clearing_error_state
However, an error state can be cleared as simply as running qmod -c '*' if you aren't worried about troubleshooting first.
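As a non-interactive version of that reset, here's a minimal sketch assuming Python 3 on a grid host where qmod is on the PATH and the caller has manager rights; the function name is illustrative:

# Minimal sketch: clear the SGE error state on all queue instances, per the
# remediation doc above. Assumes qmod is on PATH and the caller has manager
# rights; the name clear_all_error_states is illustrative.
import subprocess

def clear_all_error_states():
    # 'qmod -c' clears the (E)rror state; '*' matches every queue instance.
    subprocess.run(["qmod", "-c", "*"], check=True)

if __name__ == "__main__":
    clear_all_error_states()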

We need at least an email or mention on tools-basic-alerts if the number of available node queues is declining (and if any are in a persistent "e" state).

Event Timeline

Bstorm triaged this task as High priority.
Bstorm created this task.

Apparently, LDAP timeouts under certain conditions can leave a queue in the "e" state, which is effectively disabled (T217280#5007467).

Bstorm moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.
Bstorm updated the task description.

Using qstat -explain aAcE -xml may be the best way to find these (a parsing sketch follows the option list below). Here's a sub-record showing a queue in the E state:

<Queue-List>
  <name>task@tools-sgeexec-0923.tools.eqiad.wmflabs</name>
  <qtype>BI</qtype>
  <slots_used>0</slots_used>
  <slots_resv>0</slots_resv>
  <slots_total>50</slots_total>
  <load_avg>0.93000</load_avg>
  <arch>lx-amd64</arch>
  <state>E</state>
  <message>queue task marked QERROR as result of job 622967&apos;s failure at host tools-sgeexec-0923.tools.eqiad.wmflabs</message>
</Queue-List>

Here's a normal state sub-record:

<Queue-List>
  <name>task@tools-sgeexec-0924.tools.eqiad.wmflabs</name>
  <qtype>BI</qtype>
  <slots_used>5</slots_used>
  <slots_resv>0</slots_resv>
  <slots_total>50</slots_total>
  <load_avg>2.10000</load_avg>
  <arch>lx-amd64</arch>
</Queue-List>

<state/> only shows up in the records of queue nodes that are in some type of distress. The -explain options break down as follows:

  • ‘c’ displays the reason for the c(onfiguration ambiguous) state of a queue instance.
  • ‘a’ displays the reason for the alarm state.
  • ‘A’ displays the reason for the suspend alarm state.
  • ‘E’ displays the reason for a queue instance error state.
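
Here's a minimal parsing sketch for that output, assuming Python 3 with only the standard library; the function name and report format are illustrative, not from the actual collector:

# Minimal sketch: run qstat as above and report queue instances whose
# sub-record carries a <state> element (i.e. queues in some distress).
# Assumes qstat is on PATH; names are illustrative.
import subprocess
import xml.etree.ElementTree as ET

def problem_queues():
    out = subprocess.run(
        ["qstat", "-explain", "aAcE", "-xml"],
        capture_output=True, text=True, check=True,
    ).stdout
    problems = {}
    for qlist in ET.fromstring(out).iter("Queue-List"):
        state = qlist.findtext("state")
        if state:  # <state> only appears for distressed queues
            problems[qlist.findtext("name")] = (state, qlist.findtext("message", ""))
    return problems

if __name__ == "__main__":
    for name, (state, message) in problem_queues().items():
        print(f"{name} [{state}]: {message}")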

Yeah. I was poking at it from the per-host end, but from what I can tell you cannot get the reason out of qhost, only out of qstat.

Change 495262 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] gridengine: Add prometheus monitor for queue/host health

https://gerrit.wikimedia.org/r/495262

Change 495262 merged by Bstorm:
[operations/puppet@production] gridengine: Add prometheus monitor for queue/host health

https://gerrit.wikimedia.org/r/495262

The Prometheus metrics for this are sge_queueproblems and sge_disabledqueues.
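
For reference, a hedged sketch of how gauges like these could be written out via the node-exporter textfile collector pattern; the output path and the exact metric semantics are assumptions here, and the real implementation is in the puppet change above:

# Hedged sketch: expose counts as gauges for the node-exporter textfile
# collector. The metric names come from the comment above; their exact
# semantics and the output path are assumptions, not from the merged patch.
from prometheus_client import CollectorRegistry, Gauge, write_to_textfile

registry = CollectorRegistry()
disabled = Gauge(
    "sge_disabledqueues",
    "Queue instances in a disabled/depooled state (assumed semantics)",
    registry=registry,
)
problems = Gauge(
    "sge_queueproblems",
    "Queue instances in an error or alarm state (assumed semantics)",
    registry=registry,
)

# In a real collector these values would come from parsing qstat output,
# as sketched earlier in this task.
disabled.set(2)
problems.set(1)

# Hypothetical textfile-collector path.
write_to_textfile("/var/lib/prometheus/node.d/sge.prom", registry)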