Add monitoring for expected load issues on tool labs exec nodes
Closed, Declined · Public

Description

See also T90542

We currently have no monitoring for the availability of exec nodes. This means we are not notified when a queue is overwhelmed and, for example, no more webservice processes can start.

See also https://wikitech.wikimedia.org/wiki/Incident_documentation/20150817-ToolLabs-WebgridOutage

Event Timeline

valhallasw set the priority of this task to Needs Triage.
valhallasw updated the task description.
valhallasw added a project: Toolforge.
valhallasw subscribed.
Restricted Application added a subscriber: Aklapper.

In T50668, I suggested the following checks for this:

  • the count of jobs in error state doesn't exceed 5 % of all running jobs,
  • the count of pending jobs doesn't exceed 5 % of all running jobs.
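
A minimal sketch of the two checks, assuming the job states have already been collected as Grid Engine state codes (e.g. `r`, `qw`, `Eqw`). The function names `classify` and `check_queue_health` and the mapping from state codes to buckets are hypothetical, for illustration only; nothing like this exists on the grid as far as this task shows.

```python
from collections import Counter
from typing import Iterable, List

THRESHOLD = 0.05  # 5 %, as suggested in the checks above


def classify(state: str) -> str:
    """Map a Grid Engine state code to a coarse bucket (assumed mapping)."""
    if "E" in state:       # e.g. "Eqw": job is in error state
        return "error"
    if "qw" in state:      # e.g. "qw", "hqw": job is waiting in the queue
        return "pending"
    if "r" in state:       # e.g. "r", "Rr": job is running
        return "running"
    return "other"


def check_queue_health(states: Iterable[str],
                       threshold: float = THRESHOLD) -> List[str]:
    """Return one alert message per violated check (empty list if healthy)."""
    counts = Counter(classify(s) for s in states)
    running = counts["running"]
    # Compare against an absolute limit so that zero running jobs
    # does not cause a division by zero.
    limit = threshold * running
    alerts = []
    for bucket in ("error", "pending"):
        if counts[bucket] > limit:
            alerts.append(
                f"{bucket}: {counts[bucket]} jobs exceed "
                f"{threshold:.0%} of {running} running jobs"
            )
    return alerts


if __name__ == "__main__":
    # Hypothetical sample: 100 running, 6 in error, 5 pending.
    sample = ["r"] * 100 + ["Eqw"] * 6 + ["qw"] * 5
    for message in check_queue_health(sample):
        print(message)
```

With no running jobs at all, any job in error or pending state triggers an alert, which seems like the right behaviour for an overwhelmed queue.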
valhallasw set Security to None.
valhallasw moved this task from Backlog to Ready to be worked on on the Toolforge board.
valhallasw added a subscriber: yuvipanda.
chasemp renamed this task from Add monitoring for expected load issues to Add monitoring for expected load issues on tool labs exec nodes. · Nov 30 2015, 6:42 PM
dcaro subscribed.

The Grid is going away soon, so there is no need to add more monitoring.