
Add monitoring for expected load issues on tool labs exec nodes
Open, High, Public

Description

See also T90542

We currently have no monitoring for availability of exec nodes. This means we are not notified if a queue is overwhelmed and e.g. no more webservice processes can start.

See also https://wikitech.wikimedia.org/wiki/Incident_documentation/20150817-ToolLabs-WebgridOutage

Event Timeline

valhallasw set the priority of this task to Needs Triage.
valhallasw updated the task description.
valhallasw added a project: Toolforge.
valhallasw added a subscriber: valhallasw.
Restricted Application added a subscriber: Aklapper.

In T50668, I suggested monitoring that:

  • the count of jobs in error state doesn't exceed 5 % of all running jobs,
  • the count of pending jobs doesn't exceed 5 % of all running jobs.
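
The two thresholds above could be sketched roughly as follows. This is only an illustration: the names QueueCounts and check_queue are hypothetical, and it assumes the running, error, and pending job counts have already been collected from the grid by some other means. Integer arithmetic is used so the 5 % comparison is exact.

```python
from dataclasses import dataclass

# 5 %, the threshold suggested in the comment above.
MAX_PERCENT = 5


@dataclass
class QueueCounts:
    """Hypothetical container for job counts gathered elsewhere."""
    running: int
    error: int
    pending: int


def check_queue(counts, max_percent=MAX_PERCENT):
    """Return a list of alert messages; an empty list means both checks pass.

    A count "exceeds" the threshold when it is strictly greater than
    max_percent of the running jobs. With zero running jobs, any job in
    error or pending state triggers an alert.
    """
    alerts = []
    limit = counts.running * max_percent  # compared against count * 100
    if counts.error * 100 > limit:
        alerts.append(
            f"error: {counts.error} jobs in error state exceed "
            f"{max_percent} % of {counts.running} running jobs"
        )
    if counts.pending * 100 > limit:
        alerts.append(
            f"pending: {counts.pending} pending jobs exceed "
            f"{max_percent} % of {counts.running} running jobs"
        )
    return alerts
```

For example, check_queue(QueueCounts(running=100, error=6, pending=0)) would return a single alert for the error check, while exactly 5 jobs in either state would still pass. A monitoring check could map a non-empty result to a warning or critical status.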
valhallasw set Security to None.
valhallasw moved this task from Triage to Backlog on the Toolforge board.
valhallasw added a subscriber: yuvipanda.
chasemp renamed this task from Add monitoring for expected load issues to Add monitoring for expected load issues on tool labs exec nodes. Nov 30 2015, 6:42 PM