Failed power supply redundancy on wdqs2006
Closed, Resolved. Public.

Description

For about 11 hours now, there has been an Icinga alert reporting that the power supply of wdqs2006 is no longer redundant:

Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]

Event Timeline

Note: this is one of the new wdqs servers, not in service yet.

Gehel added subscribers: Papaul, Cmjohnson.

Strange... according to T188432 the new wdqs servers are wdqs100[7-9]. The current wdqs cluster in eqiad is [[ https://github.com/wikimedia/puppet/blob/production/manifests/site.pp#L2115-L2117 | wdqs100[3-5] ]]. So what is this wdqs1006?

After another check in Icinga: the issue is actually on wdqs2006, which is one of the new wdqs servers in codfw (T187800). I'm assigning this to @Papaul instead of @Cmjohnson.

Gehel renamed this task from Failed power supply redundancy on wdqs1006 to Failed power supply redundancy on wdqs2006. Feb 28 2018, 1:38 PM
Gehel updated the task description.

The power supply was not seated properly. Fixed.