The host itself looks ok
Description
Event Timeline
From time to time the mgmt/iDRAC becomes unresponsive; we will need to power off the host for 10-30secs. Please depool this server and we'll take care of it.
depooled
17:09 <+logmsgbot> !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1290.eqiad.wmnet
Downtimed in Icinga for an hour: mgmt, the server, and all services on them.
On the first attempt the server came back just fine, but mgmt was not fixed yet. We are trying a second time and will leave it off longer.
Mentioned in SAL (#wikimedia-operations) [2019-09-30T21:47:55Z] <mutante> mw1290 - scap pull to get it in sync with latest deployment - it was down during scap run for T234153
@Dzahn the server needs to be powered off and have its power removed. Can you depool it again and leave it depooled for 24 hours, please? I will update the task once complete.
Mentioned in SAL (#wikimedia-operations) [2019-10-03T19:40:37Z] <mutante> mw1290 - depooled and scheduled downtime in Icinga for hardware maintenance T234153
@Jclark-ctr checked on this (thanks!), but this still needs to happen. One minute I could SSH to it just fine, and 12 minutes later it was alerting in Icinga again. So it keeps being "from time to time", and Chris' comment "we will need to power off the host for 10-30secs." still stands.
I checked: it had already been repooled by somebody in the meantime, and I ACKed the alert in Icinga.
Agreed with Jclark that we can do it next week when he is back onsite. It's not urgent.
Flea power was drained by Jclark.
I can ssh to mgmt (faster than before). Looking good so far.
Works for now. If we see it again we will just reopen this.
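Since the problem only showed up "from time to time", a small probe could help confirm whether mgmt stays reachable before deciding to reopen. This is only a rough sketch, not what was used on this task: it assumes the mgmt interface accepts SSH on TCP port 22, and the function names (`tcp_reachable`, `count_flaps`, `watch`) and the hostname in the usage note are hypothetical.

```python
import socket
import time


def tcp_reachable(host, port=22, timeout=5.0):
    """Try one TCP connect to host:port; return (ok, seconds_taken)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False, time.monotonic() - start


def count_flaps(samples):
    """Count reachable/unreachable transitions in a list of booleans."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


def watch(host, port=22, attempts=5, interval=60.0, timeout=5.0):
    """Probe host:port several times and return the list of results."""
    results = []
    for i in range(attempts):
        ok, elapsed = tcp_reachable(host, port, timeout)
        results.append(ok)
        print(f"{host}:{port} reachable={ok} after {elapsed:.2f}s")
        if i < attempts - 1:
            time.sleep(interval)
    return results
```

For example, `count_flaps(watch("mw1290.mgmt.eqiad.wmnet", attempts=30))` would give a rough idea of how often the interface drops over half an hour (the mgmt hostname here is an assumption, not taken from this task). A TCP connect only shows that the port answers, not that a login actually works.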