
Can't SSH to mw1290.mgmt
Closed, Resolved (Public)

Description

The host itself looks OK; only SSH to the mgmt interface fails.

Event Timeline

From time to time the mgmt/iDRAC becomes unresponsive; we will need to power off the host for 10-30secs. Please depool this server and we'll take care of it.

depooled

17:09 <+logmsgbot> !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1290.eqiad.wmnet

Dzahn triaged this task as Medium priority.

Downtimed in Icinga for an hour: mgmt, the server, and all services on them.

On the first attempt the server came back just fine, but mgmt was not fixed yet. We are trying a second time and will leave it off longer.

Mentioned in SAL (#wikimedia-operations) [2019-09-30T21:47:55Z] <mutante> mw1290 - scap pull to get it in sync with latest deployment - it was down during scap run for T234153

Unfortunately, rebooting it did not fix mgmt. The server is currently pooled again.

@Dzahn the server needs to be powered off and have its power removed. Can you depool it again and leave it depooled for 24 hours, please? I will update the task once complete.

@Cmjohnson Depooled and scheduled an Icinga downtime for about 2 days. Go ahead.

Mentioned in SAL (#wikimedia-operations) [2019-10-03T19:40:37Z] <mutante> mw1290 - depooled and scheduled downtime in Icinga for hardware maintenance T234153

@Jclark-ctr checked on this (thanks!), but this still needs to happen. One minute I could SSH to it just fine, and 12 minutes later it was alerting in Icinga again. So it keeps being "from time to time", and Chris' comment "we will need to power off the host for 10-30secs." still stands.
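
For reference, the "fine one minute, alerting 12 minutes later" pattern can be caught with a repeated reachability probe. The following is only a minimal sketch and was not run as part of this task. The function names `port_open` and `watch` are hypothetical, port 22 is assumed to be where the mgmt interface accepts SSH, and a successful TCP connect only shows the port is reachable, not that an SSH login would work.

```python
import socket
import time


def port_open(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False


def watch(host: str, checks: int, interval: float, port: int = 22) -> list[bool]:
    """Probe host:port `checks` times, sleeping `interval` seconds in between."""
    results = []
    for i in range(checks):
        results.append(port_open(host, port))
        if i < checks - 1:
            time.sleep(interval)
    return results


# Example call (hostname is a placeholder, not taken from this task):
# outcome = watch("mgmt-host.example", checks=5, interval=60)
# print(f"{sum(outcome)}/{len(outcome)} probes succeeded")
```

A mix of `True` and `False` in the returned list over a longer window would match the intermittent behaviour described above; all `True` after the power drain would suggest the fix held.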

I checked: somebody had already repooled it in the meantime. I ACKed the alert in Icinga.

Agreed with Jclark that we can do it next week when he is back onsite. It's not urgent.

Flea power was drained by Jclark.

I can SSH to mgmt (faster than before). Looking good so far.

Works for now. If we see it again we will just reopen this.