
Can't SSH to mw1290.mgmt
Open, Normal, Public


The host itself looks OK.

Event Timeline

jijiki created this task. Sun, Sep 29, 6:21 AM

From time to time the mgmt/iDRAC becomes unresponsive; when that happens we will need to power off the host for 10-30secs. Please depool this server and we'll take care of it.

Dzahn added a subscriber: Dzahn. Mon, Sep 30, 9:09 PM


17:09 <+logmsgbot> !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1290.eqiad.wmnet

Dzahn assigned this task to Jclark-ctr. Mon, Sep 30, 9:09 PM
Dzahn triaged this task as Normal priority.
Dzahn added a comment. Mon, Sep 30, 9:36 PM

Downtimed in Icinga for an hour: mgmt, the server, and all services on them.

On the first attempt the server came back just fine, but mgmt was not fixed yet. We are trying a second time and will leave it off longer.

Mentioned in SAL (#wikimedia-operations) [2019-09-30T21:47:55Z] <mutante> mw1290 - scap pull to get it in sync with latest deployment - it was down during scap run for T234153

Dzahn added a comment. Mon, Sep 30, 9:50 PM

Rebooting it unfortunately did not fix mgmt yet. It is currently pooled again.

@Dzahn the server needs to be powered off and power removed. Can you depool again and leave it depooled for 24 hours, please? I will update the task once complete.

Dzahn added a comment. Thu, Oct 3, 7:40 PM

@Cmjohnson Depooled and scheduled an Icinga downtime for about 2 days. Go ahead.

Mentioned in SAL (#wikimedia-operations) [2019-10-03T19:40:37Z] <mutante> mw1290 - depooled and scheduled downtime in Icinga for hardware maintenance T234153

Dzahn added a comment. Edited Thu, Oct 10, 6:16 PM

@Jclark-ctr checked on this (thanks!), but this still needs to happen. One minute I could SSH to it just fine, and 12 minutes later it was alerting in Icinga again. So it keeps being "from time to time", and Chris' comment "we will need to power off the host for 10-30secs." still stands.

I checked and it had already been repooled by somebody in the meantime; I ACKed it in Icinga.

Agreed with Jclark that we can do it next week when he is back onsite. It's not urgent.
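
For anyone who wants to confirm the "works one minute, alerts 12 minutes later" pattern before the power drain, here is a rough sketch of a TCP probe. It only checks whether port 22 accepts a connection, which is an assumption about what "can SSH" means here and is not how the Icinga check works. The function names, the 60-second interval, and the "up/down/flapping" labels are made up for illustration, and the mgmt hostname has to be supplied by the caller.

```python
import socket
import time
from typing import List


def port_is_open(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def summarize(samples: List[bool]) -> str:
    """Classify probe results (illustrative labels, not Icinga states)."""
    if all(samples):
        return "up"
    if not any(samples):
        return "down"
    return "flapping"


def watch(host: str, rounds: int = 15, interval: float = 60.0) -> str:
    """Probe the host `rounds` times, sleeping `interval` seconds in between."""
    samples = []
    for i in range(rounds):
        samples.append(port_is_open(host))
        if i < rounds - 1:
            time.sleep(interval)
    return summarize(samples)
```

Calling `watch("<mgmt hostname>")` with the defaults covers roughly 15 minutes, which should be long enough to catch the gap described above. A result of "flapping" would match the intermittent behaviour, while a steady "up" over one run does not prove the interface is healthy.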