es1019 IPMI and its management interface are unresponsive (again2)
Closed, Resolved (Public)

Description

This has now happened for the sixth time. Previous occurrences: T120689 T155691 T187530 T201132 T213422

Normally I wouldn't bother with it unless there is some maintenance scheduled, but sadly this is blocking T147074 (which affects dc ops too), so we had better go through it once again.

Stopping the server, disconnecting the cables, and reconnecting them should be enough. Please coordinate with @Marostegui, as this host is in production: we will only stop it when you are available, because otherwise we would lose control of it.

Event Timeline

Marostegui added a parent task: Restricted Task.
Marostegui moved this task from Triage to Blocked external/Not db team on the DBA board.

@Cmjohnson or @Jclark-ctr, let me know when would be a good moment to power-drain this host and I will have it ready (i.e. I will depool it).

@Marostegui Can you depool it and leave it for us to do when we get a free moment? It's an easy thing to do, but it may not happen until later in the day, after your normal working hours.

Hey!
Sorry, I was already offline when you wrote this.
These kinds of hosts are not easy to depool, so it is better if we schedule a day for this so we can organize ourselves and depool it well ahead of time.
Let me know which day would work for you - I am out next week from Monday to Wednesday.

Thanks!

@Marostegui can you please depool this? I will do it around 14:00 UTC on Thursday.

Will have it ready by Thursday - thanks!

Change 540006 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool es1019

https://gerrit.wikimedia.org/r/540006

Change 540006 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool es1019

https://gerrit.wikimedia.org/r/540006

Mentioned in SAL (#wikimedia-operations) [2019-10-03T13:40:06Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool es1019 for on-site maintenance T233698 (duration: 01m 01s)

Mentioned in SAL (#wikimedia-operations) [2019-10-03T13:44:53Z] <marostegui> Stop MySQL and shutdown es1019 for on-site maintenance - T233698

@Cmjohnson es1019 is now off and ready for you.
Once you are done, power it back on.

Thanks!

Cmjohnson claimed this task.

I powered the server off, unplugged everything, removed the PSU, drained the flea power by holding the power button for 30 seconds, and plugged everything back in. I am able to access mgmt now.

SSH to mgmt works now, confirmed. But IPMI from remote fails to establish a session. Could you please check whether "IPMI over LAN" is disabled in the BIOS, and enable it if so?

I take that comment back. It works for me now.
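
For reference, the remote IPMI check discussed above could be sketched roughly as follows. This is only an illustration, not the procedure actually used on this task: it assumes `ipmitool` is installed on the checking host, that the password is supplied through the `IPMI_PASSWORD` environment variable, and the helper names `ipmi_command` and `ipmi_reachable` are hypothetical.

```python
import subprocess


def ipmi_command(host, user, subcommand=("chassis", "status")):
    """Build an ipmitool command line for an IPMI-over-LAN session.

    -I lanplus selects the IPMI v2.0 (RMCP+) LAN interface.
    -E makes ipmitool read the password from the IPMI_PASSWORD
       environment variable instead of the command line.
    """
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E", *subcommand]


def ipmi_reachable(host, user, timeout=15):
    """Return True if a remote IPMI session can be established.

    Returns False if ipmitool is missing, the call times out, or
    ipmitool exits with a non-zero status.
    """
    try:
        result = subprocess.run(
            ipmi_command(host, user),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

Calling `ipmi_reachable("<mgmt hostname>", "<user>")` before and after the power drain would give a simple yes/no answer; the management hostname and user are deliberately left as placeholders here.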

Mentioned in SAL (#wikimedia-operations) [2019-10-04T05:08:26Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Slowly repool es1019 after on-site maintenance T233698 (duration: 00m 53s)

Mentioned in SAL (#wikimedia-operations) [2019-10-04T05:49:42Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: More traffic to es1019 after on-site maintenance T233698 (duration: 00m 51s)

Mentioned in SAL (#wikimedia-operations) [2019-10-04T05:59:31Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Fully repool es1019 after on-site maintenance T233698 (duration: 00m 51s)
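
As a rough summary of the maintenance window, the SAL timestamps above can be turned into durations with a few lines of Python. The timestamps are copied from the log entries on this task; the short labels and the helper names (`parse`, `gaps`, `total_window`) are only illustrative.

```python
from datetime import datetime

# (timestamp, short label) pairs taken from the SAL entries on this task.
SAL_EVENTS = [
    ("2019-10-03T13:40:06Z", "Depool es1019"),
    ("2019-10-03T13:44:53Z", "Stop MySQL and shutdown es1019"),
    ("2019-10-04T05:08:26Z", "Slowly repool es1019"),
    ("2019-10-04T05:49:42Z", "More traffic to es1019"),
    ("2019-10-04T05:59:31Z", "Fully repool es1019"),
]


def parse(ts):
    """Parse a SAL timestamp such as 2019-10-03T13:40:06Z (UTC)."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def gaps(events):
    """Return the time elapsed between each pair of consecutive events."""
    times = [parse(ts) for ts, _ in events]
    return [later - earlier for earlier, later in zip(times, times[1:])]


def total_window(events):
    """Return the time from the first event to the last one."""
    return parse(events[-1][0]) - parse(events[0][0])


if __name__ == "__main__":
    for (_, label), gap in zip(SAL_EVENTS[1:], gaps(SAL_EVENTS)):
        print(f"{gap} before: {label}")
    print(f"Depool to full repool: {total_window(SAL_EVENTS)}")
```

By this calculation the host spent a little over 16 hours between the depool and the full repool, with the repool itself staged over roughly 51 minutes.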