db2060 not accessible
Closed, Resolved · Public

Description

db2060 paged for disk space.

This is the iLO console output:

[612121.400194] sd 0:1:0:0: rejecting I/O to offline device
[612121.425964] sd 0:1:0:0: rejecting I/O to offline device
db2060 login: [612139.042379] sd 0:1:0:0: rejecting I/O to offline device
[612139.290258] sd 0:1:0:0: rejecting I/O to offline device
[612155.992074] sd 0:1:0:0: rejecting I/O to offline device
[612156.310094] sd 0:1:0:0: rejecting I/O to offline device
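For triage, console lines like the ones above can be tallied per SCSI address to confirm that a single device is affected. This is only a minimal sketch: it assumes the console output has been saved as text, and the names `OFFLINE_RE`, `SAMPLE`, and `count_offline_rejections` are illustrative, not part of any existing tooling.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   [612121.400194] sd 0:1:0:0: rejecting I/O to offline device
# A prefix (for example a "db2060 login: " prompt) may precede the timestamp,
# so the pattern is searched for anywhere in the line.
OFFLINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+sd\s+(?P<dev>\d+:\d+:\d+:\d+):\s+"
    r"rejecting I/O to offline device"
)

# The six lines quoted in the task description.
SAMPLE = """\
[612121.400194] sd 0:1:0:0: rejecting I/O to offline device
[612121.425964] sd 0:1:0:0: rejecting I/O to offline device
db2060 login: [612139.042379] sd 0:1:0:0: rejecting I/O to offline device
[612139.290258] sd 0:1:0:0: rejecting I/O to offline device
[612155.992074] sd 0:1:0:0: rejecting I/O to offline device
[612156.310094] sd 0:1:0:0: rejecting I/O to offline device
"""


def count_offline_rejections(text):
    """Return a Counter mapping SCSI address to the number of rejected-I/O lines."""
    counts = Counter()
    for line in text.splitlines():
        match = OFFLINE_RE.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts


if __name__ == "__main__":
    for device, hits in count_offline_rejections(SAMPLE).items():
        print(device, hits)
```

On the sample, every rejection is attributed to the single address `0:1:0:0`, which is consistent with the whole logical volume going offline rather than one disk among several.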

Restricted Application added a subscriber: Aklapper. Jan 24 2017, 7:02 PM
Marostegui assigned this task to Papaul. Jan 25 2017, 7:38 AM
Marostegui added a project: ops-codfw.
Marostegui added a subscriber: Papaul.

I rebooted the server because it wasn't responding.
Neither the iLO logs nor the system logs show anything.
However, as can be seen in the original ticket description, yesterday it was reporting I/O errors, which suggests that the storage went away. This is the second time this has happened.

@Papaul can we get HP to replace the RAID controller?

Restricted Application added a project: Operations. Jan 25 2017, 7:38 AM

I have restarted db2060 because there is no reason to have MySQL down there. If the data were corrupted, which I don't believe it is, we would find out; if it isn't, we lose nothing. We can always reimage the host if necessary, but there is no advantage to letting it fall out of date; the only cost is the few extra minutes it takes to shut it down.

jcrespo triaged this task as "Normal" priority. Jan 25 2017, 6:23 PM
Marostegui moved this task from Triage to Next on the DBA board. Jan 26 2017, 11:05 AM
Marostegui moved this task from Next to Blocked external/Not db team on the DBA board.
Papaul added a comment. Feb 3 2017, 8:01 PM

I will need a maintenance window set for this system on Monday from 10am to 1pm for the controller replacement. Thanks
Dear Mr Papaul Tshibamba,

Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below.

Your request is being worked on under reference number 5316997127
Status: Case is generated and in Progress

Product description: HP ProLiant DL380p Gen8 12 LFF Configure-to-order Server
Product number: 665552-B21
Serial number: 2M245205H2
Subject: DL380p Gen8 - Controller Failure

Yours sincerely,
Hewlett Packard Enterprise

I will need a maintenance window set for this system on Monday from 10am to 1pm for the controller replacement. Thanks

Thanks!
No problem, I will make sure the system is down for that time :-)

Change 335980 had a related patch set uploaded (by Marostegui):
db-codfw.php: Depool db2060

https://gerrit.wikimedia.org/r/335980

Change 335980 merged by jenkins-bot:
db-codfw.php: Depool db2060

https://gerrit.wikimedia.org/r/335980

Mentioned in SAL (#wikimedia-operations) [2017-02-06T07:01:25Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Depool db2060 - T156161 (duration: 00m 40s)

Mentioned in SAL (#wikimedia-operations) [2017-02-06T14:19:01Z] <marostegui> Stop MySQL and shutdown db2060 for maintenance - T156161

@Papaul db2060 is now off so you can proceed whenever you want.
Thanks!

I have restarted MySQL on db2060, as the maintenance is probably done by now. Waiting for replication to catch up, and for Manuel to resolve the task if he thinks it is OK.
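The task does not say how replication catch-up was checked. As a hedged sketch only, one way is to inspect a row of `SHOW SLAVE STATUS` output, using its `Slave_IO_Running`, `Slave_SQL_Running`, and `Seconds_Behind_Master` columns. The function name `replica_caught_up`, the `max_lag_seconds` threshold, and the sample values are assumptions made for illustration.

```python
def replica_caught_up(status, max_lag_seconds=0):
    """Return True if both replication threads run and lag is within bounds.

    `status` is a dict holding one row of SHOW SLAVE STATUS output,
    keyed by column name.
    """
    if status.get("Slave_IO_Running") != "Yes":
        return False
    if status.get("Slave_SQL_Running") != "Yes":
        return False
    lag = status.get("Seconds_Behind_Master")
    # The server reports NULL (None here) when the lag cannot be determined,
    # for example while the SQL thread is stopped.
    if lag is None:
        return False
    return int(lag) <= max_lag_seconds


if __name__ == "__main__":
    # Hypothetical values, not taken from db2060.
    row = {
        "Slave_IO_Running": "Yes",
        "Slave_SQL_Running": "Yes",
        "Seconds_Behind_Master": 120,
    }
    print(replica_caught_up(row))
    print(replica_caught_up(row, max_lag_seconds=300))
```

A host that has been powered off for a maintenance window starts with a large lag, so repooling would typically wait until a check like this passes with a small threshold.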

Thanks @jcrespo! Unfortunately, the last I heard from @Papaul was that the HP technician hadn't shown up (he was still waiting for him), so I asked him to turn the server on once he was done, whether or not the HP technician came.
Let's wait for @Papaul to confirm if the controller was changed.

Thanks

Papaul added a comment. Feb 7 2017, 4:04 PM

Unfortunately the HP tech didn't show up. I'm following up with HP on the case.

Thanks @Papaul - I will leave the server depooled so we can shut it down anytime once you've arranged another day and time.

Papaul added a comment. Feb 7 2017, 4:56 PM

The service was canceled; according to HP, they couldn't get in touch with me, which is not true because I didn't receive any calls or emails from them. Another service call is set for Thursday the 9th between 9am and 11am.

Thanks for the heads up! I will get the server ready by Thursday then!
Thank you!

Mentioned in SAL (#wikimedia-operations) [2017-02-09T16:17:17Z] <marostegui> Shutdown db2060 for maintenance - T156161

The server is off now. Feel free to turn it on once it is all done (or if the HP technician doesn't show up again)
Thank you!

Papaul added a subscriber: RobH. Feb 9 2017, 8:46 PM

Unfortunately, once again the tech didn't show up as scheduled between 9am and 11am. I called HP to find out why, but they couldn't tell me the reason the tech didn't show up. I ended up requesting that the part be sent to me so I could replace it myself, but the HP support person assured me that he has escalated the issue and will have an HP engineer scheduled directly, rather than a third party.

Note: after I left the site around 2:45pm, I got a call from a tech (Joe Franch) telling me that he was on site to replace the main board. It looks like there is no communication between HP and their dispatch team.

@Marostegui: I didn't want to set up the dispatch for tomorrow since it is Friday, so I set it up for Monday.
@RobH: All the information you need is on this task; let me know if you have any questions.

new dispatch:
Dear Mr Papaul Tshibamba,

Thank you for contacting Hewlett Packard Enterprise for your service request (5316997127).

We have scheduled your onsite task (5316997127-533). The onsite delivery will occur before 2017-02-13 13:00:00. We will notify you again when a technician is on the way.

If you have any questions or wish to reschedule please contact us dispatch.support@hpe.com.

Yours sincerely,
Hewlett Packard Enterprise
ref:_00Dd0bUlK._50027olxWC:ref

Thanks @Papaul for handling all this.
I will get the server ready for you again on Monday.

Mentioned in SAL (#wikimedia-operations) [2017-02-13T13:57:00Z] <marostegui> Shutdown db2060 for maintenance - T156161

@Papaul db2060 is now off. Once you are done, please power it on and I will take it from there.
Thanks!

Papaul reassigned this task from Papaul to Marostegui. Feb 13 2017, 5:47 PM

Main board replacement complete.

Thanks! I will take it from here!

Marostegui closed this task as "Resolved". Feb 14 2017, 6:48 AM

I am going to close this for now, but will leave the server depooled for the next few days. If we see this happening again, we'll reopen it.

Thanks for your help!

Mentioned in SAL (#wikimedia-operations) [2017-02-16T07:54:12Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Repool db2060 - T156161 (duration: 00m 44s)

I have repooled the server.