analytics1069 mgmt interface intermittently goes up and down
Closed, Resolved (Public)

Description

Hi!

For some reason I keep seeing analytics1069's mgmt interface popping up in icinga from time to time, maybe a faulty cable?

Thanks!!

Event Timeline

Marostegui triaged this task as Medium priority. Sep 27 2021, 4:30 AM
Marostegui added a subscriber: razzi.

Replaced the cable but still don't have access. This server will require me to power it off and drain the flea power, which has been the standard fix for these types of issues.

@BTullis @razzi can you sync with Chris to perform this maintenance in the next few days?

Cmjohnson added subscribers: Jclark-ctr, Cmjohnson.

@BTullis or @razzi, please coordinate with @Jclark-ctr next week. @Jclark-ctr, this server needs its flea power drained: power off, remove the power cables, unseat the power supplies, hold the power button for 20-30 seconds, then plug everything back in and power on. This should correct the iDRAC issue.

Yes, I'm more than happy to help out on this. @Jclark-ctr, if you have a suggested time when you'd like to do the work, I'll sort out downtime and shut down the host ahead of time.

@BTullis I am available tomorrow morning at 2:00 PM UTC (10 AM EST).

Mentioned in SAL (#wikimedia-operations) [2021-10-12T13:11:40Z] <btullis@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on analytics1069.eqiad.wmnet with reason: draining flea power T291732

Mentioned in SAL (#wikimedia-operations) [2021-10-12T13:11:47Z] <btullis@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on analytics1069.eqiad.wmnet with reason: draining flea power T291732

Host is not booting cleanly.
We get an error from /dev/sdc on boot and it asks for the root password for maintenance.
dmesg shows this:

[  105.195864] sd 0:2:2:0: [sdc] tag#500 BRCM Debug mfi stat 0x2d, data len requested/completed 0x10000/0x0
[  105.195875] sd 0:2:2:0: [sdc] tag#500 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[  105.195878] sd 0:2:2:0: [sdc] tag#500 Sense Key : Medium Error [current] 
[  105.195880] sd 0:2:2:0: [sdc] tag#500 Add. Sense: No additional sense information
[  105.195884] sd 0:2:2:0: [sdc] tag#500 CDB: Read(16) 88 00 00 00 00 00 00 80 08 00 00 00 00 80 00 00
[  105.195886] print_req_error: I/O error, dev sdc, sector 8390656
[  116.997740] sd 0:2:2:0: [sdc] tag#762 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[  117.015978] sd 0:2:2:0: [sdc] tag#762 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[  117.032014] sd 0:2:2:0: [sdc] tag#762 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[  117.052004] sd 0:2:2:0: [sdc] tag#762 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[  117.067992] sd 0:2:2:0: [sdc] tag#762 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[  117.088006] sd 0:2:2:0: [sdc] tag#762 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[  117.088016] sd 0:2:2:0: [sdc] tag#762 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[  117.088019] sd 0:2:2:0: [sdc] tag#762 Sense Key : Medium Error [current] 
[  117.088022] sd 0:2:2:0: [sdc] tag#762 Add. Sense: No additional sense information
[  117.088026] sd 0:2:2:0: [sdc] tag#762 CDB: Read(16) 88 00 00 00 00 00 00 80 08 48 00 00 00 08 00 00
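The two `CDB: Read(16)` lines above can be decoded to check that they agree with the sector numbers and request sizes in the rest of the log. The following is a minimal sketch, assuming the standard READ(16) layout (opcode 0x88, big-endian 64-bit LBA in bytes 2-9, big-endian 32-bit block count in bytes 10-13) and the 512-byte logical sector size that lshw reports for this disk. The function name `decode_read16` is made up for illustration.

```python
import struct


def decode_read16(cdb_hex: str, sector_size: int = 512) -> dict:
    """Decode a SCSI READ(16) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    # Bytes 2-9 hold the starting LBA, bytes 10-13 the number of blocks to read.
    lba, blocks = struct.unpack(">QI", cdb[2:14])
    return {"lba": lba, "blocks": blocks, "bytes": blocks * sector_size}


# The two CDBs copied from the dmesg output (tag#500 and tag#762).
first = decode_read16("88 00 00 00 00 00 00 80 08 00 00 00 00 80 00 00")
second = decode_read16("88 00 00 00 00 00 00 80 08 48 00 00 00 08 00 00")

print(first)   # {'lba': 8390656, 'blocks': 128, 'bytes': 65536}
print(second)  # {'lba': 8390728, 'blocks': 8, 'bytes': 4096}
```

Under those assumptions the first request starts at sector 8390656 and asks for 0x10000 bytes, and the second asks for 0x1000 bytes, which matches the `data len requested` values and the `print_req_error` sector in the log.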

lshw -class disk shows this snippet for that disk.

*-disk:5
     description: SCSI Disk
     product: PERC H730 Mini
     vendor: DELL
     physical id: 2.2.0
     bus info: scsi@0:2.2.0
     logical name: /dev/sdc
     version: 4.27
     serial: 00e8b77f0502e7cb2000dc00aea06d86
     size: 3725GiB (4TB)
     capabilities: gpt-1.00 partitioned partitioned:gpt
     configuration: ansiversion=5 guid=672f14c1-6f49-4bc1-a651-8157035ee300 logicalsectorsize=512 sectorsize=512

I quit out of the maintenance prompt with Ctrl-D but it failed at fsck again.

Reloading system manager configuration
Starting default target
[ 1167.465725] print_req_error: I/O error, dev sdc, sector 8390656
[ 1178.934303] print_req_error: I/O error, dev sdc, sector 8390728
[ 1178.940913] Buffer I/O error on dev sdc1, logical block 1048585, async page read
[ 1179.030301] print_req_error: I/O error, dev sdc, sector 8390728
[ 1179.036910] Buffer I/O error on dev sdc1, logical block 1048585, async page read
You are in emergency mode. After
Give root password for maintenance
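The whole-disk sector numbers and the `sdc1` logical block numbers in this output describe the same location. A small sketch of the arithmetic, assuming 512-byte sectors (as reported by lshw), a 4096-byte buffer block size, and a partition that starts at sector 2048. The last two values are assumptions, not taken from the task, although they do reproduce the numbers in the log.

```python
SECTOR_SIZE = 512       # from lshw: logicalsectorsize=512
BLOCK_SIZE = 4096       # assumption: block size used for "logical block" on sdc1
PARTITION_START = 2048  # assumption: first sector of sdc1 on the disk


def sector_to_partition_block(sector: int,
                              part_start: int = PARTITION_START,
                              block_size: int = BLOCK_SIZE,
                              sector_size: int = SECTOR_SIZE) -> int:
    """Map a whole-disk sector number to a logical block inside the partition."""
    sectors_per_block = block_size // sector_size
    return (sector - part_start) // sectors_per_block


print(sector_to_partition_block(8390728))  # 1048585
print(sector_to_partition_block(8390656))  # 1048576
```

With these assumptions, sector 8390728 maps to logical block 1048585, the block named in the `Buffer I/O error` lines, and the first failing sector 8390656 maps to block 1048576, which would be 4 GiB into the partition.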

It's doing this repeatedly. I will see if there is a megacli command that I can use to set the disk to offline.

One thing that I do when this happens is to enter the root password, comment out the disk's entry in /etc/fstab, and then power cycle the host. In theory the OS should boot fine, and once it has booted a Phabricator task for the broken disk should be opened (once Nagios checks it, I mean).
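As an illustration of the fstab step, here is a minimal sketch that comments out matching entries in fstab-style text. It works on a string rather than on the real file, and the sample entries (UUID and mount point) are hypothetical, since the task does not show this host's /etc/fstab. The helper name `comment_out_fstab_entries` is also made up for this example.

```python
def comment_out_fstab_entries(fstab_text: str, needle: str) -> str:
    """Prefix '# ' to every active fstab line whose device or mount point equals needle."""
    result = []
    for line in fstab_text.splitlines():
        stripped = line.strip()
        fields = stripped.split()
        is_active = bool(stripped) and not stripped.startswith("#")
        # Fields 1 and 2 of an fstab line are the device spec and the mount point.
        if is_active and needle in fields[:2]:
            result.append("# " + line)
        else:
            result.append(line)
    return "\n".join(result) + "\n"


# Hypothetical fstab content, for illustration only.
sample = (
    "UUID=aaaa-1111 / ext4 errors=remount-ro 0 1\n"
    "/dev/sdc1 /mnt/example-data ext4 defaults,noatime 0 2\n"
)

print(comment_out_fstab_entries(sample, "/dev/sdc1"), end="")
```

If the real entry refers to the disk by UUID or label instead of by device name, the value passed as `needle` would need to be that string, or the mount point.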

Great, thanks @elukey - I had got as far as looking at various megacli commands, but as far as the RAID controller was concerned everything was fine. It's just the physical disk that is throwing the errors.
I'll comment it out and reboot as you suggest.

Performed the flea power drain: powered off, removed the power cables, unseated the power supplies, held the power button for 20-30 seconds, then plugged everything back in and powered on.

Sorry for the delay in updating on my part; the datacenter wifi has been down on site.