
Labstore2001 controller or shelf failure
Closed, ResolvedPublic

Description

Labstore2001 fails to pass POST after a reboot; the PERC BIOS reports having no configuration, and only 48 of 60 drives are visible. It is possible that the missing shelf causes the configuration to fail to load (the BIOS reports that a 'foreign' configuration exists but that it cannot be imported because not all drives appear present).

Please diagnose.

Details

Related Gerrit Patches:

Event Timeline

coren created this task. Jun 16 2015, 2:56 PM
coren assigned this task to Papaul.
coren raised the priority of this task from to Needs Triage.
coren updated the task description. (Show Details)
coren added a project: ops-codfw.
coren added a subscriber: coren.
Restricted Application added a project: acl*sre-team. Jun 16 2015, 2:56 PM
Restricted Application added a subscriber: Aklapper.


@coren This is what I have on the screen

coren added a comment. Jun 16 2015, 3:30 PM

Yes, that is what I have observed. If you do 'C', 'Y' to enter the config utility you will see 48 (out of 60) physical drives, and no configuration. (The configuration, apparently, is marked as 'foreign' and impossible to import because of the missing drives).

I can't tell remotely if one of the shelves is not working, the wiring is having issues, or the controller itself is at issue. A possible hint may be that the shelves are connected 4-1 on the different ports so it may be the port with just one shelf.
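The disk counts quoted in this task can be turned into a quick estimate of how many shelves are missing. This is a minimal sketch, not a diagnostic tool: the function name is hypothetical, and the 12 disks per shelf is an assumption taken from the per-array counts reported later in this thread.

```python
def missing_shelves(expected_disks: int, visible_disks: int,
                    disks_per_shelf: int = 12) -> int:
    """Return how many whole shelves' worth of disks are not visible.

    disks_per_shelf=12 is an assumption based on the per-array counts
    reported elsewhere in this task.
    """
    missing = expected_disks - visible_disks
    if missing < 0:
        raise ValueError("more disks visible than expected")
    if missing % disks_per_shelf:
        # A partial shelf points at individual drives rather than a shelf.
        raise ValueError(f"{missing} missing disks is not a whole number of shelves")
    return missing // disks_per_shelf


# 48 of 60 drives visible, as in the task description.
print(missing_shelves(60, 48))  # 1
```

With a total of 72 disks instead of 60, the same 48 visible disks would correspond to two missing shelves.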

The enclosure status LED on Labstore2002 array 2 is blinking amber (the enclosure is in fault state). When I checked the back of the enclosure, one of the power supplies was amber. If I have to turn labstore2002 array 2 off for troubleshooting, will it affect any of your operations in progress?

coren added a comment. Jun 16 2015, 4:09 PM

No, the server is not in use at this time, you can safely power it off and operate on the shelves.

Restricted Application added a subscriber: Matanya. Jun 16 2015, 4:09 PM
Papaul closed this task as Resolved. Jun 16 2015, 4:17 PM

Detaching the power supply from the enclosure and plugging it back in fixed the problem.

coren reopened this task as Open. Jun 18 2015, 11:21 AM

That seems to have been a short-lived solution as the box is no longer able to find its disks.

Visual observation: all enclosures are online, all have green lights.

The box sees 48 disks and during reboot it complains about a foreign RAID configuration. According to @coren, it's 72 disks in total, so for some reason it is not seeing 24 disks. @Papaul, can you please recheck connectivity and even restart all enclosures?

History says that reseating everything might help

Just a note that the new RAID controller should arrive today. I had them overnighted from Newegg.

@akosiaris thanks for the tip. I did that the day before yesterday and it fixed the problem, but today the same problem is back. Since I only restarted one enclosure, I will restart the other ones as well and see.
@Cmjohnson yes, I did receive the new controllers, but Coren is busy working now so we cannot swap yet.

coren added a comment. Jun 22 2015, 3:12 PM

If it is not possible to get the two shelves working today (Jun 22), then please disconnect them from labstore2001 entirely as we need a stable system as destination for a backup.

coren triaged this task as Unbreak Now! priority. Jun 22 2015, 3:15 PM

@coren I think I was waiting on you to swap the controller from labstore2001 first.

coren added a comment. Jun 22 2015, 6:39 PM

Any testing we do should be to labstore2002 - we need 2001 up now for backup. Please try to fix the broken shelves and - if you can't find the issue for sure quickly enough - simply disconnect them from 2001 entirely.

@coren do you want me to remove the old controller card or just leave it in there?

coren added a comment. Jun 22 2015, 7:13 PM

On 2001, leave the controller card as-is. We need the system in a predictable state.

Visual observation: all disks have a green light on all the shelves, and all cables attached to the shelves are in place.

On the H700 controller, it is showing 12 disks and the status is online.


On the H800 controller, it is showing 48 disks and the status is foreign {F182241}

I may be wrong, but I think the configuration needs to be done on the H800 controller for the disks attached to it.

coren added a comment. Edited Jun 24 2015, 12:39 PM

You would see all 60 disks on the H800 if everything was okay (foreign or not). This really needs to be isolated and fixed, or - alternately - the faulty shelf needs to be removed from the controller.

After reconsideration, given that there is no data on that array that is not outdated (the eqiad copy is more recent), please replace the controller of labstore2001 with the new model and rewire the shelves entirely (this will also allow us to rule wiring out).

We need to have storage available in that DC as quickly as possible.

coren added a comment. Jun 24 2015, 2:12 PM

I will provide an explicit wiring diagram shortly.

New controller in place. All shelves are connected the same way they were on the old controller.

scfc added a subscriber: scfc. Jun 24 2015, 4:19 PM
greg set Security to None.

@Papaul, @coren: Which specific items are left to do to close this task as resolved? Asking as this task has had "Unbreak now!" priority for two weeks without updates.

@Aklapper as Coren mentioned, "I will provide an explicit wiring diagram shortly." I haven't received any diagram yet.

Aklapper lowered the priority of this task from Unbreak Now! to High. Jul 7 2015, 11:22 AM

(New controller in place hence decreasing priority of this task)

Papaul lowered the priority of this task from High to Medium. Sep 23 2015, 4:52 PM
tom29739 renamed this task from Labstore2001 controler or shelf failure to Labstore2001 controller or shelf failure. Feb 25 2016, 7:55 PM
tom29739 added a subscriber: Papaul.
Restricted Application added a subscriber: Southparkfan. Jul 6 2016, 6:50 PM

Mentioned in SAL [2016-07-06T18:51:40Z] <mutante> labstore2001 - RAID failure in Icinga (is it T102626 ?)

@chasemp what do you want to do with this?


Sorry I didn't see this message sooner @Papaul. I'm not sure yet. We are deep in the middle of fixing up the setup in eqiad, and for the moment have no time to sort out this mess. I'm hopeful that early in the next quarter @madhuvishy will be able to help work this out.

Note this seems to have hit us today with a needed human intervention in codfw. This has to be next on the agenda for storage fixup.

I'm somewhat confused on the status of labstore2001 honestly.

labstore2001: sudo megacli -AdpAllInfo -aAll | grep -i PERC
Product Name : PERC H700 Integrate

vs

root@labstore1001:/tmp# megacli -AdpAllInfo -aAll | grep -i PERC
Product Name : PERC H700 Integrated
Product Name : PERC H800 Adapter

This makes me think an H800 adapter is expected on 2001 https://phabricator.wikimedia.org/T102626#1379646
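The comparison above can be made mechanical by extracting the PERC model from each host's `Product Name` lines and diffing the two lists. This is a rough sketch: the function names are hypothetical, the sample strings are copied from the output pasted above, and the regular expression assumes model names of the form `H` plus digits.

```python
import re
from typing import List

# Matches lines such as "Product Name : PERC H800 Adapter" and captures "H800".
PERC_MODEL = re.compile(r"^\s*Product Name\s*:\s*PERC\s+(H\d+)", re.MULTILINE)


def perc_models(adp_info: str) -> List[str]:
    """Return the PERC model of every adapter listed in the given text."""
    return PERC_MODEL.findall(adp_info)


def missing_models(reference: str, candidate: str) -> List[str]:
    """Return models present in the reference host but absent on the candidate."""
    seen = set(perc_models(candidate))
    return [model for model in perc_models(reference) if model not in seen]


labstore1001 = """Product Name : PERC H700 Integrated
Product Name : PERC H800 Adapter
"""
labstore2001 = """Product Name : PERC H700 Integrate
"""

print(missing_models(labstore1001, labstore2001))  # ['H800']
```

Matching on the model token rather than the full product name keeps the comparison working even when a pasted line is cut short.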

Papaul added a comment. Nov 2 2016, 2:28 PM

@chasemp we never put in the H800 adapter in labstore2001 only in labstore2002
Papaul added a comment. Jun 22 2015, 6:47 PM

@coren do you want me remove the old controller card or just leave it in there?
coren added a comment. Jun 22 2015, 7:13 PM

On 2001, leave the controller card as-is. We need the system in a predictable state.

Ok current thinking (papaul and I spoke briefly on the phone).

We do believe an H800 card is installed physically in labstore2001. We also believe an H800 card is installed physically in labstore2002. We don't know why the card in labstore2001 does not show up with sudo megacli -AdpAllInfo -aAll | grep -i PERC at the moment. We don't know if the card is detected correctly in labstore2002 either.

@madhuvishy was able to get labstore2001 to a bootable state (it takes 1.5 hours) yesterday by disabling the os-var logical volume, remounting post boot, copying to /var under os-root and rebooting. This seems to be as much of a consistent state as we've seen. (Thank you)

current lsblk output:

NAME                    MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sda                       8:0    0  1.8T  0 disk
└─sda1                    8:1    0  1.8T  0 part
  └─md0                   9:0    0  1.8T  0 raid1
    └─os-root           253:0    0 17.3G  0 lvm   /
sdb                       8:16   0  1.8T  0 disk
└─sdb1                    8:17   0  1.8T  0 part
  └─md0                   9:0    0  1.8T  0 raid1
    └─os-root           253:0    0 17.3G  0 lvm   /
sdc                       8:32   0  1.8T  0 disk
sdd                       8:48   0  1.8T  0 disk
sde                       8:64   0  1.8T  0 disk
sdf                       8:80   0  1.8T  0 disk
sdg                       8:96   0  1.8T  0 disk
sdh                       8:112  0  1.8T  0 disk
└─md126                   9:126  0  7.3T  0 raid0
sdi                       8:128  0  1.8T  0 disk
└─md126                   9:126  0  7.3T  0 raid0
sdj                       8:144  0  1.8T  0 disk
└─md126                   9:126  0  7.3T  0 raid0
sdk                       8:160  0  1.8T  0 disk
└─md126                   9:126  0  7.3T  0 raid0
sdl                       8:176  0  1.8T  0 disk
├─backup-tools--bdsync  253:5    0    8T  0 lvm
└─backup-test           253:9    0   10G  0 lvm
sdm                       8:192  0  1.8T  0 disk
└─backup-tools--bdsync  253:5    0    8T  0 lvm
sdn                       8:208  0  1.8T  0 disk
└─backup-tools--bdsync  253:5    0    8T  0 lvm
sdo                       8:224  0  1.8T  0 disk
└─backup-tools--bdsync  253:5    0    8T  0 lvm
sdp                       8:240  0  1.8T  0 disk
└─backup-tools--bdsync  253:5    0    8T  0 lvm
sdq                      65:0    0  1.8T  0 disk
└─backup-others--bdsync 253:8    0 10.9T  0 lvm
sdr                      65:16   0  1.8T  0 disk
└─backup-tools          253:1    0   10T  0 lvm   /srv/eqiad/tools
sds                      65:32   0  1.8T  0 disk
└─backup-tools          253:1    0   10T  0 lvm   /srv/eqiad/tools
sdt                      65:48   0  1.8T  0 disk
└─backup-others--bdsync 253:8    0 10.9T  0 lvm
sdu                      65:64   0  1.8T  0 disk
└─backup-others--bdsync 253:8    0 10.9T  0 lvm
sdv                      65:80   0  1.8T  0 disk
└─backup-others--bdsync 253:8    0 10.9T  0 lvm
sdw                      65:96   0  1.8T  0 disk
└─backup-others--bdsync 253:8    0 10.9T  0 lvm
sdx                      65:112  0  1.8T  0 disk
└─backup-tools          253:1    0   10T  0 lvm   /srv/eqiad/tools
sdy                      65:128  0  1.8T  0 disk
└─backup-tools          253:1    0   10T  0 lvm   /srv/eqiad/tools
sdz                      65:144  0  1.8T  0 disk
└─backup-tools          253:1    0   10T  0 lvm   /srv/eqiad/tools
sdaa                     65:160  0  1.8T  0 disk
└─backup-tools          253:1    0   10T  0 lvm   /srv/eqiad/tools
sdab                     65:176  0  1.8T  0 disk
├─backup-tools          253:1    0   10T  0 lvm   /srv/eqiad/tools
└─backup-maps           253:2    0    4T  0 lvm   /srv/eqiad/maps
sdac                     65:192  0  1.8T  0 disk
└─backup-maps           253:2    0    4T  0 lvm   /srv/eqiad/maps
sdad                     65:208  0  1.8T  0 disk
└─backup-maps           253:2    0    4T  0 lvm   /srv/eqiad/maps
sdae                     65:224  0  1.8T  0 disk
└─backup-others         253:3    0    5T  0 lvm   /srv/eqiad/others
sdaf                     65:240  0  1.8T  0 disk
└─backup-others         253:3    0    5T  0 lvm   /srv/eqiad/others
sdag                     66:0    0  1.8T  0 disk
└─backup-others         253:3    0    5T  0 lvm   /srv/eqiad/others
sdah                     66:16   0  1.8T  0 disk
└─backup-others--bdsync 253:8    0 10.9T  0 lvm
sdai                     66:32   0  1.8T  0 disk

It seems the count of presented disks differs both from the count previously reported as incorrect (https://phabricator.wikimedia.org/T102626#1379646) and from the originally expected total of 72 (https://phabricator.wikimedia.org/T102626#1379646).
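Counting the presented disks by hand from a long lsblk listing is error-prone, so it may help to parse the tree output. This is a minimal sketch under stated assumptions: the function name is hypothetical, the input is lsblk's default tree view with the column order shown above (TYPE in the sixth column), and the sample is a short excerpt of the listing above.

```python
from typing import Dict, List

# Characters lsblk uses to draw the device tree in front of child names.
TREE_CHARS = " │├└─"


def disks_with_children(lsblk_text: str) -> Dict[str, List[str]]:
    """Map each top-level disk to the names of the devices stacked on it."""
    disks: Dict[str, List[str]] = {}
    current = None
    for line in lsblk_text.splitlines():
        if not line.strip() or line.startswith("NAME"):
            continue  # skip blank lines and the header row
        fields = line.lstrip(TREE_CHARS).split()
        if line[0] not in TREE_CHARS:
            # Top-level entry: keep it only if its TYPE column says "disk".
            is_disk = len(fields) > 5 and fields[5] == "disk"
            current = fields[0] if is_disk else None
            if is_disk:
                disks[current] = []
        elif current is not None:
            disks[current].append(fields[0])
    return disks


sample = """\
NAME                    MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sda                       8:0    0  1.8T  0 disk
└─sda1                    8:1    0  1.8T  0 part
  └─md0                   9:0    0  1.8T  0 raid1
    └─os-root           253:0    0 17.3G  0 lvm   /
sdc                       8:32   0  1.8T  0 disk
"""

disks = disks_with_children(sample)
unused = [name for name, children in disks.items() if not children]
print(len(disks), unused)  # 2 ['sdc']
```

Run against the full listing, the length of the result gives the presented disk count, and the disks with no children are the ones not yet claimed by any RAID or LVM volume.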

@Papaul is headed onsite now.

The gist of all of this is: I don't trust this server at all.

I am proposing the following out of dire necessity:

  • We can hopefully stash a copy of existing backups from labstore2001 somewhere in codfw. If not we have labstore1001 online and a recent copy of tools and others on the soon-to-be cluster in eqiad of labstore1004/1005
  • We shut down labstore2001, disconnect the external arrays, check the RAID cards and boot to see status. We want to know why the H800 isn't showing up. We steal the H800 from labstore2002 if necessary. We find a way to feel confident in labstore2002 hardware (including RAID cards).
  • We reimage labstore2001 without the external storage attached. This server is currently in a partially puppetized state. We already know the actual data volumes were not puppetized, and that at some point it had nfs-kernel-server installed and running errantly. The install was also done outside of netboot.cfg so all partitioning was done by hand. This current OS install is known-bad to me.
  • We use the internal H700 (assuming it's good) and present hw RAID 1 to the installer for the OS only.
  • Now labstore2001 is up on hw RAID for OS with known good status RAID cards and OS and puppetization
  • We attach the external storage arrays one-by-one and go through them to see what is good and what is not. We exclude what is not and figure out which shelves are actually good. We have at least 1 bad drive in one shelf, an unknown number of bad shelves, and a probably bad battery backup.
  • We reinitialize backups onto known-good storage
  • We order a replacement H800 card for standby (assuming only 1 of the current 2 is good)
  • We need to order new hw for this setup soon-ish anyway. I believe this is all out of warranty and while this in theory ends in a consistent state it's not a long-term supportable one.
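
The one-by-one shelf check in the plan above can be tracked with a small helper. This is only a sketch of the bookkeeping, not something that was run for this task: the function name and shelf labels are hypothetical, and it assumes each good shelf adds exactly 12 visible disks and that a shelf judged bad is detached again before the next one is tried.

```python
from typing import List, Tuple


def classify_shelves(observations: List[Tuple[str, int]],
                     baseline: int = 0,
                     disks_per_shelf: int = 12) -> Tuple[List[str], List[str]]:
    """Split shelves into (good, suspect) from disk counts seen after each attach.

    observations holds (shelf_label, visible_disks) pairs in attach order.
    baseline is the disk count on the controller before any shelf is attached.
    """
    good: List[str] = []
    suspect: List[str] = []
    accepted = baseline  # disks visible from shelves accepted so far
    for shelf, visible in observations:
        if visible - accepted == disks_per_shelf:
            good.append(shelf)
            accepted = visible
        else:
            # The shelf is assumed to be detached again, so the count stays put.
            suspect.append(shelf)
    return good, suspect


# Hypothetical run: the second shelf adds no disks when attached.
print(classify_shelves([("shelf-a", 12), ("shelf-b", 12), ("shelf-c", 24)]))
# (['shelf-a', 'shelf-c'], ['shelf-b'])
```

A shelf that adds fewer than 12 disks lands in the suspect list as well, which would cover the case of a shelf with a single bad drive.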

Change 319591 had a related patch set uploaded (by Madhuvishy):
labstore: Configure labstore-lvm-raid partman recipe for labstore2001

https://gerrit.wikimedia.org/r/319591

Change 319591 merged by Madhuvishy:
labstore: Configure labstore-lvm-raid partman recipe for labstore2001

https://gerrit.wikimedia.org/r/319591

Papaul added a comment. Nov 3 2016, 7:12 PM

Replaced the bad disk on labstore2002 with one disk from labstore2002. When I receive the new disk I will put it into labstore2002.

Before installation:
The H700 is showing all 12 disks,
and the second controller is showing 24 disks: 12 disks from labstore-array0 and 12 disks from labstore-array1.

We disconnected all arrays from labstore2001 and did a fresh OS install.
The output of sudo megacli -AdpAllInfo -aAll | grep PERC was still showing only the H700.
We removed the controller from labstore2002 and placed it in labstore2001; same output.
We removed the controller and replaced it with another type of controller, and we had both the H700 and the H800 showing.

We found 2 issues:
1 - The | grep PERC filter lists only Dell PERC RAID controllers, and the controller that was in the system was a 3ware LSI controller, not a Dell PERC controller.
2 - The drivers for the 3ware controller were not installed.
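
A filter on a vendor string hides any controller whose name lacks that string, so a vendor-neutral cross-check can be useful. The sketch below looks at PCI devices instead. It was not used in this task, the function name is hypothetical, and it assumes the text passed in is the default output of `lspci` (one `slot class: device` line per device) and that RAID cards report the PCI class name `RAID bus controller`.

```python
from typing import List, Tuple


def raid_controllers(lspci_text: str) -> List[Tuple[str, str]]:
    """Return (slot, device description) for every RAID-class PCI device."""
    controllers: List[Tuple[str, str]] = []
    for line in lspci_text.splitlines():
        # Each line is expected to look like "<slot> <class>: <device>".
        slot, _, rest = line.partition(" ")
        pci_class, sep, device = rest.partition(": ")
        if sep and pci_class == "RAID bus controller":
            controllers.append((slot, device.strip()))
    return controllers
```

If this returns more controllers than the vendor tool reports, the extra card is physically present but not handled by that tool or its driver.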

@chasemp
If we want to use the 3ware LSI controller, we have to start by installing the drivers.
If we decide to keep the H800, no problem, but I don't see why we would keep the H800 if we already have the 3ware LSI, which is powerful in both speed and the number of disks it can support (127 disks).

Dzahn removed a subscriber: Dzahn. Nov 3 2016, 7:19 PM

@Papaul Thanks for the summary.

As far as I'm aware, all the labstore boxes and the majority of the rest of prod storage rely on the H800 controllers, and we have them well supported. According to a comment on a similar thread about the 3ware cards (https://phabricator.wikimedia.org/T127490#2157378):

It /is/ possible to deal with this card and add support for it in check-raid etc., but if this is a one-off, let's please not and stick with our well-known ones.

Also, as for performance, I compared specs on http://www.dell.com/learn/us/en/04/campaigns/dell-raid-controllers and http://www.newegg.com/Product/Product.aspx?Item=N82E16816118140, and it seems like they both support up to 6G SAS transfer rate and RAID 0/1/5/6/10/50. The H800 seems to support 192 drives max, and the 3ware 127. Based on this, I'm not sure there is any performance advantage to the 3ware controllers.

Given all this, I think we should stick to using H800's for this box, for consistency and to not add new hardware for us to support and maintain for the labstore boxes.

Papaul added a comment. Nov 3 2016, 8:30 PM

It is your call. I don't know anything about the lab setup, so I will follow whatever you want for the box.

@Papaul I wanted to make sure our current thought process is documented. If the H800's work with our storage shelves and assuming we have spares, let's go with using them. Thank you :)

Papaul added a comment. Nov 3 2016, 9:21 PM

@madhuvishy @chasemp please see the link below for why the H800 needed replacement.
https://rt.wikimedia.org/Ticket/Display.html?id=9430

Thanks for the ticket @Papaul. Helped me understand a bit of the history.

What I understand so far - The ticket claims that

The Dell H800 controllers have been giving odd, intermittent and outright freaky failures at random times. In pretty much all isolated cases, the fails appear isolated but there is a 30-month pattern of those on pretty much every labstore so I think it's wise to look for alternatives.

It looks like Chris looked for alternatives and found the 3ware controllers, and 4 were bought - 2 for eqiad and 2 for dallas. The 2 for eqiad seem to have never been installed. Labstore100[1-3] all have the H800's. There was only one installed in Dallas for labstore2001.

I tried to find phab tickets that reported these failures but all I found was T97688, T123740, and this ticket itself - I'm not sure what to conclude from these. If it was that the H800s were indeed unreliable and faulty, I'm not sure what the results of testing the new 3ware cards were, since it seems to have been experimentally put in for 2001 and nowhere else (T102786) - so not entirely convinced these are proven better in any way so far.

Papaul added a comment. Nov 4 2016, 3:06 PM

I think we need to have labstore2001 back up and running on the H800 controller for now, and work on making the 3ware controller work on labstore2002.

Papaul added a comment. Nov 7 2016, 2:04 PM

@madhuvishy I rebuilt the RAID; you should now be able to see all disks on the H800. Let me know if you have any other questions.

Thanks

madhuvishy added a comment. Edited Nov 7 2016, 10:24 PM

Thanks @Papaul! Labstore2001 is now up and running with 12 internal disks connected via H700 raid controller, and 48 external disks across 4 shelves connected via H800 raid controller.

Papaul closed this task as Resolved. Nov 16 2016, 3:44 PM

Closing this since the system is back online.