
labstore1002 issues while trying to reboot
Closed, Resolved · Public

Description

While trying to bring labstore1002 up this morning, it failed to boot again. After three tries, it finally passed POST but as soon as the OS started up it started getting I/O errors from the H800 controller.

It's not immediately clear whether the issue is with the server or the controller but given that they are attached to the live DAS serving as primary store for all of Labs, it seemed unwise to keep it up and risk issues impacting the shelves.

We need to diagnose this in a safe way.

Event Timeline

coren raised the priority of this task to High.
coren updated the task description.
coren added projects: Cloud-VPS, ops-eqiad.
coren subscribed.
Restricted Application added a subscriber: Aklapper.

@mark suggested it might be worthwhile to ensure that the labstores and their shelves are all on the same phase, to rule out an electrically induced problem.

I verified that both labstores and their shelves are on the same phase.

-CJ

Any updates on this? It's currently the main labstore server - is it considered reliable now? Did we swap out any hardware?

@yuvipanda: No, the hardware is known to have issues - though to date it has always been fully working once it gets working at all (all the issues seem to happen when the machine boots).

That said - I don't trust the controller one whit. I do want to move away from labstore1002 as quickly as possible (see T106479) and actually diagnose the controller. I suggest ripping it out and burning it as a diagnosis method :-)

Heh, it did thankfully work when it was rebooted last time accidentally.

(thankfully - let's not do that again, etc)

So...should we close this as invalid or ?

No, we still haven't done serious testing of that hardware (viz. T101471) so at best labstore1002 is a dubiously reliable backup atm. This task reflects one of the issues with that server that needs to be confirmed and isolated.

I'm attempting to reproduce the issue without any shelf attached as a first attempt at isolating where the actual issue lies. If it happens with no shelf attached, then we know for sure it's the controller that is wonky.

I've gotten the problem to reproduce once out of 17 attempts, with POST stalling at "F/W Initializing Devices 0%", which is the original issue.

FWIW, I've alternated between three methods when trying to reproduce:

  • racadm serveraction hardreset
  • racadm serveraction powercycle
  • reboot from the OS

The problem happened with a powercycle - but a point of note is that when we did the original switch to labstore1002, it occurred with hardreset too.
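For anyone repeating this, the alternation between the three methods can be sketched as a small loop that tallies attempts and stalls per method. This is only an illustration, not the procedure actually used here: the `reproduce` function, the `booted_ok` check (for example, someone watching the console for the POST stall) and the plain `reboot` argv for the OS case are all assumptions; only the two `racadm serveraction` commands come from the list above.

```python
import itertools
import subprocess
from collections import Counter

# The three reset methods from the list above. How each command reaches the
# machine (iDRAC, ssh, local shell) is an assumption and is left to `trigger`.
METHODS = {
    "hardreset": ["racadm", "serveraction", "hardreset"],
    "powercycle": ["racadm", "serveraction", "powercycle"],
    "os-reboot": ["reboot"],  # hypothetical argv for "reboot from the OS"
}


def run_command(argv):
    """Default trigger: run the command and raise if it exits non-zero."""
    subprocess.run(argv, check=True)


def reproduce(attempts, trigger, booted_ok):
    """Cycle through METHODS for `attempts` resets.

    `trigger(argv)` performs the reset; `booted_ok()` is a hypothetical
    check that returns False when POST stalled. Returns two Counters:
    attempts per method and stalls per method.
    """
    tries = Counter()
    stalls = Counter()
    for _, name in zip(range(attempts), itertools.cycle(METHODS)):
        trigger(METHODS[name])
        tries[name] += 1
        if not booted_ok():
            stalls[name] += 1
    return tries, stalls
```

Passing `run_command` as `trigger` would issue the real commands; a stub can be passed instead for a dry run. Keeping per-method counts makes it easier to see whether stalls cluster on one kind of reset.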

@Cmjohnson: Next step is to swap out the H800 from that server and spend a couple hours rebooting it to see if we can feel confident it doesn't happen again. I'll do the latter if you can handle the former. :-)
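As a rough guide to how many clean reboots it takes to "feel confident" after the swap: if the old failure rate really were about 1 in 17 per reboot and reboots were independent (both are assumptions; one stall in 17 tries is a very thin sample), the chance of seeing n clean boots in a row with the fault still present is (16/17)^n. The hypothetical helper below finds the smallest n that pushes that chance under a chosen threshold.

```python
import math


def reboots_needed(failure_rate, max_miss_probability):
    """Smallest n such that (1 - failure_rate) ** n <= max_miss_probability.

    Assumes each reboot fails independently with the same probability,
    which is a simplification and not something established in this task.
    """
    if not 0 < failure_rate < 1:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0 < max_miss_probability < 1:
        raise ValueError("max_miss_probability must be strictly between 0 and 1")
    return math.ceil(math.log(max_miss_probability) / math.log(1 - failure_rate))


# With the 1-in-17 rate observed above and a 5% threshold:
print(reboots_needed(1 / 17, 0.05))  # 50
```

Under those assumptions, around 50 consecutive clean reboots would be needed before a still-faulty setup had less than a 5% chance of slipping through.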

@coren: we no longer have the spare H800 we used in labstore1001 last month.

@Cmjohnson: We need to get our hands on another then; perhaps put in a req for one, or is there one in Dallas we could ship across?

@Cmjohnson: so where is the original controller in labstore1001 now, and what was wrong with it?

If it's not usable, let's order another one new ASAP.

I do not think there were any real problems with it but @coren wanted to replace it jic there was a h/w issue.

@Cmjohnson, do we have any ticket or other communication on that? I'm wondering what the issue was; I'm only aware of issues with labstore1002.

Either way, let's get at least one new H800 card in ASAP.

RobH mentioned this in Unknown Object (Task). Dec 18 2015, 6:13 PM
RobH removed a project: procurement.
RobH set Security to None.
RobH subscribed.

I've created T121893 for the pricing/quoting/ordering of the controllers. Keeping that order to a sub-task will allow this troubleshooting task to remain public.

The new H800 card has been installed. We should probably schedule a time/day to move to ls1002.

chasemp closed subtask Unknown Object (Task) as Resolved. Feb 18 2016, 4:12 PM