
db2034 crash
Closed, Duplicate · Public

Description

db2034 crashed and paged. When I logged in via mgmt, the following was scrolling past on the VSP. There was no other serial output, just these messages repeating:

[10605869.309381] BUG: soft lockup - CPU#31 stuck for 22s! [migration/31:292]
[10605897.339575] BUG: soft lockup - CPU#31 stuck for 22s! [migration/31:292]
[10605925.369770] BUG: soft lockup - CPU#31 stuck for 22s! [migration/31:292]
[10605953.399966] BUG: soft lockup - CPU#31 stuck for 22s! [migration/31:292]
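
For anyone comparing this against other hosts, here is a minimal sketch of pulling the fields out of the soft lockup lines above. The regex and the names `LOCKUP_RE` and `parse_lockups` are illustrative assumptions, not an existing tool. On the four lines quoted here, every event is on CPU 31 in the `migration/31` task, and the messages are roughly 28 seconds apart.

```python
import re

# Illustrative pattern for the "soft lockup" lines quoted above.
# Group names are assumptions chosen for this sketch.
LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] BUG: soft lockup - CPU#(?P<cpu>\d+) "
    r"stuck for (?P<secs>\d+)s! \[(?P<task>[^:\]]+):(?P<pid>\d+)\]"
)

def parse_lockups(text):
    """Return one dict per soft lockup line found in text."""
    events = []
    for line in text.splitlines():
        m = LOCKUP_RE.search(line)
        if m:
            events.append({
                "ts": float(m["ts"]),      # kernel timestamp, seconds since boot
                "cpu": int(m["cpu"]),
                "secs": int(m["secs"]),    # reported stuck duration
                "task": m["task"],
                "pid": int(m["pid"]),
            })
    return events

# The four lines from the console output above.
SAMPLE = """\
[10605869.309381] BUG: soft lockup - CPU#31 stuck for 22s! [migration/31:292]
[10605897.339575] BUG: soft lockup - CPU#31 stuck for 22s! [migration/31:292]
[10605925.369770] BUG: soft lockup - CPU#31 stuck for 22s! [migration/31:292]
[10605953.399966] BUG: soft lockup - CPU#31 stuck for 22s! [migration/31:292]
"""

events = parse_lockups(SAMPLE)
cpus = {e["cpu"] for e in events}
tasks = {e["task"] for e in events}
# Seconds between consecutive messages.
gaps = [b["ts"] - a["ts"] for a, b in zip(events, events[1:])]
print(cpus, tasks, [round(g, 2) for g in gaps])
```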

Event Timeline

RobH created this task. Jun 6 2016, 5:33 AM
Restricted Application added subscribers: Zppix, Aklapper. Jun 6 2016, 5:33 AM

Mentioned in SAL [2016-06-06T05:34:36Z] <robh> db2034 locked up via serial console. details on T137084, rebooting since its unresponsive to ssh or serial.

RobH added a comment. Jun 6 2016, 5:36 AM

I've rebooted the host in an attempt to bring it back online. This should be recorded in the host's history notes (we don't really have a good way to do that now).

For now I'm setting it to high priority and assigning it to @jcrespo for his review.

RobH added a comment. Jun 6 2016, 5:39 AM

P3211 has the ilom log

RobH added a comment. Jun 6 2016, 5:43 AM

mysql isn't online, but I'm not sure if it's as simple as just manually starting it, or if it has to be manually checked/synced first. Since db2034 crashed and wasn't cleanly shut down, I don't want to assume I should just restart the db/mysql service.

jcrespo triaged this task as High priority. Jun 6 2016, 6:26 AM
jcrespo moved this task from Triage to In progress on the DBA board.

It seems there was a RAID controller failure:

A controller failure event occurred prior to this power-up

We had similar issues in T130702. We may need a general upgrade of all machines of similar models.

Restricted Application added a project: Operations. Jun 6 2016, 6:30 AM
Restricted Application added a subscriber: Southparkfan.

This host being down was creating log noise due to health checks (no users affected):

https://logstash.wikimedia.org/#dashboard/temp/AVUkao15_LTxu7wl9U3S