Reimage db2047 - check for hardware errors
Closed, Resolved (Public)

Description

[Original title: db2047 froze up and had to be hard rebooted (possible hardware error)]

I'm not sure what's wrong with it, but it does not look like purely a software issue. The host was totally unresponsive; I was able to reboot it via iLO.

Console was stuck with:

[5325460.730396] BUG: soft lockup - CPU#0 stuck for 22s! [migration/0:138]
[5325460.790463] BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:146]
[5325460.806480] BUG: soft lockup - CPU#3 stuck for 22s! [migration/3:151]
[5325460.818494] BUG: soft lockup - CPU#4 stuck for 22s! [migration/4:156]
[5325460.834512] BUG: soft lockup - CPU#5 stuck for 22s! [migration/5:161]
[5325460.846526] BUG: soft lockup - CPU#6 stuck for 22s! [migration/6:166]
[5325460.910595] BUG: soft lockup - CPU#7 stuck for 22s! [migration/7:171]
[5325460.922608] BUG: soft lockup - CPU#8 stuck for 22s! [migration/8:176]
[5325460.938626] BUG: soft lockup - CPU#9 stuck for 22s! [migration/9:181]
[5325461.303028] BUG: soft lockup - CPU#21 stuck for 23s! [migration/21:242]
[5325461.319045] BUG: soft lockup - CPU#22 stuck for 23s! [migration/22:247]
[5325461.331058] BUG: soft lockup - CPU#23 stuck for 23s! [migration/23:252]
[5325461.407142] BUG: soft lockup - CPU#25 stuck for 22s! [migration/25:262]
[5325461.423160] BUG: soft lockup - CPU#26 stuck for 22s! [migration/26:267]
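For triage, the affected CPUs can be pulled out of a console capture like the one above with a short script. This is only a minimal sketch: the regular expression is an assumption based on the line format shown here, and `summarize_lockups` and `SAMPLE` are hypothetical names used for illustration.

```python
import re

# Assumed line format, taken from the console capture above:
# [<timestamp>] BUG: soft lockup - CPU#<n> stuck for <s>s! [<task>:<pid>]
LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] BUG: soft lockup - CPU#(?P<cpu>\d+) "
    r"stuck for (?P<secs>\d+)s! \[(?P<task>[^:\]]+):(?P<pid>\d+)\]"
)


def summarize_lockups(text):
    """Return {cpu: (seconds_stuck, task_name)} for each soft-lockup line."""
    result = {}
    for line in text.splitlines():
        m = LOCKUP_RE.search(line)
        if m:
            result[int(m.group("cpu"))] = (int(m.group("secs")), m.group("task"))
    return result


# Two lines copied from the capture above, used as sample input.
SAMPLE = """\
[5325460.730396] BUG: soft lockup - CPU#0 stuck for 22s! [migration/0:138]
[5325461.303028] BUG: soft lockup - CPU#21 stuck for 23s! [migration/21:242]
"""

if __name__ == "__main__":
    for cpu, (secs, task) in sorted(summarize_lockups(SAMPLE).items()):
        print(f"CPU {cpu}: stuck {secs}s in {task}")
```

Run against the full capture, this would show whether the lockups hit every CPU or only a subset, which can help when deciding if a single socket is suspect.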

A few breadcrumbs post reboot:

syslog:Apr  7 02:10:07 db2047 mcelog: Warning: cpu 0 offline?, imc_log not set#012: No such file or directory
syslog:Apr  7 02:10:07 db2047 mcelog: Warning: cpu 1 offline?, imc_log not set#012: No such file or directory
syslog:Apr  7 02:10:07 db2047 mcelog: Warning: cpu 2 offline?, imc_log not set#012: No such file or directory
syslog:Apr  7 02:10:07 db2047 mcelog: Warning: cpu 3 offline?, imc_log not set#012: No such file or directory
syslog:Apr  7 02:10:07 db2047 mcelog: Warning: cpu 4 offline?, imc_log not set#012: No such file or directory
syslog:Apr  7 02:10:07 db2047 mcelog: Warning: cpu 5 offline?, imc_log not set#012: No such file or directory
syslog:Apr  7 02:10:07 db2047 mcelog: Warning: cpu 6 offline?, imc_log not set#012: No such file or directory
syslog:Apr  7 02:10:07 db2047 mcelog: Warning: cpu 7 offline?, imc_log not set#012: No such file or directory
mcelog:mcelog: Warning: cpu 0 offline?, imc_log not set
mcelog:mcelog: Warning: cpu 1 offline?, imc_log not set
mcelog:mcelog: Warning: cpu 2 offline?, imc_log not set
mcelog:mcelog: Warning: cpu 3 offline?, imc_log not set
mcelog:mcelog: Warning: cpu 4 offline?, imc_log not set
mcelog:mcelog: Warning: cpu 5 offline?, imc_log not set
mcelog:mcelog: Warning: cpu 6 offline?, imc_log not set
mcelog:mcelog: Warning: cpu 7 offline?, imc_log not set
mcelog:mcelog: Warning: cpu 8 offline?, imc_log not set

Assigning to you, @Volans, so you see it in the morning and can take a look.

Event Timeline

Change 282103 had a related patch set uploaded (by Volans):
Depool crashed db2047, needs to be reimaged

https://gerrit.wikimedia.org/r/282103

Change 282103 merged by jenkins-bot:
Depool crashed db2047, needs to be reimaged

https://gerrit.wikimedia.org/r/282103

Mentioned in SAL [2016-04-07T07:04:45Z] <volans@tin> Synchronized wmf-config/db-codfw.php: Depool crashed db2047, needs to be reimaged T132011 (duration: 00m 38s)

Depooled from mediawiki-config. In any case the DB needs to be reimported, and given that it is on Trusty, it is better to reimage it too.
I'll check for any sign of hardware issues too.

Thanks @chasemp

Volans renamed this task from db2047 froze up and had to be hard rebooted (possible hardware error) to Reimage db2047 - check for hardware errors. Apr 9 2016, 11:11 AM
Volans updated the task description.

Mentioned in SAL [2016-04-09T11:12:17Z] <volans> Disabling tendril on db2047 (needs to be reimaged) to avoid flooding logs of tendril DB - T132011

jcrespo triaged this task as Medium priority.
jcrespo moved this task from Pending comment to In progress on the DBA board.

Change 285161 had a related patch set uploaded (by Jcrespo):
Depool db2068 for cloning to db2047

https://gerrit.wikimedia.org/r/285161

Change 285161 merged by Jcrespo:
Depool db2068 for cloning to db2047

https://gerrit.wikimedia.org/r/285161

Change 285164 had a related patch set uploaded (by Jcrespo):
Set default db2047 installer to jessie

https://gerrit.wikimedia.org/r/285164

Change 285164 merged by Jcrespo:
Set default db2047 installer to jessie

https://gerrit.wikimedia.org/r/285164

jcrespo subscribed.

After the reimage, I do not see anything significantly wrong:

  • RAID is OK
cciss_vol_status --verbose /dev/sg0
Controller: Smart Array P420i
  Board ID: 0x3354103c
  Logical drives: 1
  Running firmware: 6.00
  ROM firmware: 6.00
/dev/sda: (Smart Array P420i) RAID 1 Volume 0 status: OK. 
  Physical drives: 12
         connector 1I box 1 bay 1                 HP      EF0600FARNA                          6SL9LV0E0000N519014S     HPD6 OK
         connector 1I box 1 bay 2                 HP      EF0600FARNA                          6SL9LW100000N519D9SD     HPD6 OK
         connector 1I box 1 bay 3                 HP      EF0600FARNA                          6SL9LVY50000N519DA36     HPD6 OK
         connector 1I box 1 bay 4                 HP      EF0600FARNA                          6SL9LTVF0000N51901NA     HPD6 OK
         connector 1I box 1 bay 5                 HP      EF0600FARNA                          6SL9LTGP0000N5206YH7     HPD6 OK
         connector 1I box 1 bay 6                 HP      EF0600FARNA                          6SL9LSLS0000N519294A     HPD6 OK
         connector 1I box 1 bay 7                 HP      EF0600FARNA                          6SL9LWWB0000N52001KJ     HPD6 OK
         connector 1I box 1 bay 8                 HP      EF0600FARNA                          6SL9LVZN0000N5190AT1     HPD6 OK
         connector 1I box 1 bay 9                 HP      EF0600FARNA                          6SL9LTV10000N519010X     HPD6 OK
         connector 1I box 1 bay 10                 HP      EF0600FARNA                          6SL9LSAE0000N518AGQA     HPD6 OK
         connector 1I box 1 bay 11                 HP      EF0600FARNA                          6SL9LVYB0000N5190B50     HPD6 OK
         connector 1I box 1 bay 12                 HP      EF0600FARNA                          6SL9LVZ50000N5190B0N     HPD6 OK
/dev/sg0: (Smart Array P420i) Enclosure Gen8 ServBP 12+2 (S/N: FZ4ABP1151) on Bus 0, Physical Port 1I status: OK.
/dev/sg0(Smart Array P420i:0): Non-Volatile Cache status:
                   Cache configured: Yes
                  Read cache memory: 81 MiB
                 Write cache memory: 735 MiB
                Write cache enabled: Yes
   Flash backed cache present
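The same check can be scripted rather than read by eye. The sketch below scans `cciss_vol_status --verbose` output for any volume or physical drive whose status is not `OK`. It assumes the layout shown above, in particular that the drive status is the last whitespace-separated token on each `connector` line. `raid_problems` is a hypothetical helper, and the `FAILED` drive line in the sample is made up for illustration, not taken from this host.

```python
import re

# Assumed volume line format, as in the output above:
# /dev/sda: (Smart Array P420i) RAID 1 Volume 0 status: OK.
VOLUME_RE = re.compile(
    r"^(?P<dev>/dev/\S+): .*Volume (?P<vol>\d+) status: (?P<status>[^.]+)\."
)


def raid_problems(output):
    """Return a list of human-readable problems; empty if everything is OK."""
    problems = []
    for raw in output.splitlines():
        line = raw.strip()
        m = VOLUME_RE.match(line)
        if m:
            if m.group("status") != "OK":
                problems.append(
                    f"{m.group('dev')} volume {m.group('vol')}: {m.group('status')}"
                )
            continue
        if line.startswith("connector "):
            fields = line.split()
            status = fields[-1]  # assumption: status is the last token
            if status != "OK":
                bay = fields[fields.index("bay") + 1]
                problems.append(f"bay {bay}: {status}")
    return problems


# First two lines follow the real output; the last line is a made-up failure.
SAMPLE = """\
/dev/sda: (Smart Array P420i) RAID 1 Volume 0 status: OK.
    connector 1I box 1 bay 1   HP   EF0600FARNA   6SL9LV0E0000N519014S   HPD6 OK
    connector 1I box 1 bay 2   HP   EF0600FARNA   EXAMPLESERIAL   HPD6 FAILED
"""

if __name__ == "__main__":
    print(raid_problems(SAMPLE) or "RAID is OK")
```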

Non-worrying perf sampling:

[  789.056721] perf interrupt took too long (4559 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
[  816.962728] ip_tables: (C) 2000-2006 Netfilter Core Team
[  816.987149] nf_conntrack version 0.5.0 (32768 buckets, 262144 max)
[  890.426703] Process accounting resumed
[  890.456905] perf interrupt took too long (6425 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
[  944.495570] perf interrupt took too long (10922 > 10000), lowering kernel.perf_event_max_sample_rate to 12500
[  990.724063] ip6_tables: (C) 2000-2006 Netfilter Core Team
[ 5541.232176] perf interrupt took too long (20409 > 20000), lowering kernel.perf_event_max_sample_rate to 6250

SATA link down? Is this normal?

[    2.811955] ata1.01: failed to resume link (SControl 0)
[    2.850503] ata1.00: SATA link down (SStatus 0 SControl 300)
[    2.850529] ata1.01: SATA link down (SStatus 4 SControl 0)
[    2.850573] ata2.01: failed to resume link (SControl 0)
[    2.876378] ata2.00: SATA link down (SStatus 0 SControl 300)
[    2.876414] ata2.01: SATA link down (SStatus 4 SControl 0)

But nothing regarding the CPUs.

Waiting for @Volans's OK before pooling it.

jcrespo claimed this task.

I have repooled the server, but feel free to still give it a second check and reopen if you see something weird.