Page MenuHomePhabricator

Investigate need and candidate for labstore100(1|2) kernel upgrade
Closed, DuplicatePublic

Description

We have several crashes in the near past that are suspected because of an unknown kernel bug.

Dec 16 04:46:05 labstore1001 kernel: [4198858.858163] INFO: rcu_sched self-detected stall on CPU { 6}  (t=5251 jiffies g=572267666 c=572267665 q=1115175)
Dec 16 04:46:05 labstore1001 kernel: [4198858.866170] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 0, t=5252 jiffies, g=572267666, c=572267665, q=1115175)
Dec 16 04:46:05 labstore1001 kernel: [4198858.866170] Task dump for CPU 6:
Dec 16 04:46:05 labstore1001 kernel: [4198858.866173] kworker/6:2     R  running task        0  9169      2 0x00000008
Dec 16 04:46:05 labstore1001 kernel: [4198858.866194] Workqueue: kcopyd do_work [dm_mod]
Dec 16 04:46:05 labstore1001 kernel: [4198858.866196]  ffff88081a5ae800 ffff88081aba5de0 ffff88081a5ae800 ffff88081aba5de0
Dec 16 04:46:05 labstore1001 kernel: [4198858.866197]  ffff8807082fb900 0000000000000286 ffffffffa00bbff1 ffff88041c1aa000
Dec 16 04:46:05 labstore1001 kernel: [4198858.866198]  ffff880400001000 0000000000000001 ffff88081a5ae820 0000000000000001
Dec 16 04:46:05 labstore1001 kernel: [4198858.866199] Call Trace:
Dec 16 04:46:05 labstore1001 kernel: [4198858.866207]  [<ffffffffa00bbff1>] ? __make_request+0xc11/0xe20 [raid10]
Dec 16 04:46:05 labstore1001 kernel: [4198858.866213]  [<ffffffff8109b847>] ? select_idle_sibling+0x27/0x120
Dec 16 04:46:05 labstore1001 kernel: [4198858.866215]  [<ffffffff8109eaf9>] ? enqueue_task_fair+0x309/0xf40
Dec 16 04:46:05 labstore1001 kernel: [4198858.866220]  [<ffffffff81155d88>] ? free_one_page+0x78/0x480
Dec 16 04:46:05 labstore1001 kernel: [4198858.866224]  [<ffffffff8101d1d6>] ? native_sched_clock+0x26/0x90
Dec 16 04:46:05 labstore1001 kernel: [4198858.866226]  [<ffffffff8101d245>] ? sched_clock+0x5/0x10
Dec 16 04:46:05 labstore1001 kernel: [4198858.866229]  [<ffffffffa0563c50>] ? copy_callback+0x40/0x100 [dm_snapshot]
Dec 16 04:46:05 labstore1001 kernel: [4198858.866231]  [<ffffffffa0563c10>] ? complete_exception+0x50/0x50 [dm_snapshot]
Dec 16 04:46:05 labstore1001 kernel: [4198858.866235]  [<ffffffffa00cdd71>] ? run_complete_job+0x61/0xb0 [dm_mod]
Dec 16 04:53:09 labstore1001 kernel: [4199283.625421] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [kworker/6:2:9169]
Dec 16 04:53:37 labstore1001 kernel: [4199311.647538] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [kworker/6:2:9169]
Dec 16 04:54:13 labstore1001 kernel: [4199347.675975] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [kworker/6:2:9169]
Dec 16 04:55:57 labstore1001 kernel: [4199451.758125] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [kworker/6:2:9169]
Dec 16 04:57:41 labstore1001 kernel: [4199555.840273] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [kworker/6:2:9169]
Dec 16 04:58:09 labstore1001 kernel: [4199583.862390] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [kworker/6:2:9169]
Dec 16 04:59:33 labstore1001 kernel: [4199667.928740] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [kworker/6:2:9169]
Dec 16 05:01:13 labstore1001 kernel: [4199768.007731] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [kworker/6:2:9169]
Dec 16 05:01:41 labstore1001 kernel: [4199796.029849] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [kworker/6:2:9169]
Dec 16 05:04:09 labstore1001 kernel: [4199944.146753] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [kworker/6:2:9169]
Dec 16 05:06:05 labstore1001 kernel: [4200060.238384] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [kworker/6:2:9169]
Dec 16 05:08:53 labstore1001 kernel: [4200228.371088] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [kworker/6:2:9169

Event Timeline

chasemp raised the priority of this task from to High.
chasemp updated the task description. (Show Details)
faidon subscribed.

Have we seen those in labstore1002? The ones I see in labstore1001 right now all look like being I/O related (I/O being slow/getting stuck), I wouldn't be so quick to call it a kernel bug.

I wasn't looking for it on labstore1002 thinking whatever weird condition is in question here would show up on the effective server, I don't know if this a sane proposition or not really. The kernel bug bit was part of the summary from https://wikitech.wikimedia.org/wiki/Incident_documentation/20151216-Labs-NFS. I'm not sure who wrote likely a Linux kernel bug. I specified both servers here as my thought was to keep them consistent.

We (me/yuvi/mark/andrew) had a post-all-the-things meetings on monday that I think you missed where the general consensus was upgrading the kernel was a reasonable step and worth doing.

I spoke with moritz and he said "I installed the latest 3.19 kernel on labstore1002, only needs a reboot now".

Let me know what you want to do here, maybe this task is a noop then.

We also seem to get kernel backtraces every day at 01:00 (seems like LVM snapshot related, probably some periodic job?):

Dec 22 01:00:15 labstore1001 kernel: [492385.744185] ------------[ cut here ]------------
Dec 22 01:00:15 labstore1001 kernel: [492385.744197] WARNING: CPU: 7 PID: 32619 at /tmp/buildd/linux-3.19.3/kernel/workqueue.c:4189 destroy_workqueue+0xa0/0x230()
Dec 22 01:00:15 labstore1001 kernel: [492385.744199] Modules linked in: quota_v2 quota_tree dm_snapshot dm_bufio binfmt_misc 8021q garp mrp stp llc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc intel_powerclamp coretemp kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper iTCO_wdt iTCO_vendor_support ablk_helper ttm cryptd drm_kms_helper ipmi_devintf joydev psmouse evdev serio_raw pcspkr dcdbas drm i2c_algo_bit i2c_core lpc_ich ipmi_si i7core_edac acpi_power_meter mfd_core ipmi_msghandler edac_core tpm_tis 8250_fintek tpm button processor shpchp thermal_sys autofs4 hid_generic usbhid hid ext4 crc16 mbcache jbd2 dm_mod raid1 raid10 md_mod sg sd_mod uhci_hcd ehci_pci ehci_hcd megaraid_sas usbcore scsi_mod crc32c_intel usb_common bnx2
Dec 22 01:00:15 labstore1001 kernel: [492385.744265] CPU: 7 PID: 32619 Comm: lvremove Tainted: G        W      3.19.0-1-amd64 #1 Debian 3.19.3-7
Dec 22 01:00:15 labstore1001 kernel: [492385.744267] Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 1.9.0 10/21/2011
Dec 22 01:00:15 labstore1001 kernel: [492385.744269]  0000000000000000 ffffffff8171e2b8 ffffffff8154ca3b 0000000000000000
Dec 22 01:00:15 labstore1001 kernel: [492385.744272]  ffffffff8106d001 ffff8800cadfac00 ffff8800cadfac20 ffff88041c05de00
Dec 22 01:00:15 labstore1001 kernel: [492385.744275]  ffff88041c05de00 ffffffffa01198e0 ffffffff81083f50 ffff88002766d600
Dec 22 01:00:15 labstore1001 kernel: [492385.744278] Call Trace:
Dec 22 01:00:15 labstore1001 kernel: [492385.744286]  [<ffffffff8154ca3b>] ? dump_stack+0x40/0x50
Dec 22 01:00:15 labstore1001 kernel: [492385.744292]  [<ffffffff8106d001>] ? warn_slowpath_common+0x81/0xb0
Dec 22 01:00:15 labstore1001 kernel: [492385.744302]  [<ffffffffa01198e0>] ? table_load+0x340/0x340 [dm_mod]
Dec 22 01:00:15 labstore1001 kernel: [492385.744306]  [<ffffffff81083f50>] ? destroy_workqueue+0xa0/0x230
Dec 22 01:00:15 labstore1001 kernel: [492385.744313]  [<ffffffffa011bfee>] ? dm_kcopyd_client_destroy+0x9e/0xf0 [dm_mod]
Dec 22 01:00:15 labstore1001 kernel: [492385.744317]  [<ffffffff8155072e>] ? down_write+0xe/0x40
Dec 22 01:00:15 labstore1001 kernel: [492385.744322]  [<ffffffffa054d3c0>] ? snapshot_dtr+0xb0/0x170 [dm_snapshot]
Dec 22 01:00:15 labstore1001 kernel: [492385.744330]  [<ffffffffa0115f50>] ? dm_table_destroy+0x70/0x130 [dm_mod]
Dec 22 01:00:15 labstore1001 kernel: [492385.744337]  [<ffffffffa01198e0>] ? table_load+0x340/0x340 [dm_mod]
Dec 22 01:00:15 labstore1001 kernel: [492385.744344]  [<ffffffffa01199c2>] ? dev_suspend+0xe2/0x260 [dm_mod]
Dec 22 01:00:15 labstore1001 kernel: [492385.744350]  [<ffffffffa011a2c3>] ? ctl_ioctl+0x203/0x4c0 [dm_mod]
Dec 22 01:00:15 labstore1001 kernel: [492385.744359]  [<ffffffffa011a593>] ? dm_ctl_ioctl+0x13/0x20 [dm_mod]
Dec 22 01:00:15 labstore1001 kernel: [492385.744363]  [<ffffffff811d2ea8>] ? do_vfs_ioctl+0x2e8/0x4f0
Dec 22 01:00:15 labstore1001 kernel: [492385.744368]  [<ffffffff8123e478>] ? ipcget+0x138/0x1c0
Dec 22 01:00:15 labstore1001 kernel: [492385.744371]  [<ffffffff811d3131>] ? SyS_ioctl+0x81/0xa0
Dec 22 01:00:15 labstore1001 kernel: [492385.744375]  [<ffffffff8155244d>] ? system_call_fast_compare_end+0xc/0x11
Dec 22 01:00:15 labstore1001 kernel: [492385.744378] ---[ end trace 80567c43fadf732c ]---

…so a kernel upgrade doesn't sound like a bad idea in general.

I'm not sure if an upgrade to the latest 3.19 would be enough though, but it's at least a first step — depends on how much you want to minimize the amount of reboots. I don't have any objections in general, I'm just pointing out that we should keep an eye out for other issues (e.g. disk/hardware ones) while troubleshooting this.

4.4 should be available in jessie-wikimedia early next week

fgiunchedi subscribed.

adding labs too, ATM this is the situation kernel-wise:

$ ssh labstore1001.eqiad.wmnet uname -a
Linux labstore1001 3.19.0-1-amd64 #1 SMP Debian 3.19.3-7 (2015-07-20) x86_64 GNU/Linux
$ ssh labstore1002.eqiad.wmnet uname -a
Linux labstore1002 3.19.0-2-amd64 #1 SMP Debian 3.19.3-9 (2016-01-04) x86_64 GNU/Linux

anything to be done?

We are basically in a holding pattern as we (...well me I guess) tries to get labstore2003/2004 going so we can shift load so we can do stuff like this. This has fallen by the wayside partially as it will be a consequence of reimage I imagine already and it hasn't happened in months.

Both hosts should be migrated to Linux 4.4, 3.19 is deprecated at this point.

This follow-up task from an incident report has not been updated recently. If it is no longer valid, please add a comment explaining why. If it is still valid, please prioritize it appropriately relative to your other work. If you have any questions, feel free to ask me (Greg Grossmeier).