
Kernel errors on rendering hosts
Closed, ResolvedPublic

Event Timeline

Andrew raised the priority of this task to Needs Triage.
Andrew updated the task description.
Andrew added a project: SRE.
Andrew subscribed.

Lots of Icinga alerts for this host (mw1158). dmesg says:

[10660830.146766] kernel BUG at /build/buildd/linux-3.13.0/mm/memory.c:3756!
[10660830.153540] invalid opcode: 0000 [#38] SMP
[10660830.157993] Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6 xt_CT xt_pkttype iptable_raw ip6table_raw ip6table_filter ip6_tables xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables 8021q garp stp mrp llc intel_powerclamp coretemp kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd gpio_ich ipmi_devintf dcdbas lpc_ich acpi_power_meter shpchp joydev serio_raw i7core_edac edac_core ipmi_si mac_hid lp parport hid_generic psmouse usbhid hid pata_acpi bnx2
[10660830.209361] CPU: 0 PID: 27524 Comm: convert Tainted: G D 3.13.0-24-generic #47-Ubuntu
[10660830.218301] Hardware name: Dell Inc. PowerEdge R410/01V648, BIOS 1.5.3 10/22/2010
[10660830.226028] task: ffff8801a7140000 ti: ffff880101a0a000 task.ti: ffff880101a0a000
[10660830.233754] RIP: 0010:[<ffffffff81179051>] [<ffffffff81179051>] handle_mm_fault+0xe61/0xf10
[10660830.242448] RSP: 0000:ffff880101a0bd98 EFLAGS: 00010246
[10660830.248007] RAX: 0000000000000100 RBX: 00007feb39205148 RCX: ffff880101a0bb18
[10660830.255387] RDX: ffff8801a7140000 RSI: 0000000000000000 RDI: 80000001d7e009e6
[10660830.262766] RBP: ffff880101a0be20 R08: 0000000000000000 R09: 00000000000000a9
[10660830.270145] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880051d5fe48
[10660830.277525] R13: ffff8800358b8300 R14: ffff8801a557d080 R15: 0000000000000080
[10660830.284905] FS: 00007feb57105700(0000) GS:ffff88032fc00000(0000) knlGS:0000000000000000
[10660830.293237] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[10660830.299230] CR2: 00007feb39205148 CR3: 00000001a5f80000 CR4: 00000000000007f0
[10660830.306610] Stack:
[10660830.308874] ffff880101a0be20 0000000000000000 ffff880101a0bf20 0000000000000000
[10660830.316553] 0000000000000001 8000000000000867 ffffea0005649070 00000000000000a9
[10660830.324231] 80000001d130c867 ffff880159241ff8 ffff8800000000a9 0000000000000006
[10660830.331912] Call Trace:
[10660830.334614] [<ffffffff81721a24>] do_page_fault+0x184/0x560
[10660830.340647] [<ffffffff811112fc>] ? acct_account_cputime+0x1c/0x20
[10660830.347072] [<ffffffff8109d76b>] ? account_user_time+0x8b/0xa0
[10660830.353238] [<ffffffff8109dd84>] ? vtime_account_user+0x54/0x60
[10660830.359490] [<ffffffff81721e1a>] do_page_fault+0x1a/0x70
[10660830.365137] [<ffffffff8171e288>] page_fault+0x28/0x30
[10660830.370521] Code: ff 48 89 d9 4c 89 e2 4c 89 ee 4c 89 f7 44 89 4d c8 e8 34 c1 ff ff 85 c0 0f 85 94 f5 ff ff 49 8b 3c 24 44 8b 4d c8 e9 68 f3 ff ff <0f> 0b be 8e 00 00 00 48 c7 c7 18 25 a6 81 44 89 4d c8 e8 18 e7
[10660830.390232] RIP [<ffffffff81179051>] handle_mm_fault+0xe61/0xf10
[10660830.396577] RSP <ffff880101a0bd98>
[10660830.400316] ------------[ cut here ]------------
[10660830.400386] ---[ end trace a78a871b7413c02f ]---
[10660830.411146] kernel BUG at /build/buildd/linux-3.13.0/mm/memory.c:3756!
[10660830.417917] invalid opcode: 0000 [#39] SMP
[10660830.422366] Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6 xt_CT xt_pkttype iptable_raw ip6table_raw ip6table_filter ip6_tables xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables 8021q garp stp mrp llc intel_powerclamp coretemp kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd gpio_ich ipmi_devintf dcdbas lpc_ich acpi_power_meter shpchp joydev serio_raw i7core_edac edac_core ipmi_si mac_hid lp parport hid_generic psmouse usbhid hid pata_acpi bnx2
[10660830.473650] CPU: 2 PID: 27526 Comm: convert Tainted: G D 3.13.0-24-generic #47-Ubuntu
[10660830.482586] Hardware name: Dell Inc. PowerEdge R410/01V648, BIOS 1.5.3 10/22/2010
[10660830.490311] task: ffff8801a4502fe0 ti: ffff880124cf2000 task.ti: ffff880124cf2000
[10660830.498034] RIP: 0010:[<ffffffff81179051>] [<ffffffff81179051>] handle_mm_fault+0xe61/0xf10
[10660830.506721] RSP: 0000:ffff880124cf3d98 EFLAGS: 00010246
[10660830.512277] RAX: 0000000000000100 RBX: 00007feb39205188 RCX: ffff880124cf3b18
[10660830.519653] RDX: ffff8801a4502fe0 RSI: 0000000000000000 RDI: 80000001d7e009e6
[10660830.527032] RBP: ffff880124cf3e20 R08: 0000000000000000 R09: 00000000000000a9
[10660830.534409] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880051d5fe48
[10660830.541785] R13: ffff8800358b8300 R14: ffff8801a557d080 R15: 0000000000000080
[10660830.549163] FS: 00007feb4f103700(0000) GS:ffff88032fc20000(0000) knlGS:0000000000000000
[10660830.557492] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10660830.563481] CR2: 00007feb39205188 CR3: 00000001a5f80000 CR4: 00000000000007e0
[10660830.570858] Stack:
[10660830.573121] ffff880124cf3e20 0000000000000000 ffff880124cf3f20 0000000000000000
[10660830.580793] 0000000000000001 0000000000000000 0000000000000075 000000000000a89c
[10660830.588467] 0000000080000000 0000000000000000 00000000000000a9 0000000000000006
[10660830.596143] Call Trace:
[10660830.598841] [<ffffffff81721a24>] do_page_fault+0x184/0x560
[10660830.604833] [<ffffffff811112fc>] ? acct_account_cputime+0x1c/0x20
[10660830.611257] [<ffffffff8109d76b>] ? account_user_time+0x8b/0xa0
[10660830.617422] [<ffffffff8109dd84>] ? vtime_account_user+0x54/0x60
[10660830.623674] [<ffffffff81721e1a>] do_page_fault+0x1a/0x70
[10660830.629319] [<ffffffff8171e288>] page_fault+0x28/0x30
[10660830.634703] Code: ff 48 89 d9 4c 89 e2 4c 89 ee 4c 89 f7 44 89 4d c8 e8 34 c1 ff ff 85 c0 0f 85 94 f5 ff ff 49 8b 3c 24 44 8b 4d c8 e9 68 f3 ff ff <0f> 0b be 8e 00 00 00 48 c7 c7 18 25 a6 81 44 89 4d c8 e8 18 e7
[10660830.654318] RIP [<ffffffff81179051>] handle_mm_fault+0xe61/0xf10
[10660830.660665] RSP <ffff880124cf3d98>
[10660830.664418] ---[ end trace a78a871b7413c030 ]---
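
With the oops counter already at #38 and #39, it can help to summarize how often the same BUG site fires and which commands trigger it. The following is a minimal sketch, assuming the dmesg line format shown above; the function name `summarize_oopses` and both regular expressions are illustrative and not part of any existing tooling.

```python
import re
from collections import Counter

# Matches lines like:
#   [10660830.146766] kernel BUG at /build/buildd/linux-3.13.0/mm/memory.c:3756!
BUG_RE = re.compile(r"kernel BUG at (\S+?):(\d+)!")

# Matches report header lines like:
#   [10660830.209361] CPU: 0 PID: 27524 Comm: convert Tainted: ...
COMM_RE = re.compile(r"\bPID: (\d+) Comm: (\S+)")


def summarize_oopses(dmesg_text):
    """Count BUG sites (file, line) and the commands named in report headers."""
    sites = Counter()
    comms = Counter()
    for line in dmesg_text.splitlines():
        bug = BUG_RE.search(line)
        if bug:
            sites[(bug.group(1), int(bug.group(2)))] += 1
            continue
        comm = COMM_RE.search(line)
        if comm:
            comms[comm.group(2)] += 1
    return sites, comms
```

Feeding it the saved output of dmesg from a host returns two counters: one keyed by (source file, line number) and one keyed by command name. Note that the command counter includes every report header in the log, not only those belonging to a kernel BUG.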

This looks to be the same as https://phabricator.wikimedia.org/T107698

I'm going to update the kernel on that box and reboot, and repool if it seems happy.

Andrew renamed this task from Kernel errors on mw1158 to Kernel errors on rendering hosts. Nov 18 2015, 1:36 AM
Andrew set Security to None.

I ran apt-get install linux-headers-3.13.0-62 linux-headers-3.13.0-62-generic linux-image-3.13.0-62-generic linux-image-extra-3.13.0-62-generic linux-tools-3.13.0-62 linux-tools-3.13.0-62-generic on mw1158, rebooted, and repooled.

Now I've depooled mw1156 and will do the same.

mw1156 is now upgraded and seems ok. Moritz, I plan to upgrade the kernels of mw1153, mw1154, mw1155, mw1157, mw1159 and mw1160 tomorrow unless you disagree.

Andrew claimed this task.

All of the rendering servers (mw1153-mw1160) are now running 3.13.0-62-generic. Note that the buggy kernel (3.13.0-24-generic) is almost certainly still running on many other hosts in our cluster.
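
To find other hosts that may still be exposed, the kernel release string reported by each host (for example from uname -r) can be compared against the release installed here. This is a rough sketch only: it assumes that any 3.13.0 kernel with an ABI number below 62 is suspect, which this task does not establish (the release that first carried the fix is not stated). The names `parse_release` and `needs_upgrade` are hypothetical.

```python
import re

# Assumption: 3.13.0-62 is the release the rendering hosts were upgraded to.
# The ABI number that first contained the fix is not stated in this task.
KNOWN_GOOD = (3, 13, 0, 62)

# Matches the numeric prefix of strings like "3.13.0-24-generic".
RELEASE_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-(\d+)")


def parse_release(release):
    """Turn '3.13.0-24-generic' into the tuple (3, 13, 0, 24)."""
    match = RELEASE_RE.match(release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return tuple(int(part) for part in match.groups())


def needs_upgrade(release, known_good=KNOWN_GOOD):
    """True if release is in the same x.y.z series as known_good but older."""
    parsed = parse_release(release)
    return parsed[:3] == known_good[:3] and parsed[3] < known_good[3]
```

For example, `needs_upgrade("3.13.0-24-generic")` flags the kernel seen in the traces above, while `needs_upgrade("3.13.0-62-generic")` does not. Kernels from a different series are never flagged, since this task says nothing about them.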