Event Timeline
Lots of Icinga alerts for this host. dmesg says:
[10660830.146766] kernel BUG at /build/buildd/linux-3.13.0/mm/memory.c:3756!
[10660830.153540] invalid opcode: 0000 [#38] SMP
[10660830.157993] Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6 xt_CT xt_pkttype iptable_raw ip6table_raw ip6table_filter ip6_tables xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables 8021q garp stp mrp llc intel_powerclamp coretemp kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd gpio_ich ipmi_devintf dcdbas lpc_ich acpi_power_meter shpchp joydev serio_raw i7core_edac edac_core ipmi_si mac_hid lp parport hid_generic psmouse usbhid hid pata_acpi bnx2
[10660830.209361] CPU: 0 PID: 27524 Comm: convert Tainted: G D 3.13.0-24-generic #47-Ubuntu
[10660830.218301] Hardware name: Dell Inc. PowerEdge R410/01V648, BIOS 1.5.3 10/22/2010
[10660830.226028] task: ffff8801a7140000 ti: ffff880101a0a000 task.ti: ffff880101a0a000
[10660830.233754] RIP: 0010:[<ffffffff81179051>] [<ffffffff81179051>] handle_mm_fault+0xe61/0xf10
[10660830.242448] RSP: 0000:ffff880101a0bd98 EFLAGS: 00010246
[10660830.248007] RAX: 0000000000000100 RBX: 00007feb39205148 RCX: ffff880101a0bb18
[10660830.255387] RDX: ffff8801a7140000 RSI: 0000000000000000 RDI: 80000001d7e009e6
[10660830.262766] RBP: ffff880101a0be20 R08: 0000000000000000 R09: 00000000000000a9
[10660830.270145] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880051d5fe48
[10660830.277525] R13: ffff8800358b8300 R14: ffff8801a557d080 R15: 0000000000000080
[10660830.284905] FS: 00007feb57105700(0000) GS:ffff88032fc00000(0000) knlGS:0000000000000000
[10660830.293237] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[10660830.299230] CR2: 00007feb39205148 CR3: 00000001a5f80000 CR4: 00000000000007f0
[10660830.306610] Stack:
[10660830.308874] ffff880101a0be20 0000000000000000 ffff880101a0bf20 0000000000000000
[10660830.316553] 0000000000000001 8000000000000867 ffffea0005649070 00000000000000a9
[10660830.324231] 80000001d130c867 ffff880159241ff8 ffff8800000000a9 0000000000000006
[10660830.331912] Call Trace:
[10660830.334614] [<ffffffff81721a24>] do_page_fault+0x184/0x560
[10660830.340647] [<ffffffff811112fc>] ? acct_account_cputime+0x1c/0x20
[10660830.347072] [<ffffffff8109d76b>] ? account_user_time+0x8b/0xa0
[10660830.353238] [<ffffffff8109dd84>] ? vtime_account_user+0x54/0x60
[10660830.359490] [<ffffffff81721e1a>] do_page_fault+0x1a/0x70
[10660830.365137] [<ffffffff8171e288>] page_fault+0x28/0x30
[10660830.370521] Code: ff 48 89 d9 4c 89 e2 4c 89 ee 4c 89 f7 44 89 4d c8 e8 34 c1 ff ff 85 c0 0f 85 94 f5 ff ff 49 8b 3c 24 44 8b 4d c8 e9 68 f3 ff ff <0f> 0b be 8e 00 00 00 48 c7 c7 18 25 a6 81 44 89 4d c8 e8 18 e7
[10660830.390232] RIP [<ffffffff81179051>] handle_mm_fault+0xe61/0xf10
[10660830.396577] RSP <ffff880101a0bd98>
[10660830.400316] ------------[ cut here ]------------
[10660830.400386] ---[ end trace a78a871b7413c02f ]---
[10660830.411146] kernel BUG at /build/buildd/linux-3.13.0/mm/memory.c:3756!
[10660830.417917] invalid opcode: 0000 [#39] SMP
[10660830.422366] Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6 xt_CT xt_pkttype iptable_raw ip6table_raw ip6table_filter ip6_tables xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables 8021q garp stp mrp llc intel_powerclamp coretemp kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd gpio_ich ipmi_devintf dcdbas lpc_ich acpi_power_meter shpchp joydev serio_raw i7core_edac edac_core ipmi_si mac_hid lp parport hid_generic psmouse usbhid hid pata_acpi bnx2
[10660830.473650] CPU: 2 PID: 27526 Comm: convert Tainted: G D 3.13.0-24-generic #47-Ubuntu
[10660830.482586] Hardware name: Dell Inc. PowerEdge R410/01V648, BIOS 1.5.3 10/22/2010
[10660830.490311] task: ffff8801a4502fe0 ti: ffff880124cf2000 task.ti: ffff880124cf2000
[10660830.498034] RIP: 0010:[<ffffffff81179051>] [<ffffffff81179051>] handle_mm_fault+0xe61/0xf10
[10660830.506721] RSP: 0000:ffff880124cf3d98 EFLAGS: 00010246
[10660830.512277] RAX: 0000000000000100 RBX: 00007feb39205188 RCX: ffff880124cf3b18
[10660830.519653] RDX: ffff8801a4502fe0 RSI: 0000000000000000 RDI: 80000001d7e009e6
[10660830.527032] RBP: ffff880124cf3e20 R08: 0000000000000000 R09: 00000000000000a9
[10660830.534409] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880051d5fe48
[10660830.541785] R13: ffff8800358b8300 R14: ffff8801a557d080 R15: 0000000000000080
[10660830.549163] FS: 00007feb4f103700(0000) GS:ffff88032fc20000(0000) knlGS:0000000000000000
[10660830.557492] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10660830.563481] CR2: 00007feb39205188 CR3: 00000001a5f80000 CR4: 00000000000007e0
[10660830.570858] Stack:
[10660830.573121] ffff880124cf3e20 0000000000000000 ffff880124cf3f20 0000000000000000
[10660830.580793] 0000000000000001 0000000000000000 0000000000000075 000000000000a89c
[10660830.588467] 0000000080000000 0000000000000000 00000000000000a9 0000000000000006
[10660830.596143] Call Trace:
[10660830.598841] [<ffffffff81721a24>] do_page_fault+0x184/0x560
[10660830.604833] [<ffffffff811112fc>] ? acct_account_cputime+0x1c/0x20
[10660830.611257] [<ffffffff8109d76b>] ? account_user_time+0x8b/0xa0
[10660830.617422] [<ffffffff8109dd84>] ? vtime_account_user+0x54/0x60
[10660830.623674] [<ffffffff81721e1a>] do_page_fault+0x1a/0x70
[10660830.629319] [<ffffffff8171e288>] page_fault+0x28/0x30
[10660830.634703] Code: ff 48 89 d9 4c 89 e2 4c 89 ee 4c 89 f7 44 89 4d c8 e8 34 c1 ff ff 85 c0 0f 85 94 f5 ff ff 49 8b 3c 24 44 8b 4d c8 e9 68 f3 ff ff <0f> 0b be 8e 00 00 00 48 c7 c7 18 25 a6 81 44 89 4d c8 e8 18 e7
[10660830.654318] RIP [<ffffffff81179051>] handle_mm_fault+0xe61/0xf10
[10660830.660665] RSP <ffff880124cf3d98>
[10660830.664418] ---[ end trace a78a871b7413c030 ]---
This looks to be the same as https://phabricator.wikimedia.org/T107698
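One way to check that repeated oopses share a signature, as the two reports above do (same BUG location in mm/memory.c, same faulting function), is to reduce each report to a small tuple and count duplicates. The following is only a rough sketch, not a tool used in this task: the function name `oops_signatures` and both regular expressions are made up for illustration and assume the dmesg layout shown above.

```python
import re
from collections import Counter

# Hypothetical patterns, written against the dmesg layout pasted above.
BUG_RE = re.compile(r"kernel BUG at (\S+?):(\d+)!")
# Matches the summary line "RIP [<addr>] symbol+0x.../0x...", but not the
# earlier "RIP: 0010:..." register line (that one has a colon after RIP).
RIP_RE = re.compile(r"RIP\s+\[<[0-9a-f]+>\]\s+(\w+)\+")


def oops_signatures(dmesg_text):
    """Count oops reports by (source file, line number, faulting symbol)."""
    signatures = Counter()
    current_bug = None
    for line in dmesg_text.splitlines():
        bug = BUG_RE.search(line)
        if bug:
            # Start of a new report: remember where the BUG fired.
            current_bug = (bug.group(1), int(bug.group(2)))
            continue
        rip = RIP_RE.search(line)
        if rip and current_bug is not None:
            # The RIP summary line closes the report; record its signature.
            signatures[current_bug + (rip.group(1),)] += 1
            current_bug = None
    return signatures
```

Fed the output of `dmesg` as a single string, this returns one counter entry per distinct signature, so a host that keeps hitting the same BUG shows up as a single key with a high count.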
I'm going to update the kernel on that box and reboot, and repool if it seems happy.
I ran apt-get install linux-headers-3.13.0-62 linux-headers-3.13.0-62-generic linux-image-3.13.0-62-generic linux-image-extra-3.13.0-62-generic linux-tools-3.13.0-62 linux-tools-3.13.0-62-generic on mw1158, rebooted, and repooled.
Now I've depooled mw1156 and will do the same.
mw1156 is now upgraded and seems OK. Moritz, I plan to upgrade the kernels on mw1153, mw1154, mw1155, mw1157, mw1159 and mw1160 tomorrow unless you disagree.
All of the rendering servers (mw1153-mw1160) are now running 3.13.0-62-generic. Note that the buggy kernel (3.13.0-24-generic) is almost certainly still running on many other hosts in our cluster.
Auditing the other servers will be handled in https://phabricator.wikimedia.org/T119411
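For that kind of audit, a host's `uname -r` output can be compared against the release the rendering servers were moved to. The sketch below is illustrative only: the helper names and the example host names are hypothetical. It also assumes that every 3.13.0 ABI below 62 should be flagged, which this task does not establish. Only 3.13.0-24 was seen crashing here, and 3.13.0-62 was the upgrade target.

```python
import re

# Release the rendering servers were upgraded to in this task.
TARGET_RELEASE = "3.13.0-62-generic"


def kernel_key(release):
    """Turn an Ubuntu-style release like '3.13.0-24-generic' into a sortable tuple."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def needs_upgrade(release, target=TARGET_RELEASE):
    """True if `release` sorts before `target` (assumption: older means affected)."""
    return kernel_key(release) < kernel_key(target)


def hosts_to_upgrade(releases_by_host, target=TARGET_RELEASE):
    """Return the sorted host names whose reported release is older than `target`."""
    return sorted(
        host
        for host, release in releases_by_host.items()
        if needs_upgrade(release, target)
    )


if __name__ == "__main__":
    # Made-up inventory; real data would come from running `uname -r` on each host.
    inventory = {
        "example-host-a": "3.13.0-24-generic",
        "example-host-b": "3.13.0-62-generic",
    }
    print(hosts_to_upgrade(inventory))
```

Comparing parsed tuples rather than raw strings avoids mis-sorting releases such as 3.13.0-100 against 3.13.0-62.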