
cp4007 crashed
Closed, Resolved · Public

Description

This is a ulsfo upload Varnish server. It's depooled, and chasemp had to reboot it from the console because it was unresponsive. It appears to be up and OK, but I want to wait (in case it crashes again) and/or dig a bit before repooling.

syslog from crash:

Nov  4 16:51:24 cp4007 kernel: [14602465.718794] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
Nov  4 16:51:24 cp4007 kernel: [14602465.727837] IP: [<ffffffff814900c8>] inet_getpeer+0x58/0x120
Nov  4 16:51:24 cp4007 kernel: [14602465.734450] PGD 0
Nov  4 16:51:24 cp4007 kernel: [14602465.736990] Oops: 0000 [#1] SMP
Nov  4 16:51:24 cp4007 kernel: [14602465.740894] Modules linked in: xfrm4_mode_transport seqiv esp6 xfrm6_mode_transport xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo dm_mod binfmt_misc cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative 8021q garp mrp stp llc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc x86_pkg_temp_thermal intel_powerclamp coretemp joydev kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iTCO_wdt ipmi_devintf hid_generic iTCO_vendor_support evdev dcdbas aesni_intel ttm aes_x86_64 lrw gf128mul drm_kms_helper glue_helper ablk_helper cryptd usbhid drm pcspkr sb_edac i2c_algo_bit hid edac_core i2c_core ipmi_si lpc_ich tpm_tis mfd_core tpm ipmi_msghandler 8250_fintek acpi_power_meter wmi acpi_pad button mei_me shpchp processor mei thermal_sys autofs4 ext4 crc16 mbcache jbd2 raid1 md_mod sg sd_mod ahci bnx2x ehci_pci libahci ptp ehci_hcd pps_core libata mdio crc32c_generic usbcore scsi_mod crc32c_intel usb_common libcrc32c
Nov  4 16:51:24 cp4007 kernel: [14602465.838488] CPU: 4 PID: 20543 Comm: nginx Tainted: G        W      3.19.0-1-amd64 #1 Debian 3.19.6-1
Nov  4 16:51:24 cp4007 kernel: [14602465.848966] Hardware name: Dell Inc. PowerEdge R620/0PXXHP, BIOS 1.6.0 03/07/2013
Nov  4 16:51:24 cp4007 kernel: [14602465.857603] task: ffff88194ed81270 ti: ffff8801623dc000 task.ti: ffff8801623dc000
Nov  4 16:51:24 cp4007 kernel: [14602465.866237] RIP: 0010:[<ffffffff814900c8>]  [<ffffffff814900c8>] inet_getpeer+0x58/0x120
Nov  4 16:51:24 cp4007 kernel: [14602465.875564] RSP: 0018:ffff8801623df828  EFLAGS: 00010246
Nov  4 16:51:24 cp4007 kernel: [14602465.881777] RAX: 0000000000000000 RBX: 000000000000000a RCX: 0000000000000000
Nov  4 16:51:24 cp4007 kernel: [14602465.890024] RDX: 0000000000000001 RSI: ffff8801623df848 RDI: ffff881810b92198
Nov  4 16:51:24 cp4007 kernel: [14602465.898271] RBP: 000000000a0d76b6 R08: 0000000000000004 R09: 0000000042060126
Nov  4 16:51:24 cp4007 kernel: [14602465.906517] R10: 0000000000000000 R11: 000000000000001c R12: ffff881810b92199
Nov  4 16:51:24 cp4007 kernel: [14602465.914763] R13: ffff8801623dfb84 R14: ffffffff818c9e80 R15: ffff880169154240
Nov  4 16:51:24 cp4007 kernel: [14602465.923009] FS:  00007f9b89d04700(0000) GS:ffff88181fc40000(0000) knlGS:0000000000000000
Nov  4 16:51:24 cp4007 kernel: [14602465.932321] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov  4 16:51:24 cp4007 kernel: [14602465.939017] CR2: 0000000000000010 CR3: 000000017967b000 CR4: 00000000000407e0
Nov  4 16:51:24 cp4007 kernel: [14602465.947263] Stack:
Nov  4 16:51:24 cp4007 kernel: [14602465.949792]  ffffffff818ca3c0 ffff881066875200 ffffffff81677641 ffffffff81508d6c
Nov  4 16:51:24 cp4007 kernel: [14602465.958365]  20f7014342060126 f6ba895be9984541 ffff88016915000a ffff880169154240
Nov  4 16:51:24 cp4007 kernel: [14602465.966934]  ffff881810b92140 ffff881066875200 ffff880169154240 ffffffff81507f2a
Nov  4 16:51:24 cp4007 kernel: [14602465.975510] Call Trace:
Nov  4 16:51:24 cp4007 kernel: [14602465.978532]  [<ffffffff81508d6c>] ? ipv6_cow_metrics+0xfc/0x160
Nov  4 16:51:24 cp4007 kernel: [14602465.985422]  [<ffffffff81507f2a>] ? ip6_rt_copy+0x25a/0x280
Nov  4 16:51:24 cp4007 kernel: [14602465.991925]  [<ffffffff815090ee>] ? ip6_pol_route.isra.41+0x2ae/0x410
Nov  4 16:51:24 cp4007 kernel: [14602465.999398]  [<ffffffff81509280>] ? ip6_pol_route_input+0x30/0x30
Nov  4 16:51:24 cp4007 kernel: [14602466.006488]  [<ffffffff81530e30>] ? fib6_rule_action+0xb0/0x1f0
Nov  4 16:51:24 cp4007 kernel: [14602466.013380]  [<ffffffff81474473>] ? fib_rules_lookup+0xe3/0x160
Nov  4 16:51:24 cp4007 kernel: [14602466.020272]  [<ffffffff81530fb3>] ? fib6_rule_lookup+0x43/0x80
Nov  4 16:51:24 cp4007 kernel: [14602466.027069]  [<ffffffff81509280>] ? ip6_pol_route_input+0x30/0x30
Nov  4 16:51:24 cp4007 kernel: [14602466.034154]  [<ffffffff814f8f5b>] ? ip6_dst_lookup_tail+0x27b/0x2a0
Nov  4 16:51:24 cp4007 kernel: [14602466.041444]  [<ffffffff81459965>] ? validate_xmit_skb.isra.92.part.93+0x15/0x2f0
Nov  4 16:51:24 cp4007 kernel: [14602466.049982]  [<ffffffff814f8fcc>] ? ip6_dst_lookup_flow+0x2c/0x80
Nov  4 16:51:24 cp4007 kernel: [14602466.057070]  [<ffffffff81529aae>] ? inet6_csk_route_socket+0x11e/0x180
Nov  4 16:51:24 cp4007 kernel: [14602466.064638]  [<ffffffff81529b44>] ? inet6_csk_xmit+0x34/0xc0
Nov  4 16:51:24 cp4007 kernel: [14602466.071239]  [<ffffffff814adcb5>] ? tcp_transmit_skb+0x495/0x950
Nov  4 16:51:24 cp4007 kernel: [14602466.078228]  [<ffffffff814ae35c>] ? tcp_write_xmit+0x1ec/0xd50
Nov  4 16:51:24 cp4007 kernel: [14602466.085022]  [<ffffffff81446900>] ? __alloc_skb+0x50/0x1f0
Nov  4 16:51:24 cp4007 kernel: [14602466.091428]  [<ffffffff814af11a>] ? __tcp_push_pending_frames+0x2a/0xc0
Nov  4 16:51:24 cp4007 kernel: [14602466.099093]  [<ffffffff814a1ed5>] ? tcp_sendmsg+0x6e5/0xd00
Nov  4 16:51:24 cp4007 kernel: [14602466.105597]  [<ffffffff8143ca23>] ? sock_aio_write+0x113/0x130
Nov  4 16:51:24 cp4007 kernel: [14602466.112392]  [<ffffffff811be54f>] ? do_sync_write+0x5f/0x90
Nov  4 16:51:24 cp4007 kernel: [14602466.118899]  [<ffffffff811bef55>] ? vfs_write+0x175/0x1f0
Nov  4 16:51:24 cp4007 kernel: [14602466.125202]  [<ffffffff811bfa22>] ? SyS_write+0x42/0xb0
Nov  4 16:51:24 cp4007 kernel: [14602466.131318]  [<ffffffff81203794>] ? SyS_epoll_wait+0xb4/0xe0
Nov  4 16:51:24 cp4007 kernel: [14602466.137919]  [<ffffffff8155108d>] ? system_call_fast_compare_end+0xc/0x11
Nov  4 16:51:24 cp4007 kernel: [14602466.145777] Code: 00 eb 12 41 83 eb 01 48 8b 09 74 38 48 81 f9 80 ad 67 81 74 2f 45 31 c0 66 83 fb 02 41 0f 95 c0 31 c0 47 8d 44 40 01 44 8b 0c 86 <44> 8b 54 81 10 45 39 d1 74 26 45 39 ca 77 c9 41 83 eb 01 48 8b
Nov  4 16:51:24 cp4007 kernel: [14602466.167649] RIP  [<ffffffff814900c8>] inet_getpeer+0x58/0x120
Nov  4 16:51:24 cp4007 kernel: [14602466.174356]  RSP <ffff8801623df828>
Nov  4 16:51:24 cp4007 kernel: [14602466.178532] CR2: 0000000000000010
Nov  4 16:51:24 cp4007 kernel: [14602466.183229] ---[ end trace 6f405580d850d9ee ]---
[ still other misc normal syslog lines through 16:56:01 - then no data until reboot ]
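For comparing this oops against other crashes, it can help to reduce the syslog lines to the faulting symbol and the list of call-trace frames. The following is a minimal sketch, not an established tool: the names `FRAME_RE` and `parse_oops` are hypothetical, and the regular expression only assumes the `[<address>] symbol+0xOFF/0xSIZE` frame layout visible in the log above.

```python
import re

# Matches frames such as "[<ffffffff81508d6c>] ? ipv6_cow_metrics+0xfc/0x160".
# The optional "?" marks a frame the kernel's stack walker flagged as unreliable.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]{16})>\]\s+(?P<q>\?\s+)?"
    r"(?P<sym>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_oops(lines):
    """Return (faulting_frame, call_trace) extracted from oops syslog lines."""
    fault, trace, in_trace = None, [], False
    for line in lines:
        if "Call Trace:" in line:
            in_trace = True
            continue
        m = FRAME_RE.search(line)
        if m is None:
            # Any line without a frame (e.g. "Code: ...") ends the trace.
            in_trace = False
            continue
        frame = {
            "symbol": m.group("sym"),
            "offset": int(m.group("off"), 16),
            "size": int(m.group("size"), 16),
            "unreliable": m.group("q") is not None,
        }
        if in_trace:
            trace.append(frame)
        elif fault is None and re.search(r"\bR?IP:", line):
            # The first "IP:" / "RIP:" line names the faulting instruction.
            fault = frame
    return fault, trace


# A few lines copied verbatim from the syslog excerpt above.
sample = [
    "Nov  4 16:51:24 cp4007 kernel: [14602465.727837] IP: [<ffffffff814900c8>] inet_getpeer+0x58/0x120",
    "Nov  4 16:51:24 cp4007 kernel: [14602465.975510] Call Trace:",
    "Nov  4 16:51:24 cp4007 kernel: [14602465.978532]  [<ffffffff81508d6c>] ? ipv6_cow_metrics+0xfc/0x160",
    "Nov  4 16:51:24 cp4007 kernel: [14602465.985422]  [<ffffffff81507f2a>] ? ip6_rt_copy+0x25a/0x280",
    "Nov  4 16:51:24 cp4007 kernel: [14602465.991925]  [<ffffffff815090ee>] ? ip6_pol_route.isra.41+0x2ae/0x410",
]

fault, trace = parse_oops(sample)
print(fault["symbol"], hex(fault["offset"]))
print([f["symbol"] for f in trace])
```

Feeding the full excerpt through the same function would list every frame from `ipv6_cow_metrics` down to `system_call_fast_compare_end`, which makes it easier to search for the same path in other hosts' logs.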

Event Timeline

BBlack raised the priority of this task from to Needs Triage.
BBlack updated the task description.
BBlack added a project: Traffic.
BBlack subscribed.
Restricted Application added subscribers: StudiesWorld, Aklapper.

I did some digging, but couldn't identify an obvious commit which might explain this.
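One low-level cross-check that can accompany this kind of digging is to hand-decode the faulting instruction from the `Code:` line (the bytes starting at the `<44>` marker) and confirm that its memory operand matches CR2. The sketch below is illustrative only: the helper name `effective_address` is hypothetical, it handles just the single encoding form seen here (REX prefix, opcode `8b`, ModRM with an 8-bit displacement and a SIB byte), and the register values are copied from the dump in the description. It shows where the bad address came from, not why the pointer was NULL.

```python
# 64-bit register names indexed by their 3-bit encoding (no REX extension).
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def effective_address(insn, regs):
    """Decode 'REX 8b ModRM SIB disp8' and return (operand_text, address)."""
    rex, opcode, modrm, sib, disp8 = insn
    # Only this narrow form is supported: REX without X/B bits, MOV r32, r/m32.
    if rex & 0xF0 != 0x40 or rex & 0x03 or opcode != 0x8B:
        raise ValueError("unsupported prefix/opcode")
    mod, rm = modrm >> 6, modrm & 7
    if mod != 1 or rm != 4:
        raise ValueError("expected disp8 addressing with a SIB byte")
    scale = 1 << (sib >> 6)
    index, base = (sib >> 3) & 7, sib & 7
    if index == 4:
        raise ValueError("SIB without index register is not handled")
    if disp8 >= 0x80:
        disp8 -= 0x100  # the displacement byte is sign-extended
    text = f"[{REG64[base]} + {REG64[index]}*{scale} + {disp8:#x}]"
    addr = regs[REG64[base]] + regs[REG64[index]] * scale + disp8
    return text, addr & 0xFFFFFFFFFFFFFFFF


# Bytes at the <44> marker of the Code: line, and registers from the oops.
insn = bytes.fromhex("448b548110")
regs = {"rax": 0x0, "rcx": 0x0}

operand, addr = effective_address(insn, regs)
print(operand, hex(addr))
```

With RAX and RCX both zero in the dump, the computed address equals the CR2 value `0000000000000010`, which is consistent with a read at offset 0x10 from a NULL base pointer inside `inet_getpeer`.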

Independent of that, I'm planning to update our kernel to the latest 3.19.8-ckt9 on Friday; it brings a couple of IPv6-related patches.

Re-pooling on the assumption that this was a rare kernel bug; it has been stable since.

[cp4007:~] $ uptime
22:22:20 up 21 days, 5:08, 1 user, load average: 3.15, 3.65, 3.52
[cp4007:~] $ uname -a
Linux cp4007 3.19.0-1-amd64 #1 SMP Debian 3.19.3-7 (2015-07-20) x86_64 GNU/Linux

It has been stable for 3 weeks, and the 3.19.8 update has not happened yet, so it's also not the case that the update fixed an issue that would otherwise have recurred more often.

Dzahn claimed this task.
Dzahn reassigned this task from Dzahn to MoritzMuehlenhoff.
Dzahn set Security to None.

Agreed. Also, the cp* hosts will be updated to 4.3 anyway.