
cp3057 crash (was: network down)
Open, Medium, Public

Description

Icinga:

[11:17] <icinga-wm> PROBLEM - Host cp3057 is DOWN: PING CRITICAL - Packet loss = 100%

I first suspected T238305, but after connecting through the serial port the host seems up and responsive, so this is probably a network problem?

root@cp3058:~$ ping cp3057.esams.wmnet
PING cp3057.esams.wmnet(cp3057.esams.wmnet (2620:0:862:102:10:20:0:57)) 56 data bytes
From cp3058.esams.wmnet (2620:0:862:102:10:20:0:58) icmp_seq=1 Destination unreachable: Address unreachable
From cp3058.esams.wmnet (2620:0:862:102:10:20:0:58) icmp_seq=2 Destination unreachable: Address unreachable
From cp3058.esams.wmnet (2620:0:862:102:10:20:0:58) icmp_seq=3 Destination unreachable: Address unreachable
From cp3058.esams.wmnet (2620:0:862:102:10:20:0:58) icmp_seq=4 Destination unreachable: Address unreachable
^C
--- cp3057.esams.wmnet ping statistics ---
5 packets transmitted, 0 received, +4 errors, 100% packet loss, time 4053ms

root@cp3058:~$ ping -4 cp3057.esams.wmnet
PING cp3057.esams.wmnet (10.20.0.57) 56(84) bytes of data.
From cp3058.esams.wmnet (10.20.0.58) icmp_seq=1 Destination Host Unreachable
From cp3058.esams.wmnet (10.20.0.58) icmp_seq=2 Destination Host Unreachable
From cp3058.esams.wmnet (10.20.0.58) icmp_seq=3 Destination Host Unreachable
From cp3058.esams.wmnet (10.20.0.58) icmp_seq=4 Destination Host Unreachable
^C
--- cp3057.esams.wmnet ping statistics ---
5 packets transmitted, 0 received, +4 errors, 100% packet loss, time 4007ms
pipe 4

Edit: it looks like the host had soft-crashed for a while and later crashed fully, as seen in the logs, so this is most likely T238305 :-(

Event Timeline

jcrespo created this task. Mon, Feb 3, 11:31 AM
Restricted Application added a project: Operations. Mon, Feb 3, 11:31 AM
Restricted Application added a subscriber: Aklapper.

Mentioned in SAL (#wikimedia-operations) [2020-02-03T11:38:00Z] <ema> powercycle cp3057 T244127 T238305

ema triaged this task as Medium priority. Mon, Feb 3, 11:50 AM
ema added a comment. Mon, Feb 3, 11:55 AM

The host went down at 11:17 according to Icinga, and the following warning was reported to netconsole a little earlier. Unfortunately, we currently cannot tell which host sent which message to netconsole. However, it seems extremely likely that this one came from cp3057, given that no other upload@esams host has a matching warning in dmesg.

Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.719977] ------------[ cut here ]------------
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.719983] WARNING: CPU: 40 PID: 1 at /build/linux-sdMcHj/linux-4.9.189/net/core/netpoll.c:171 netpoll_poll_dev+0x197/0x1a0
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.719987] bnxt_poll+0x0/0xd0 [bnxt_en] exceeded budget in poll
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.720025] Modules linked in: netconsole configfs unix_diag binfmt_misc cpufreq_conservative cpufreq_powersave cpufreq_userspace intel_rapl skx_edac edac_core x
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.720030]  cryptd nvme_core bnxt_en i2c_i801 usbcore i2c_smbus scsi_mod usb_common
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.720032] CPU: 40 PID: 1 Comm: systemd Not tainted 4.9.0-11-amd64 #1 Debian 4.9.189-3+deb9u1
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.720033] Hardware name: Dell Inc. PowerEdge R440/08CYF7, BIOS 2.2.11 06/14/2019
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.720035]  0000000000000000 ffffffff971353d4 ffffba54001ffa90 0000000000000000
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.720037]  ffffffff96e7a83b ffff9ce5df960060 ffffba54001ffae8 ffff9cd38548d788
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.720038]  0000000000000001 ffff9d15ad848d68 ffff9d15dcdbac80 ffffffff96e7a8bf
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.720039] Call Trace:
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.720044]  [<ffffffff971353d4>] ? dump_stack+0x5c/0x78
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.720046]  [<ffffffff96e7a83b>] ? __warn+0xcb/0xf0
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.720047]  [<ffffffff96e7a8bf>] ? warn_slowpath_fmt+0x5f/0x80
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.720049]  [<ffffffffc023ffdf>] ? bnxt_poll+0x7f/0xd0 [bnxt_en]
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.720051]  [<ffffffffc023ff60>] ? bnxt_poll_work+0x520/0x520 [bnxt_en]
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.720053]  [<ffffffff97333af7>] ? netpoll_poll_dev+0x197/0x1a0
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.720054]  [<ffffffff97333c05>] ? netpoll_send_skb_on_dev+0x105/0x270
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.720055]  [<ffffffff9733405c>] ? netpoll_send_udp+0x2ec/0x450
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.720059]  [<ffffffffc0425bb5>] ? write_msg+0xb5/0xf0 [netconsole]
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.720063]  [<ffffffff96ed2081>] ? call_console_drivers.isra.18.constprop.25+0xf1/0x100
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.720065]  [<ffffffff96ed2890>] ? console_unlock+0x240/0x610
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.720067]  [<ffffffff96ed2f76>] ? vprintk_emit+0x316/0x4d0
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.720070]  [<ffffffff96f81e83>] ? printk_emit+0x42/0x5e
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.720073]  [<ffffffff9713ec8b>] ? simple_strtoull+0x3b/0x70
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.720074]  [<ffffffff96ed3244>] ? devkmsg_write+0x114/0x170
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.720077]  [<ffffffff9700b1cb>] ? do_iter_readv_writev+0xbb/0x140
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.720079]  [<ffffffff9700c75e>] ? do_readv_writev+0x19e/0x240
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.720081]  [<ffffffff9700cab6>] ? do_writev+0x66/0x110
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.720084]  [<ffffffff96e03b7d>] ? do_syscall_64+0x8d/0x100
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.720087]  [<ffffffff9741c3ce>] ? entry_SYSCALL_64_after_swapgs+0x58/0xc6
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.720088] ---[ end trace d841b92a717ad872 ]---
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.922923] ------------[ cut here ]------------
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.922929] WARNING: CPU: 16 PID: 1 at /build/linux-sdMcHj/linux-4.9.189/net/core/netpoll.c:373 netpoll_send_skb_on_dev+0x26b/0x270
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.922935] netpoll_send_skb_on_dev(): enp59s0f0 enabled interrupts in poll (bnxt_start_xmit+0x0/0xb90 [bnxt_en])
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.922979] Modules linked in: netconsole configfs unix_diag binfmt_misc cpufreq_conservative cpufreq_powersave cpufreq_userspace intel_rapl skx_edac edac_core x
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.922984]  cryptd nvme_core bnxt_en i2c_i801 usbcore i2c_smbus scsi_mod usb_common
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.922990] CPU: 16 PID: 1 Comm: systemd Tainted: G        W       4.9.0-11-amd64 #1 Debian 4.9.189-3+deb9u1
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.922990] Hardware name: Dell Inc. PowerEdge R440/08CYF7, BIOS 2.2.11 06/14/2019
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.922994]  0000000000000000 ffffffff971353d4 ffffba54001ffad0 0000000000000000
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.922996]  ffffffff96e7a83b ffff9ce5df960000 ffffba54001ffb28 000000000000004f
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.922998]  ffff9ce5e29408c0 ffff9d15ad848d68 ffff9cd38548d780 ffffffff96e7a8bf
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.922999] Call Trace:
Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.923005]  [<ffffffff971353d4>] ? dump_stack+0x5c/0x78
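
Since the receiver cannot currently tell which host sent which netconsole message, one possible way to narrow it down is to use the kernel timestamp in square brackets, which counts seconds since the sender booted. The following is a minimal sketch, not something used in this task: it parses a line in the format shown above and estimates the sender's boot time, which could then be compared against the uptime of each candidate host. The names `NetconsoleLine`, `parse_line` and `estimate_boot_time` are hypothetical, the year is assumed to be supplied by the caller because the log prefix omits it, and the result is only approximate (the wall-clock prefix is the receiver's logging time, and the printk clock can drift from wall-clock time).

```python
import re
from datetime import datetime, timedelta
from typing import NamedTuple, Optional


class NetconsoleLine(NamedTuple):
    """One netconsole message as logged on the receiving host (hypothetical type)."""
    wall: str         # receiver-side timestamp, e.g. "Feb 03 11:15:43" (no year)
    receiver: str     # host that logged the message, e.g. "ganeti3002"
    kernel_ts: float  # sender's printk timestamp, in seconds since its boot
    message: str      # remainder of the kernel message


# Matches lines such as:
# Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: [5616097.719977] <message>
LINE_RE = re.compile(
    r"^(?P<wall>\w{3} \d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<receiver>\S+) \S+: "
    r"\[\s*(?P<kts>\d+\.\d+)\] "
    r"(?P<msg>.*)$"
)


def parse_line(line: str) -> Optional[NetconsoleLine]:
    """Parse one log line; return None if it does not match the expected format."""
    m = LINE_RE.match(line.rstrip("\n"))
    if m is None:
        return None
    return NetconsoleLine(m["wall"], m["receiver"], float(m["kts"]), m["msg"])


def estimate_boot_time(rec: NetconsoleLine, year: int) -> datetime:
    """Approximate the sender's boot time as receiver wall time minus kernel uptime."""
    wall = datetime.strptime(f"{year} {rec.wall}", "%Y %b %d %H:%M:%S")
    return wall - timedelta(seconds=rec.kernel_ts)


SAMPLE = (
    "Feb 03 11:15:43 ganeti3002 nc.openbsd[14771]: "
    "[5616097.719977] ------------[ cut here ]------------"
)

if __name__ == "__main__":
    rec = parse_line(SAMPLE)
    if rec is not None:
        print("receiver:", rec.receiver)
        print("estimated sender boot time:", estimate_boot_time(rec, 2020))
```

A host whose reported uptime does not roughly match the estimated boot time could be ruled out as the sender; two hosts booted within seconds of each other would remain ambiguous.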

+1, there were Icinga errors as early as 11:15:

[2020-02-03 11:15:57] SERVICE ALERT: cp3057;Webrequests Varnishkafka log producer;UNKNOWN;SOFT;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
jcrespo renamed this task from "cp3057 network down" to "cp3057 crash (was: network down)". Mon, Feb 3, 12:17 PM
jcrespo updated the task description.
ema moved this task from Triage to Hardware on the Traffic board. Tue, Feb 4, 1:50 PM
RobH moved this task from Backlog to Break/Fix on the ops-esams board. Fri, Feb 21, 4:30 PM