IRC log of icinga alerts, etc:
03:16 <+icinga-wm> PROBLEM - Maps edge ulsfo on upload-lb.ulsfo.wikimedia.org is CRITICAL: /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) timed out before a response was received: /_info (test for /_info) timed out before a response was received
03:19 <+icinga-wm> PROBLEM - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
03:19 <+icinga-wm> PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp4026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
03:19 <+icinga-wm> RECOVERY - Maps edge ulsfo on upload-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy
03:19 <+icinga-wm> RECOVERY - Varnish HTTP upload-frontend - port 3125 on cp4026 is OK: HTTP OK: HTTP/1.1 200 OK - 458 bytes in 0.157 second response time
03:19 <+icinga-wm> PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
03:20 <+icinga-wm> PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
03:20 <+icinga-wm> PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
03:21 < bblack> looking
03:22 <+icinga-wm> PROBLEM - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
03:23 <+icinga-wm> RECOVERY - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 908 bytes in 5.451 second response time
03:24 < bblack> !log depooled cp4026
03:25 <+icinga-wm> RECOVERY - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 895 bytes in 7.330 second response time
dmesg on cp4026:
[Thu Sep 21 03:17:43 2017] ------------[ cut here ]------------
[Thu Sep 21 03:17:43 2017] WARNING: CPU: 35 PID: 0 at /build/linux-OExn4L/linux-4.9.30/net/sched/sch_generic.c:316 dev_watchdog+0x220/0x230
[Thu Sep 21 03:17:43 2017] NETDEV WATCHDOG: eth0 (bnx2x): transmit queue 0 timed out
[Thu Sep 21 03:17:43 2017] Modules linked in: binfmt_misc esp6 xfrm6_mode_transport drbg ansi_cprng seqiv xfrm4_mode_transport cpufreq_userspace cpufreq_powersave cpufreq_conservative xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo 8021q garp mrp stp llc intel_rapl sb_edac edac_core x86_pkg_temp_thermal mgag200 iTCO_wdt intel_powerclamp iTCO_vendor_support coretemp mxm_wmi evdev dcdbas ttm drm_kms_helper kvm drm i2c_algo_bit irqbypass crct10dif_pclmul crc32_pclmul mei_me lpc_ich ghash_clmulni_intel pcspkr shpchp mei mfd_core ipmi_si wmi button tcp_bbr sch_fq ipmi_devintf ipmi_msghandler autofs4 ext4 crc16 jbd2 fscrypto mbcache raid1 md_mod sg sd_mod ahci bnx2x libahci aesni_intel aes_x86_64 ehci_pci glue_helper lrw gf128mul libata ehci_hcd ablk_helper ptp cryptd pps_core mdio usbcore
[Thu Sep 21 03:17:43 2017] libcrc32c scsi_mod crc32c_generic usb_common crc32c_intel
[Thu Sep 21 03:17:43 2017] CPU: 35 PID: 0 Comm: swapper/35 Not tainted 4.9.0-0.bpo.3-amd64 #1 Debian 4.9.30-2+deb9u2~bpo8+1
[Thu Sep 21 03:17:43 2017] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.4.2 01/09/2017
[Thu Sep 21 03:17:43 2017] 0000000000000000 ffffffffa6f29ca5 ffff8bedff443e38 0000000000000000
[Thu Sep 21 03:17:43 2017] ffffffffa6c77964 0000000000000000 ffff8bedff443e90 ffff8bcd62b60000
[Thu Sep 21 03:17:43 2017] 0000000000000023 ffff8bcd62b6f100 000000000000005b ffffffffa6c779df
[Thu Sep 21 03:17:43 2017] Call Trace:
[Thu Sep 21 03:17:43 2017] <IRQ>
[Thu Sep 21 03:17:43 2017] [<ffffffffa6f29ca5>] ? dump_stack+0x5c/0x77
[Thu Sep 21 03:17:43 2017] [<ffffffffa6c77964>] ? __warn+0xc4/0xe0
[Thu Sep 21 03:17:43 2017] [<ffffffffa6c779df>] ? warn_slowpath_fmt+0x5f/0x80
[Thu Sep 21 03:17:43 2017] [<ffffffffa712cef0>] ? dev_watchdog+0x220/0x230
[Thu Sep 21 03:17:43 2017] [<ffffffffa712ccd0>] ? dev_deactivate_queue.constprop.27+0x60/0x60
[Thu Sep 21 03:17:43 2017] [<ffffffffa6ce6330>] ? call_timer_fn+0x30/0x130
[Thu Sep 21 03:17:43 2017] [<ffffffffa6ce791c>] ? run_timer_softirq+0x1dc/0x440
[Thu Sep 21 03:17:43 2017] [<ffffffffa6f32ee4>] ? timerqueue_add+0x54/0xa0
[Thu Sep 21 03:17:43 2017] [<ffffffffa6ce82e8>] ? enqueue_hrtimer+0x38/0x80
[Thu Sep 21 03:17:43 2017] [<ffffffffa720a2e6>] ? __do_softirq+0x106/0x292
[Thu Sep 21 03:17:43 2017] [<ffffffffa6c7dc08>] ? irq_exit+0x98/0xa0
[Thu Sep 21 03:17:43 2017] [<ffffffffa720a0ee>] ? smp_apic_timer_interrupt+0x3e/0x50
[Thu Sep 21 03:17:43 2017] [<ffffffffa7209402>] ? apic_timer_interrupt+0x82/0x90
[Thu Sep 21 03:17:43 2017] <EOI>
[Thu Sep 21 03:17:43 2017] [<ffffffffa70ccfc3>] ? cpuidle_enter_state+0x113/0x260
[Thu Sep 21 03:17:43 2017] [<ffffffffa6cbc0be>] ? cpu_startup_entry+0x17e/0x260
[Thu Sep 21 03:17:43 2017] [<ffffffffa6c4845d>] ? start_secondary+0x14d/0x190
[Thu Sep 21 03:17:43 2017] ---[ end trace f2c43ffdc31e3ace ]---
[Thu Sep 21 03:17:45 2017] bnx2x 0000:05:00.0 eth0: using MSI-X IRQs: sp 35 fp[0] 37 ... fp[11] 48
[Thu Sep 21 03:17:45 2017] bnx2x 0000:05:00.0 eth0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[Thu Sep 21 03:17:52 2017] bnx2x 0000:05:00.0 eth0: using MSI-X IRQs: sp 35 fp[0] 37 ... fp[11] 48
[Thu Sep 21 03:17:52 2017] bnx2x 0000:05:00.0 eth0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[Thu Sep 21 03:18:36 2017] bnx2x: [bnx2x_clean_tx_queue:1205(eth0)]timeout waiting for queue[0]: txdata->tx_pkt_prod(12909) != txdata->tx_pkt_cons(11166)
[Thu Sep 21 03:18:39 2017] bnx2x 0000:05:00.0 eth0: using MSI-X IRQs: sp 35 fp[0] 37 ... fp[11] 48
[Thu Sep 21 03:18:39 2017] bnx2x 0000:05:00.0 eth0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[Thu Sep 21 03:18:47 2017] bnx2x: [bnx2x_clean_tx_queue:1205(eth0)]timeout waiting for queue[0]: txdata->tx_pkt_prod(1929) != txdata->tx_pkt_cons(1916)
[Thu Sep 21 03:18:48 2017] bnx2x 0000:05:00.0 eth0: using MSI-X IRQs: sp 35 fp[0] 37 ... fp[11] 48
[Thu Sep 21 03:18:48 2017] bnx2x 0000:05:00.0 eth0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[Thu Sep 21 03:21:30 2017] TCP: too many orphaned sockets
[Thu Sep 21 03:23:24 2017] bnx2x 0000:05:00.0 eth0: using MSI-X IRQs: sp 35 fp[0] 37 ... fp[11] 48
[Thu Sep 21 03:23:24 2017] bnx2x 0000:05:00.0 eth0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[Thu Sep 21 03:23:31 2017] bnx2x 0000:05:00.0 eth0: using MSI-X IRQs: sp 35 fp[0] 37 ... fp[11] 48
[Thu Sep 21 03:23:31 2017] bnx2x 0000:05:00.0 eth0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[Thu Sep 21 03:23:55 2017] bnx2x 0000:05:00.0 eth0: using MSI-X IRQs: sp 35 fp[0] 37 ... fp[11] 48
[Thu Sep 21 03:23:55 2017] bnx2x 0000:05:00.0 eth0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
It's possible this is driven by a hardware bug (or a switch issue), but there's also a good chance it is some kind of rare failure of the $numa_networking setup on these nodes.
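To compare this event against other cache hosts, the dmesg excerpt above could be reduced to a short timeline of NIC events. The following is only a rough sketch: the event labels, the `PATTERNS` table, and the `summarize` helper are hypothetical names chosen for illustration, and the substrings are simply the ones that appear in the cp4026 log above. It assumes input with human-readable timestamps, as produced by `dmesg -T`.

```python
import re
import sys
from collections import Counter

# Hypothetical event labels mapped to substrings taken from the cp4026 excerpt.
PATTERNS = {
    "tx_watchdog_timeout": "NETDEV WATCHDOG",
    "clean_tx_queue_timeout": "bnx2x_clean_tx_queue",
    "link_up": "NIC Link is Up",
    "orphaned_sockets": "too many orphaned sockets",
}

# Matches timestamps of the form "[Thu Sep 21 03:17:43 2017]".
TIMESTAMP = re.compile(
    r"^\[(\w{3} \w{3} +\d+ \d{2}:\d{2}:\d{2} \d{4})\]\s*(.*)$"
)


def summarize(lines):
    """Return (counts per label, ordered list of (timestamp, label))."""
    counts = Counter()
    timeline = []
    for line in lines:
        match = TIMESTAMP.match(line.strip())
        if not match:
            continue  # skip lines without a recognizable timestamp
        stamp, message = match.groups()
        for label, needle in PATTERNS.items():
            if needle in message:
                counts[label] += 1
                timeline.append((stamp, label))
    return counts, timeline


if __name__ == "__main__":
    counts, timeline = summarize(sys.stdin)
    for stamp, label in timeline:
        print(stamp, label)
    print(dict(counts))
```

If saved as, say, `nic_events.py` (a made-up filename), it could be run as `dmesg -T | python3 nic_events.py` on each host; repeated `link_up` events clustered around a `tx_watchdog_timeout` would match the pattern seen here.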