
upload@ulsfo strange ethernet / power / switch issues, etc...
Closed, ResolvedPublic

Description

IRC log of icinga alerts, etc:

03:16 <+icinga-wm> PROBLEM - Maps edge ulsfo on upload-lb.ulsfo.wikimedia.org is CRITICAL: /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) timed out before a response 
                   was received: /_info (test for /_info) timed out before a response was received
03:19 <+icinga-wm> PROBLEM - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
03:19 <+icinga-wm> PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp4026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
03:19 <+icinga-wm> RECOVERY - Maps edge ulsfo on upload-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy
03:19 <+icinga-wm> RECOVERY - Varnish HTTP upload-frontend - port 3125 on cp4026 is OK: HTTP OK: HTTP/1.1 200 OK - 458 bytes in 0.157 second response time
03:19 <+icinga-wm> PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
03:20 <+icinga-wm> PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
03:20 <+icinga-wm> PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
03:21 < bblack> looking
03:22 <+icinga-wm> PROBLEM - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
03:23 <+icinga-wm> RECOVERY - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 908 bytes in 5.451 second response time
03:24 < bblack> !log depooled cp4026
03:25 <+icinga-wm> RECOVERY - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 895 bytes in 7.330 second response time

dmesg on cp4026:

[Thu Sep 21 03:17:43 2017] ------------[ cut here ]------------
[Thu Sep 21 03:17:43 2017] WARNING: CPU: 35 PID: 0 at /build/linux-OExn4L/linux-4.9.30/net/sched/sch_generic.c:316 dev_watchdog+0x220/0x230
[Thu Sep 21 03:17:43 2017] NETDEV WATCHDOG: eth0 (bnx2x): transmit queue 0 timed out
[Thu Sep 21 03:17:43 2017] Modules linked in: binfmt_misc esp6 xfrm6_mode_transport drbg ansi_cprng seqiv xfrm4_mode_transport cpufreq_userspace cpufreq_powersave cpufreq_conservative xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo 8021q garp mrp stp llc intel_rapl sb_edac edac_core x86_pkg_temp_thermal mgag200 iTCO_wdt intel_powerclamp iTCO_vendor_support coretemp mxm_wmi evdev dcdbas ttm drm_kms_helper kvm drm i2c_algo_bit irqbypass crct10dif_pclmul crc32_pclmul mei_me lpc_ich ghash_clmulni_intel pcspkr shpchp mei mfd_core ipmi_si wmi button tcp_bbr sch_fq ipmi_devintf ipmi_msghandler autofs4 ext4 crc16 jbd2 fscrypto mbcache raid1 md_mod sg sd_mod ahci bnx2x libahci aesni_intel aes_x86_64 ehci_pci glue_helper lrw gf128mul libata ehci_hcd ablk_helper ptp cryptd pps_core mdio usbcore
[Thu Sep 21 03:17:43 2017]  libcrc32c scsi_mod crc32c_generic usb_common crc32c_intel
[Thu Sep 21 03:17:43 2017] CPU: 35 PID: 0 Comm: swapper/35 Not tainted 4.9.0-0.bpo.3-amd64 #1 Debian 4.9.30-2+deb9u2~bpo8+1
[Thu Sep 21 03:17:43 2017] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.4.2 01/09/2017
[Thu Sep 21 03:17:43 2017]  0000000000000000 ffffffffa6f29ca5 ffff8bedff443e38 0000000000000000
[Thu Sep 21 03:17:43 2017]  ffffffffa6c77964 0000000000000000 ffff8bedff443e90 ffff8bcd62b60000
[Thu Sep 21 03:17:43 2017]  0000000000000023 ffff8bcd62b6f100 000000000000005b ffffffffa6c779df
[Thu Sep 21 03:17:43 2017] Call Trace:
[Thu Sep 21 03:17:43 2017]  <IRQ> 
[Thu Sep 21 03:17:43 2017]  [<ffffffffa6f29ca5>] ? dump_stack+0x5c/0x77
[Thu Sep 21 03:17:43 2017]  [<ffffffffa6c77964>] ? __warn+0xc4/0xe0
[Thu Sep 21 03:17:43 2017]  [<ffffffffa6c779df>] ? warn_slowpath_fmt+0x5f/0x80
[Thu Sep 21 03:17:43 2017]  [<ffffffffa712cef0>] ? dev_watchdog+0x220/0x230
[Thu Sep 21 03:17:43 2017]  [<ffffffffa712ccd0>] ? dev_deactivate_queue.constprop.27+0x60/0x60
[Thu Sep 21 03:17:43 2017]  [<ffffffffa6ce6330>] ? call_timer_fn+0x30/0x130
[Thu Sep 21 03:17:43 2017]  [<ffffffffa6ce791c>] ? run_timer_softirq+0x1dc/0x440
[Thu Sep 21 03:17:43 2017]  [<ffffffffa6f32ee4>] ? timerqueue_add+0x54/0xa0
[Thu Sep 21 03:17:43 2017]  [<ffffffffa6ce82e8>] ? enqueue_hrtimer+0x38/0x80
[Thu Sep 21 03:17:43 2017]  [<ffffffffa720a2e6>] ? __do_softirq+0x106/0x292
[Thu Sep 21 03:17:43 2017]  [<ffffffffa6c7dc08>] ? irq_exit+0x98/0xa0
[Thu Sep 21 03:17:43 2017]  [<ffffffffa720a0ee>] ? smp_apic_timer_interrupt+0x3e/0x50
[Thu Sep 21 03:17:43 2017]  [<ffffffffa7209402>] ? apic_timer_interrupt+0x82/0x90
[Thu Sep 21 03:17:43 2017]  <EOI> 
[Thu Sep 21 03:17:43 2017]  [<ffffffffa70ccfc3>] ? cpuidle_enter_state+0x113/0x260
[Thu Sep 21 03:17:43 2017]  [<ffffffffa6cbc0be>] ? cpu_startup_entry+0x17e/0x260
[Thu Sep 21 03:17:43 2017]  [<ffffffffa6c4845d>] ? start_secondary+0x14d/0x190
[Thu Sep 21 03:17:43 2017] ---[ end trace f2c43ffdc31e3ace ]---
[Thu Sep 21 03:17:45 2017] bnx2x 0000:05:00.0 eth0: using MSI-X  IRQs: sp 35  fp[0] 37 ... fp[11] 48
[Thu Sep 21 03:17:45 2017] bnx2x 0000:05:00.0 eth0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[Thu Sep 21 03:17:52 2017] bnx2x 0000:05:00.0 eth0: using MSI-X  IRQs: sp 35  fp[0] 37 ... fp[11] 48
[Thu Sep 21 03:17:52 2017] bnx2x 0000:05:00.0 eth0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[Thu Sep 21 03:18:36 2017] bnx2x: [bnx2x_clean_tx_queue:1205(eth0)]timeout waiting for queue[0]: txdata->tx_pkt_prod(12909) != txdata->tx_pkt_cons(11166)
[Thu Sep 21 03:18:39 2017] bnx2x 0000:05:00.0 eth0: using MSI-X  IRQs: sp 35  fp[0] 37 ... fp[11] 48
[Thu Sep 21 03:18:39 2017] bnx2x 0000:05:00.0 eth0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[Thu Sep 21 03:18:47 2017] bnx2x: [bnx2x_clean_tx_queue:1205(eth0)]timeout waiting for queue[0]: txdata->tx_pkt_prod(1929) != txdata->tx_pkt_cons(1916)
[Thu Sep 21 03:18:48 2017] bnx2x 0000:05:00.0 eth0: using MSI-X  IRQs: sp 35  fp[0] 37 ... fp[11] 48
[Thu Sep 21 03:18:48 2017] bnx2x 0000:05:00.0 eth0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[Thu Sep 21 03:21:30 2017] TCP: too many orphaned sockets
[Thu Sep 21 03:23:24 2017] bnx2x 0000:05:00.0 eth0: using MSI-X  IRQs: sp 35  fp[0] 37 ... fp[11] 48
[Thu Sep 21 03:23:24 2017] bnx2x 0000:05:00.0 eth0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[Thu Sep 21 03:23:31 2017] bnx2x 0000:05:00.0 eth0: using MSI-X  IRQs: sp 35  fp[0] 37 ... fp[11] 48
[Thu Sep 21 03:23:31 2017] bnx2x 0000:05:00.0 eth0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[Thu Sep 21 03:23:55 2017] bnx2x 0000:05:00.0 eth0: using MSI-X  IRQs: sp 35  fp[0] 37 ... fp[11] 48
[Thu Sep 21 03:23:55 2017] bnx2x 0000:05:00.0 eth0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit

It's possible this is driven by a hardware bug (or switch issue?), but there's also a good chance this is some kind of rare failure of the $numa_networking setup on these nodes?
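
For context on the $numa_networking hypothesis: numa_networking: isolate, as far as I understand it, pins the NIC's queues/IRQs to the NUMA node the adapter is attached to. A minimal sketch for eyeballing that layout on one of these hosts, using only standard Linux sysfs/procfs (not the actual Puppet code; eth0 is assumed from the dmesg above):

#!/usr/bin/env python3
# Minimal sketch (not the actual numa_networking Puppet code): print which NUMA
# node a NIC's PCI device is attached to, and which CPUs its IRQs are allowed
# to run on, so a mis-pinned queue/IRQ layout can be spotted by hand.
import os

IFACE = "eth0"  # assumption: interface name taken from the dmesg output above

def nic_numa_node(iface):
    path = f"/sys/class/net/{iface}/device/numa_node"
    return open(path).read().strip() if os.path.exists(path) else "unknown"

def nic_irq_affinities(iface):
    """Yield (irq, smp_affinity_list) for IRQs whose /proc/interrupts line mentions the NIC."""
    with open("/proc/interrupts") as f:
        for line in f:
            fields = line.split()
            if fields and fields[0].rstrip(":").isdigit() and iface in line:
                irq = fields[0].rstrip(":")
                try:
                    aff = open(f"/proc/irq/{irq}/smp_affinity_list").read().strip()
                except OSError:
                    aff = "?"
                yield irq, aff

if __name__ == "__main__":
    print(f"{IFACE} NUMA node: {nic_numa_node(IFACE)}")
    for irq, aff in nic_irq_affinities(IFACE):
        print(f"IRQ {irq} -> CPUs {aff}")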

Event Timeline

Restricted Application added a subscriber: Aklapper.

Actually, seeing the same on several cp402x. Depooling ulsfo, maybe switch issue?

Change 379464 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] depool ulsfo

https://gerrit.wikimedia.org/r/379464

Change 379464 merged by BBlack:
[operations/dns@master] depool ulsfo

https://gerrit.wikimedia.org/r/379464

So, the same basic issue appears to have happened for almost all of upload@ulsfo (cp402[12356]) at about the same time. cp4021 was the lone exception. cp402[78] in text@ulsfo were unaffected. Inbound network traffic to all the upload@ulsfo nodes was ramping up to unusual values ahead of the netdev watchdog -> link issues -> meltdown. It's possible this was caused by an actual external traffic burst?

Also notable is that during the same timing-coincidence window, we had some more codfw RIPE Atlas probe alarms, and then shortly after the ulsfo depool, the ulsfo RIPE Atlas probe plus cp4007 have gone host-down so far...

asw-ulsfo has some other alerts going on, aside from the expected link loss to various flapping or supposedly-down hosts, e.g.:

Sep 21 03:57:51  asw-ulsfo alarmd[1458]: Alarm set: PS JAVA color=RED, class=CHASSIS, reason=FPC 2 PEM 1 is not powered
Sep 21 03:57:51  asw-ulsfo craftd[1459]:  Major alarm set, FPC 2 PEM 1 is not powered

... and now we've lost the cr1-eqiad <-> cr1-codfw link ... ?

cr1-eqiad xe-4/2/0: down -> Core: cr1-codfw:xe-5/2/1

Recoveries of whatever the hell is happening in ulsfo:

04:26 <+icinga-wm> RECOVERY - Juniper alarms on asw-ulsfo is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms
04:28 <+icinga-wm> RECOVERY - Host cp4007 is UP: PING OK - Packet loss = 0%, RTA = 78.60 ms
04:30 <+icinga-wm> RECOVERY - Host ripe-atlas-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 78.69 ms
04:30 <+icinga-wm> RECOVERY - Host ripe-atlas-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 78.67 ms
04:30 <+icinga-wm> RECOVERY - Host cp4007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.19 ms
BBlack renamed this task from "cp4026 strange ethernet issue" to "upload@ulsfo strange ethernet / power / switch issues, etc...". Sep 21 2017, 4:34 AM

Inbound network traffic to all the upload@ulsfo nodes was ramping up to unusual values ahead of the netdev watchdog -> link issues -> meltdown. It's possible this was caused by an actual external traffic burst?

Yes, this is what lvs4002 looked like shortly before your depool: https://grafana.wikimedia.org/dashboard/db/ipvs-backend-connections?orgId=1&var-datasource=ulsfo%20prometheus%2Fops&var-server=2&from=1505821559731&to=1505974302176&panelId=1&fullscreen
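
For anyone re-checking this window later, a rough sketch of pulling the same per-backend connection counts straight from the Prometheus HTTP API; the endpoint URL and the node_ipvs_backend_connections_active metric name are assumptions and may not match the actual ops Prometheus setup behind the Grafana link above:

#!/usr/bin/env python3
# Rough sketch: query Prometheus for IPVS backend connection counts around the
# incident window. The server URL and metric name are assumptions.
import json
import urllib.parse
import urllib.request

PROM = "http://prometheus.example.org/ops"  # hypothetical endpoint
PARAMS = {
    "query": 'sum(node_ipvs_backend_connections_active{instance=~"lvs4002.*"})',
    "start": "2017-09-21T03:00:00Z",
    "end": "2017-09-21T04:30:00Z",
    "step": "60s",
}

url = f"{PROM}/api/v1/query_range?{urllib.parse.urlencode(PARAMS)}"
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

for series in data["data"]["result"]:
    for ts, value in series["values"]:
        print(ts, value)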

ema triaged this task as High priority. Sep 21 2017, 7:37 AM

So, the same basic issue appears to have happened for almost all of upload@ulsfo (cp402[12356]) at about the same time. cp4021 was the lone exception. cp402[78] in text@ulsfo were unaffected. Inbound network traffic to all the upload@ulsfo nodes was ramping up to unusual values ahead of the netdev watchdog -> link issues -> meltdown. It's possible this was caused by an actual external traffic burst?

I've confirmed that text@ulsfo has not been affected.

The following upload@ulsfo hosts show the exact same kernel oops mentioned in the task description: cp4022, cp4023, cp4025, and cp4026 (all of which have numa_networking: isolate). The timing is also interesting: they all happened between 03:17:01 (cp4023) and 03:18:00 (cp4025). Note that the ramp-up in ulsfo traffic happened much later than that, at ~03:45.

cp4021, as mentioned earlier in the task, is fine, and so are cp4005 and cp4007. The latter was restarted on Thu Sep 21 04:28:27, though.

Then, unrelated to this specific issue, there's another oops in dmesg for the following hosts (with date of occurrence):
cp4013 Tue Jul 18 22:15:06 2017
cp4006 Tue Jul 25 12:17:32 2017
cp4014 Tue Aug 1 03:23:00 2017
cp4015 Tue Aug 1 08:55:07 2017

[Tue Jul 18 22:15:06 2017] ------------[ cut here ]------------
[Tue Jul 18 22:15:06 2017] WARNING: CPU: 2 PID: 0 at /tmp/buildd/linux-4.9.25/net/ipv4/tcp_input.c:2240 tcp_mark_head_lost+0x1f4/0x250
[Tue Jul 18 22:15:06 2017] Modules linked in: binfmt_misc esp6 xfrm6_mode_transport drbg ansi_cprng seqiv xfrm4_mode_transport cpufreq_userspace cpufreq_conservative cpufreq_powersave xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo 8021q garp mrp stp llc iTCO_wdt iTCO_vendor_support dcdbas mgag200 ttm intel_cstate(+) intel_rapl_perf drm_kms_helper evdev joydev pcspkr drm i2c_algo_bit mei_me lpc_ich mfd_core mei ipmi_si wmi acpi_pad button shpchp tcp_bbr sch_fq ipmi_devintf ipmi_msghandler autofs4 ext4 crc16 jbd2 fscrypto mbcache raid1 md_mod hid_generic usbhid hid sg sd_mod aesni_intel aes_x86_64 ehci_pci glue_helper ahci lrw bnx2x gf128mul ehci_hcd libahci ablk_helper ptp cryptd libata pps_core mdio usbcore libcrc32c scsi_mod crc32c_generic usb_common crc32c_intel
[Tue Jul 18 22:15:06 2017] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.9.0-0.bpo.3-amd64 #1 Debian 4.9.25-1~bpo8+3
[Tue Jul 18 22:15:06 2017] Hardware name: Dell Inc. PowerEdge R620/0PXXHP, BIOS 1.6.0 03/07/2013
[Tue Jul 18 22:15:06 2017]  0000000000000000 ffffffffb6929be5 0000000000000000 0000000000000000
[Tue Jul 18 22:15:06 2017]  ffffffffb6677924 ffff9bb77f5477c0 ffff9bb77f5477c0 0000000000000001
[Tue Jul 18 22:15:06 2017]  ffff9bb79fa43b94 000000000000002c 0000000000000000 ffffffffb6b52c14
[Tue Jul 18 22:15:06 2017] Call Trace:
[Tue Jul 18 22:15:06 2017]  <IRQ> 
[Tue Jul 18 22:15:06 2017]  [<ffffffffb6929be5>] ? dump_stack+0x5c/0x77
[Tue Jul 18 22:15:06 2017]  [<ffffffffb6677924>] ? __warn+0xc4/0xe0
[Tue Jul 18 22:15:06 2017]  [<ffffffffb6b52c14>] ? tcp_mark_head_lost+0x1f4/0x250
[Tue Jul 18 22:15:06 2017]  [<ffffffffb6b577e6>] ? tcp_fastretrans_alert+0x746/0xa80
[Tue Jul 18 22:15:06 2017]  [<ffffffffb6b58bd4>] ? tcp_ack+0x974/0x1610
[Tue Jul 18 22:15:06 2017]  [<ffffffffb66bb34e>] ? __wake_up_common+0x4e/0x90
[Tue Jul 18 22:15:06 2017]  [<ffffffffb6b5ac95>] ? tcp_rcv_established+0x1b5/0x6e0
[Tue Jul 18 22:15:06 2017]  [<ffffffffb6b65777>] ? tcp_v4_do_rcv+0x127/0x200
[Tue Jul 18 22:15:06 2017]  [<ffffffffb6b66f1a>] ? tcp_v4_rcv+0x96a/0x9e0
[Tue Jul 18 22:15:06 2017]  [<ffffffffb6b4093b>] ? ip_local_deliver_finish+0x8b/0x1c0
[Tue Jul 18 22:15:06 2017]  [<ffffffffb6b40c0b>] ? ip_local_deliver+0x6b/0xf0
[Tue Jul 18 22:15:06 2017]  [<ffffffffb6b665a0>] ? tcp_v4_early_demux+0x130/0x140
[Tue Jul 18 22:15:06 2017]  [<ffffffffb6b40642>] ? ip_rcv_finish+0x172/0x3e0
[Tue Jul 18 22:15:06 2017]  [<ffffffffb6b40f11>] ? ip_rcv+0x281/0x3b0
[Tue Jul 18 22:15:06 2017]  [<ffffffffb6bef0ae>] ? packet_rcv+0x4e/0x440
[Tue Jul 18 22:15:06 2017]  [<ffffffffb6affd0e>] ? __netif_receive_skb_core+0x2be/0xa40
[Tue Jul 18 22:15:06 2017]  [<ffffffffb6b01572>] ? process_backlog+0x92/0x140
[Tue Jul 18 22:15:06 2017]  [<ffffffffb6b00ce5>] ? net_rx_action+0x245/0x380
[Tue Jul 18 22:15:06 2017]  [<ffffffffb6c08126>] ? __do_softirq+0x106/0x292
[Tue Jul 18 22:15:06 2017]  [<ffffffffb667dba8>] ? irq_exit+0x98/0xa0
[Tue Jul 18 22:15:06 2017]  [<ffffffffb6c07e6f>] ? do_IRQ+0x4f/0xd0
[Tue Jul 18 22:15:06 2017]  [<ffffffffb6c05f82>] ? common_interrupt+0x82/0x82
[Tue Jul 18 22:15:06 2017]  <EOI> 
[Tue Jul 18 22:15:06 2017]  [<ffffffffb6acb330>] ? cpuidle_enter_state+0x110/0x260
[Tue Jul 18 22:15:06 2017]  [<ffffffffb66bc05e>] ? cpu_startup_entry+0x17e/0x260
[Tue Jul 18 22:15:06 2017]  [<ffffffffb664843d>] ? start_secondary+0x14d/0x190
[Tue Jul 18 22:15:06 2017] ---[ end trace 49938df40d86111b ]---

The exact call traces do differ, but they all match up to tcp_v4_rcv.
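
A quick sketch of the cross-host check described above, grepping each cache host's kernel log for the two oops signatures; the host list and plain SSH access are assumptions rather than the tooling actually used:

#!/usr/bin/env python3
# Rough sketch of the cross-host check: grep each host's kernel log for the two
# oops signatures seen in this task. Host list is hypothetical.
import subprocess

HOSTS = [f"cp40{n}" for n in (21, 22, 23, 25, 26)]  # hypothetical upload@ulsfo list
SIGNATURES = [
    "NETDEV WATCHDOG: eth0 (bnx2x)",  # the transmit-queue-timeout warning
    "tcp_mark_head_lost",             # the unrelated TCP oops (cp4006/13/14/15)
]

for host in HOSTS:
    try:
        result = subprocess.run(
            ["ssh", host, "dmesg", "-T"],
            capture_output=True, text=True, timeout=30,
        )
    except subprocess.TimeoutExpired:
        print(f"{host}: unreachable")
        continue
    hits = [sig for sig in SIGNATURES if sig in result.stdout]
    print(f"{host}: {', '.join(hits) if hits else 'no matching oopses'}")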

Change 379498 had a related patch set uploaded (by Ema; owner: Ema):
[operations/dns@master] Repool text@ulsfo

https://gerrit.wikimedia.org/r/379498

Change 379498 merged by Ema:
[operations/dns@master] Repool text@ulsfo

https://gerrit.wikimedia.org/r/379498

Recoveries of whatever the hell is happening in ulsfo:

04:26 <+icinga-wm> RECOVERY - Juniper alarms on asw-ulsfo is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms
04:28 <+icinga-wm> RECOVERY - Host cp4007 is UP: PING OK - Packet loss = 0%, RTA = 78.60 ms
04:30 <+icinga-wm> RECOVERY - Host ripe-atlas-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 78.69 ms
04:30 <+icinga-wm> RECOVERY - Host ripe-atlas-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 78.67 ms
04:30 <+icinga-wm> RECOVERY - Host cp4007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.19 ms

asw-ulsfo losing PEM 1, plus ripe-atlas dying, is indicative of one of our two power feeds failing; cp4007 has a failed secondary PSU, which is why it was also affected:

root@cp4007:~# /usr/local/lib/nagios/plugins/check_ipmi_sensor -T Power_Supply -T Temperature --nosel
Sensor Type(s) Power_Supply, Temperature Status: Critical [PS Redundancy = Critical] | 'Inlet Temp'=20.00;3.00:42.00;-7.00:47.00 'Exhaust Temp'=41.00;8.00:70.00;3.00:75.00 'Temp'=63.00 'Temp'=64.00

(T109903 tracks the lack of monitoring for this type of issue :/)
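
A minimal sketch of the kind of check that's missing, wrapping the same check_ipmi_sensor plugin shown above and alerting when PS Redundancy is anything other than OK; this is only an illustration, not the actual check tracked in T109903:

#!/usr/bin/env python3
# Minimal sketch of a Nagios-style wrapper that alerts when power supply
# redundancy is lost, parsing the same check_ipmi_sensor output shown above.
import re
import subprocess
import sys

CMD = ["/usr/local/lib/nagios/plugins/check_ipmi_sensor", "-T", "Power_Supply", "--nosel"]

def main():
    out = subprocess.run(CMD, capture_output=True, text=True).stdout
    # Example output: "... Status: Critical [PS Redundancy = Critical] | ..."
    m = re.search(r"PS Redundancy = (\w+)", out)
    if m and m.group(1) != "OK":
        print(f"CRITICAL: PS Redundancy is {m.group(1)}")
        return 2
    print("OK: power supply redundancy healthy")
    return 0

if __name__ == "__main__":
    sys.exit(main())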

Confirmed from UnitedLayer email:

Assad Kermanshahi, Sep 20, 21:13 PDT

Dear WikimediaFoundation,

This is to inform you that your PDUb in cab 1.22 at 200 Paul lost power. We are at your cab and trying to restore the power and will
update you as soon as we do.

Thank you.
UnitedLayer
Assad Kermanshahi, Sep 20, 21:39 PDT

Dear Rob,

We have restored the power in your cab 1.22 b-side and all of your devises that were connected to that side up and running.

We will look into this issue and will update you tomorrow regarding the root cause of this power failure.

Thank you.
Unitedlayer

But according to the logs, the power event happened at 03:57, way after the oopses (03:17) and the traffic ramp-up (03:45). If it weren't for that time difference, my theory would have been that UL moved fibers while they were fixing the PDU issue. I wonder if someone was messing with our rack between 03:17 and 03:57, although the communication above suggests they weren't...
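
To make the comparison easier, a small sketch that normalizes the timestamps in this task to UTC, assuming the host and switch logs are UTC (as the IRC timestamps suggest) and the UnitedLayer emails are PDT, i.e. UTC-7 in September:

#!/usr/bin/env python3
# Normalize this task's timestamps to UTC: host/switch logs appear to be UTC,
# while the UnitedLayer emails are PDT (UTC-7 in September).
from datetime import datetime, timedelta, timezone

PDT = timezone(timedelta(hours=-7))
UTC = timezone.utc

events = [
    ("cp402x NETDEV WATCHDOG oopses",       datetime(2017, 9, 21, 3, 17, tzinfo=UTC)),
    ("ulsfo inbound traffic ramp-up",       datetime(2017, 9, 21, 3, 45, tzinfo=UTC)),
    ("asw-ulsfo FPC 2 PEM 1 power alarm",   datetime(2017, 9, 21, 3, 57, tzinfo=UTC)),
    ("UnitedLayer 'lost power' email",      datetime(2017, 9, 20, 21, 13, tzinfo=PDT)),
    ("UnitedLayer 'power restored' email",  datetime(2017, 9, 20, 21, 39, tzinfo=PDT)),
]

for label, ts in sorted(events, key=lambda e: e[1]):
    print(f"{ts.astimezone(UTC):%Y-%m-%d %H:%M} UTC  {label}")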

BBlack claimed this task.

Nothing really to do here, except remember it if new power issues arise with the new hosts...