cp3033 unreachable since 2018-07-15 11:47:31
Closed, Duplicate · Public

Description

cp3033 has been unreachable via the production interface since 2018-07-15 11:47:31. The mgmt interface is reachable and the console doesn't show anything out of the ordinary. After logging in, the dmesg log shows NIC issues.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jul 16 2018, 11:26 AM
root@cp3033:/var/log# ethtool -i eth0
driver: bnx2x
version: 1.712.30-0
firmware-version: FFV7.10.17 bc 7.10.11
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
root@cp3033:/var/log# ethtool eth0
Settings for eth0:
	Supported ports: [ FIBRE ]
	Supported link modes:   1000baseT/Full
	                        10000baseT/Full
	Supported pause frame use: Symmetric Receive-only
	Supports auto-negotiation: No
	Advertised link modes:  10000baseT/Full
	Advertised pause frame use: No
	Advertised auto-negotiation: No
	Speed: Unknown!
	Duplex: Unknown! (255)
	Port: FIBRE
	PHYAD: 1
	Transceiver: internal
	Auto-negotiation: off
	Supports Wake-on: g
	Wake-on: d
	Current message level: 0x00000000 (0)

	Link detected: no
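
For anyone who wants to check for this state automatically, here is a minimal sketch that parses `ethtool <iface>` output like the above and flags a missing link. The function names `parse_ethtool` and `link_is_down` are hypothetical, and the parser assumes the simple `Key: value` layout shown here; it is not an existing tool.

```python
def parse_ethtool(text):
    """Parse 'Key: value' lines from `ethtool <iface>` output into a dict.

    Continuation lines without a colon (e.g. extra link modes) are skipped.
    """
    fields = {}
    for line in text.splitlines():
        stripped = line.strip()
        if ":" not in stripped or stripped.startswith("Settings for"):
            continue
        key, _, value = stripped.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def link_is_down(fields):
    """True when ethtool reported 'Link detected: no'."""
    return fields.get("Link detected") == "no"


# Excerpt of the output pasted above.
sample = (
    "Settings for eth0:\n"
    "\tSupported ports: [ FIBRE ]\n"
    "\tSpeed: Unknown!\n"
    "\tDuplex: Unknown! (255)\n"
    "\tLink detected: no\n"
)

fields = parse_ethtool(sample)
print(fields["Speed"], link_is_down(fields))
```
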
[10415964.660782] ------------[ cut here ]------------
[10415964.660790] WARNING: CPU: 13 PID: 34222 at /srv/kernel/linux/net/sched/sch_generic.c:316 dev_watchdog+0x226/0x230
[10415964.660793] NETDEV WATCHDOG: eth0 (bnx2x): transmit queue 6 timed out
[10415964.660793] Modules linked in: cdc_ether usbnet mii joydev hid_generic usbhid hid cpuid binfmt_misc esp6 xfrm6_mode_transport drbg ansi_cprng seqiv xfrm4_mode_transport cpufreq_conservative cpufreq_powersave cpufreq_userspace xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo 8021q garp mrp stp llc tcp_bbr sch_fq intel_rapl sb_edac ipmi_watchdog edac_core x86_pkg_temp_thermal intel_powerclamp coretemp mgag200 ttm drm_kms_helper kvm dcdbas irqbypass crct10dif_pclmul iTCO_wdt crc32_pclmul iTCO_vendor_support evdev drm ghash_clmulni_intel pcspkr i2c_algo_bit mei_me lpc_ich mei shpchp mfd_core wmi button ipmi_si ipmi_poweroff ipmi_devintf ipmi_msghandler autofs4 ext4 crc16 jbd2 fscrypto mbcache raid1 md_mod sg sd_mod ahci libahci aesni_intel aes_x86_64 glue_helper lrw ehci_pci
[10415964.660847]  gf128mul bnx2x ablk_helper ptp ehci_hcd cryptd libata pps_core mdio libcrc32c usbcore crc32c_generic scsi_mod usb_common crc32c_intel
[10415964.660860] CPU: 13 PID: 34222 Comm: cache-worker Not tainted 4.9.0-0.bpo.6-amd64 #1 Debian 4.9.82-1~wmf1
[10415964.660861] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 1.0.4 08/28/2014
[10415964.660863]  0000000000000000 ffffffffa67305e5 ffff8fe9bf183e38 0000000000000000
[10415964.660865]  ffffffffa6479184 0000000000000006 ffff8fe9bf183e90 ffff8fc9b136c000
[10415964.660868]  000000000000000d ffff8fc9b1377100 000000000000005b ffffffffa64791ff
[10415964.660871] Call Trace:
[10415964.660872]  <IRQ>
[10415964.660878]  [<ffffffffa67305e5>] ? dump_stack+0x5c/0x77
[10415964.660882]  [<ffffffffa6479184>] ? __warn+0xc4/0xe0
[10415964.660884]  [<ffffffffa64791ff>] ? warn_slowpath_fmt+0x5f/0x80
[10415964.660888]  [<ffffffffa696e476>] ? tcp_retransmit_timer+0x286/0x890
[10415964.660891]  [<ffffffffa69369a6>] ? dev_watchdog+0x226/0x230
[10415964.660893]  [<ffffffffa6936780>] ? dev_deactivate_queue.constprop.27+0x60/0x60
[10415964.660898]  [<ffffffffa64e85b2>] ? call_timer_fn+0x32/0x130
[10415964.660899]  [<ffffffffa64e9385>] ? run_timer_softirq+0x1e5/0x440
[10415964.660902]  [<ffffffffa67398a4>] ? timerqueue_add+0x54/0xa0
[10415964.660904]  [<ffffffffa64ea808>] ? enqueue_hrtimer+0x38/0x90
[10415964.660909]  [<ffffffffa6a1617c>] ? __do_softirq+0x10c/0x2a2
[10415964.660911]  [<ffffffffa647f4b8>] ? irq_exit+0x98/0xa0
[10415964.660913]  [<ffffffffa6a15c14>] ? smp_apic_timer_interrupt+0x44/0x50
[10415964.660915]  [<ffffffffa6a14496>] ? apic_timer_interrupt+0x96/0xa0
[10415964.660916]  <EOI>
[10415964.660920]  [<ffffffffa64c5bb3>] ? native_queued_spin_lock_slowpath+0x113/0x190
[10415964.660922]  [<ffffffffa6a1245d>] ? _raw_spin_lock+0x1d/0x20
[10415964.660924]  [<ffffffffa64fb018>] ? futex_wake+0xc8/0x170
[10415964.660926]  [<ffffffffa64fd149>] ? do_futex+0x2d9/0xb40
[10415964.660930]  [<ffffffffa64257d9>] ? __switch_to+0x2c9/0x730
[10415964.660932]  [<ffffffffa64fda33>] ? SyS_futex+0x83/0x180
[10415964.660936]  [<ffffffffa6a0dd52>] ? schedule+0x32/0x80
[10415964.660939]  [<ffffffffa6403bd3>] ? do_syscall_64+0x93/0x1a0
[10415964.660941]  [<ffffffffa6a126b8>] ? entry_SYSCALL_64_after_swapgs+0x42/0xb0
[10415964.660942] ---[ end trace 17a2f2dfd85d5ced ]---
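
The key line in the trace above is the `NETDEV WATCHDOG` transmit-queue timeout. As a rough sketch (assuming the message format shown in this log; the helper name `find_tx_timeouts` is made up for illustration), something like this could pull those events out of saved dmesg output:

```python
import re

# Matches the transmit-timeout warning in the form it appears in the log above.
WATCHDOG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NETDEV WATCHDOG: "
    r"(?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(dmesg_text):
    """Return one dict per NETDEV WATCHDOG line found in dmesg_text."""
    hits = []
    for line in dmesg_text.splitlines():
        m = WATCHDOG_RE.search(line)
        if m:
            hits.append({
                "uptime_s": float(m.group("ts")),  # seconds since boot
                "iface": m.group("iface"),
                "driver": m.group("driver"),
                "queue": int(m.group("queue")),
            })
    return hits


# Line copied from the trace above.
sample_dmesg = (
    "[10415964.660793] NETDEV WATCHDOG: eth0 (bnx2x): "
    "transmit queue 6 timed out"
)

for hit in find_tx_timeouts(sample_dmesg):
    print(hit["iface"], hit["driver"], hit["queue"])
```
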

Mentioned in SAL (#wikimedia-operations) [2018-07-16T11:38:43Z] <vgutierrez> power cycle cp3033 - T199677

Vgutierrez moved this task from Triage to Hardware on the Traffic board. · Jul 16 2018, 11:51 AM

After a power cycle the server is behaving properly. Since it was already depooled, I'm not repooling it.

Vgutierrez triaged this task as Medium priority. · Jul 16 2018, 1:56 PM

That sounds like a hang in the NIC, but I doubt we have any useful hardware diagnostics/logging on that level.

Dzahn added a subscriber: Dzahn. · Apr 24 2019, 12:39 AM

The host also shows that power supplies are not redundant, which had a comment linking to T177403 -> T177228.

And support has expired (https://netbox.wikimedia.org/dcim/devices/831/)

Should we create a decom ticket for it instead?