cr4-ulsfo rebooted unexpectedly
Closed, Resolved · Public

Description

cr4-ulsfo rebooted today, unexpectedly, at 23:19:34 UTC (shown as 19:20 in the IRC paste below).

System booted: 2019-04-16 23:19:34 UTC (00:05:42 ago)

19:20 <+icinga-wm> PROBLEM - Host cr4-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
19:20 <+icinga-wm> PROBLEM - Host cr4-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%
19:20 <+icinga-wm> PROBLEM - Host re0.cr4-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%

19:21 <+icinga-wm> PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 59, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down

19:23 < mutante> volans: looks like there was an issue with the link between 3 and 4 before  https://phabricator.wikimedia.org/T196030

19:24 <+icinga-wm> RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0  https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down

19:24 <+icinga-wm> RECOVERY - Host cr4-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 75.46 ms

19:25 <+icinga-wm> PROBLEM - PyBal BGP sessions are established on lvs4007 is CRITICAL: 0 le 0 https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=ulsfo+prometheus/ops

19:25 < volans> System booted: 2019-04-16 23:19:34 UTC (00:05:42 ago)

19:25 <+icinga-wm> RECOVERY - Host cr4-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 75.40 ms
19:25 <+icinga-wm> RECOVERY - Host re0.cr4-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 75.05 ms

19:26 <+icinga-wm> RECOVERY - PyBal BGP sessions are established on lvs4007 is OK: (C)0 le (W)0 le 1 https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=ulsfo+prometheus/ops
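For context on the "Router interfaces" alert above: the "interfaces up: N, down: M" summary comes from polling the operational status of each interface on the neighbouring router and counting the states. A minimal sketch of that idea, assuming SNMP access via Net-SNMP's snmpwalk (this is not the actual Icinga plugin; the hostname, community string, and exit-code policy below are placeholders):

#!/usr/bin/env python3
"""Illustrative sketch: count a router's interfaces by operational status,
roughly how a check like "Router interfaces on cr3-ulsfo" derives its
"interfaces up: N, down: M" summary. NOT the actual Icinga plugin; the
hostname and SNMP community below are placeholders."""
import subprocess
import sys

HOST = "cr3-ulsfo.wikimedia.org"   # placeholder target
COMMUNITY = "public"               # placeholder SNMP community string

def interface_status_counts(host, community):
    # Walk IF-MIB::ifOperStatus; with -Oqv, snmpwalk prints one state per line
    # ("up", "down", "dormant", ...).
    out = subprocess.run(
        ["snmpwalk", "-v2c", "-c", community, "-Oqv", host, "IF-MIB::ifOperStatus"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    return {state: out.count(state) for state in ("up", "down", "dormant")}

if __name__ == "__main__":
    counts = interface_status_counts(HOST, COMMUNITY)
    print("interfaces up: {up}, down: {down}, dormant: {dormant}".format(**counts))
    # Exit 2 (CRITICAL in Nagios/Icinga terms) if anything is down, 0 (OK) otherwise.
    sys.exit(2 if counts["down"] else 0)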

Event Timeline

Apr 16 23:20:49  cr4-ulsfo kernel: spin lock 0xfffff80012ce73c0 (turnstile lock) held by 0xfffff8000941d560 (tid 100012) too long
Apr 16 23:20:49  cr4-ulsfo kernel: panic: spin lock held too long
Apr 16 23:20:49  cr4-ulsfo kernel: cpuid = 0
Apr 16 23:20:49  cr4-ulsfo kernel: Uptime: 204d2h14m5s
Apr 16 23:20:49  cr4-ulsfo kernel: Dumping 1237 out of 16341 MB:..2%..11%..21%..32%..41%..51%..61%..72%..81%..91%
Apr 16 23:20:49  cr4-ulsfo kernel: Dump complete
Apr 16 23:20:49  cr4-ulsfo kernel: Automatic reboot in 15 seconds - press a key on the console to abort
Apr 16 23:20:49  cr4-ulsfo kernel: Rebooting...

And it generated a 1.2G core dump.
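For triage, the relevant facts in the console log above are the panic reason (a spin lock held past the kernel's threshold), the uptime at the time of the crash (204 days), and the partial memory dump. A small, hypothetical helper to pull those fields out of a syslog excerpt for a JTAC case could look like this (the regexes simply match the format of the lines above; this is not an existing Wikimedia tool):

#!/usr/bin/env python3
"""Illustrative sketch: extract the panic reason, pre-crash uptime and dump
size from a console/syslog excerpt like the one above. Not an existing
Wikimedia tool; the patterns are assumed from the log format in this task."""
import re
import sys

PATTERNS = {
    "panic": re.compile(r"kernel: panic: (.+)"),
    "uptime": re.compile(r"kernel: Uptime: (\S+)"),
    "dump_mb": re.compile(r"kernel: Dumping (\d+) out of (\d+) MB"),
}

def summarize(lines):
    summary = {}
    for line in lines:
        for key, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                summary[key] = match.groups()
    return summary

if __name__ == "__main__":
    # Usage: python3 panic_summary.py < console-log.txt
    print(summarize(sys.stdin))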

I opened high-severity JTAC case 2019-0416-0819 and provided them with all the information; I will update here when I hear back from them.

Regarding impact, we can see a small dip in inbound packets to the LVS:
https://grafana.wikimedia.org/d/000000343/load-balancers?orgId=1&panelId=11&fullscreen&from=1555455629368&to=1555457808525
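The same window can also be pulled straight from Prometheus with the query_range HTTP API. A sketch, assuming node_exporter-style metrics on the LVS hosts (the Prometheus URL, metric name, and labels below are assumptions, not necessarily what that dashboard uses):

#!/usr/bin/env python3
"""Illustrative sketch: query a Prometheus server for the inbound packet rate
on the ulsfo LVS hosts around the crash window, similar to the Grafana panel
linked above. The Prometheus URL, metric name, and labels are assumptions."""
import requests

PROMETHEUS = "http://prometheus.example.org/ops"   # placeholder URL
QUERY = 'sum(rate(node_network_receive_packets_total{instance=~"lvs400.*"}[5m]))'

resp = requests.get(
    f"{PROMETHEUS}/api/v1/query_range",
    params={
        "query": QUERY,
        "start": 1555455629,   # window copied from the Grafana link above (seconds)
        "end": 1555457808,
        "step": "60s",
    },
    timeout=10,
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    for timestamp, value in series["values"]:
        print(timestamp, value)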

All failover and failback happened as expected.

ayounsi triaged this task as Medium priority. Apr 17 2019, 12:12 AM

I have checked the core and the information; we did not find any existing PR (Problem Report) related to this. Please give us a few days to analyze the core.

After checking the information, we have created PR 1433009; engineering will analyze it and update us on their findings.

After checking the core, the engineering team has an update on what happened:
“The thread that is holding the lock seem to have corrupted stack and is holding the lock for a very long time. Other threads starved for this lock beyond threshold and hence the crash.”
They are gathering more information on this and will update us on their findings.

Still no news; I asked to escalate the case.

According to engineering, there is not much information that can be extracted from the crash, as the offending thread does not contain any information and is blank.
The issue was not reproducible in the lab, so the root cause could not be traced.
Please let me know if you see any new crashes or issues on the box, so that we can investigate with more information.

The issue hasn't happened since. Not much more we can do. Closing the task.