Page MenuHomePhabricator

logstash1012 lock up caused central logging stuck
Closed, InvalidPublic

Description

Looks like logstash1012 locked up earlier today, a powercycle brought it back to life.

# ipmi-sel
ID  | Date        | Time     | Name             | Type                        | Event
1   | Jan-08-2019 | 21:27:32 | SEL              | Event Logging Disabled      | Log Area Reset/Cleared
2   | Feb-11-2019 | 11:50:47 | Fan Redundancy   | Fan                         | Fully Redundant
3   | Apr-09-2019 | 11:50:02 | Additional Info  | OEM Reserved                | OEM Event Offset = 02h ; OEM Event Data2 code = 02h ; OEM Event Data3 code = 00h
4   | Apr-09-2019 | 11:50:02 | Chipset Info     | OEM Reserved                | OEM Event Offset = 00h ; OEM Event Data2 code = F2h ; OEM Event Data3 code = 17h
5   | Apr-09-2019 | 11:50:02 | Err Reg Pointer  | OEM Reserved                | OEM Event Offset = 00h
6   | Apr-09-2019 | 11:50:02 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 01h ; OEM Event Data3 code = 14h
7   | Apr-09-2019 | 11:50:02 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Dh
8   | Apr-09-2019 | 11:50:02 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
9   | Apr-09-2019 | 11:50:02 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h ; OEM Event Data2 code = 00h
10  | Apr-09-2019 | 11:50:02 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
11  | Apr-09-2019 | 11:50:02 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Eh
12  | Apr-09-2019 | 11:50:03 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
13  | Apr-09-2019 | 11:50:03 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 04h
14  | Apr-09-2019 | 11:50:03 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Fh
15  | Apr-09-2019 | 11:50:03 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Fh
16  | Apr-09-2019 | 11:50:03 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
17  | Apr-09-2019 | 11:50:03 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 04h
18  | Apr-09-2019 | 11:50:03 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Fh

Event Timeline

Additionally it looks like syslog traffic towards central logs (wezen/lithium) dropped at around the same time, which is unexpected as the two destinations (central syslog + kafka) should be independent.

fgiunchedi renamed this task from logstash1012 lock up to logstash1012 lock up caused central logging stuck.Apr 9 2019, 1:49 PM

For exact reasons still unknown it looks like traffic to lithium/wezen also stopped around the same time logstash1012 went offline:

2019-04-09-154930_597x348_scrot.png (348×597 px, 25 KB)
2019-04-09-154938_601x342_scrot.png (342×601 px, 26 KB)

Bouncing rsyslog on lithium yielded very little results, however restarting rsyslog on wezen (after getting debug information, as per https://wikitech.wikimedia.org/wiki/Rsyslog) resulted in syslog tls traffic starting flowing again.

The strace on wezen showed:

4664  recvfrom(245, 0x7f1aa3d64e90, 313, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
4664  recvfrom(245, 0x7f1aa3d64e90, 313, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
4664  recvfrom(245, 0x7f1aa3d64e90, 313, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
4664  recvfrom(245, 0x7f1aa3d64e90, 313, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
4664  recvfrom(245, 0x7f1aa3d64e90, 313, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
4664  recvfrom(245, 0x7f1aa3d64e90, 313, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)

and sure enough, from lsof that fd was logstash1012's (although unclear if the connection is from the lockup or after the reboot)

rsyslogd 4652 root  245u  IPv4          137305187        0t0        TCP wezen.codfw.wmnet:syslog-tls->logstash1012.eqiad.wmnet:37198 (ESTABLISHED)

My finger is pointed towards a gnutls failure mode / bug, which wezen/lithium being jessie systems have version 3.3.8, compared to 3.5.8 in stretch.

colewhite lowered the priority of this task from High to Medium.