
rsyslogd: omkafka: action will suspended due to kafka error -187: Local: All broker connections are down
Open, Medium · Public

Description

While investigating a different issue with @akosiaris, we ran into the following error messages on different hosts:

Dec 12 08:36:20 mwdebug1002 rsyslogd: omkafka: action will suspended due to kafka error -187: Local: All broker connections are down [v8.1901.0 try https://www.rsyslog.com/e/2422 ]
Dec 12 08:36:20 mwdebug1002 rsyslogd: omkafka: action will suspended due to kafka error -187: Local: All broker connections are down [v8.1901.0 try https://www.rsyslog.com/e/2422 ]
Dec 12 08:36:20 mwdebug1002 rsyslogd: omkafka: action will suspended due to kafka error -187: Local: All broker connections are down [v8.1901.0 try https://www.rsyslog.com/e/2422 ]
Dec 12 08:36:20 mwdebug1002 rsyslogd: action 'action-8-omkafka' suspended (module 'omkafka'), retry 0. There should be messages before this one giving the reason for suspension. [v8.1901.0 try https://www.rsyslog.com/e/2007 ]
Dec 12 08:36:21 mwdebug1002 rsyslogd: action 'action-8-omkafka' resumed (module 'omkafka') [v8.1901.0 try https://www.rsyslog.com/e/2359 ]

We ran a search for occurrences of these messages across the fleet for Dec 12 09:00 - 09:59 UTC; the results below group hosts by the number of matches in that hour (a sketch of one way to gather such counts follows the table):

occurrences  hosts
5            ms-be2038.codfw.wmnet,mw[2143,2175,2214-2215,2230,2237,2244,2258].codfw.wmnet,scb[2001-2002,2006].codfw.wmnet
10           ms-be2019.codfw.wmnet,ms-be1032.eqiad.wmnet,mw[2208,2285].codfw.wmnet,mw1340.eqiad.wmnet
8            ms-be2047.codfw.wmnet,ms-be1031.eqiad.wmnet,mw[2194,2211,2269].codfw.wmnet,mw1279.eqiad.wmnet
6            ganeti2005.codfw.wmnet,mw[2182,2277].codfw.wmnet,webperf2002.codfw.wmnet
16           conf2002.codfw.wmnet,ores1002.eqiad.wmnet
12           mw2233.codfw.wmnet
4            an-worker[1088,1090].eqiad.wmnet,ms-be[2017,2037,2050].codfw.wmnet,ms-be1026.eqiad.wmnet,mw[2163,2170,2177-2178,2190,2192,2217,2222,2242,2253,2261,2276].codfw.wmnet,mw[1281,1330].eqiad.wmnet
1            db[2089,2114].codfw.wmnet,db1113.eqiad.wmnet,elastic2035.codfw.wmnet,es1014.eqiad.wmnet,ganeti2007.codfw.wmnet,mw[2165,2168-2169,2196,2199,2205,2216,2227,2232,2252,2270,2284,2287].codfw.wmnet,mw[1245,1284,1347].eqiad.wmnet,releases1001.eqiad.wmnet,restbase2017.codfw.wmnet,scb[1003-1004].eqiad.wmnet,thumbor2002.codfw.wmnet,wtp2005.codfw.wmnet
3            aqs1009.eqiad.wmnet,db2121.codfw.wmnet,dbprov1001.eqiad.wmnet,mc1032.eqiad.wmnet,ms-be1028.eqiad.wmnet,mw2286.codfw.wmnet,mw1288.eqiad.wmnet,wtp2002.codfw.wmnet
7            mc1030.eqiad.wmnet,ms-be1030.eqiad.wmnet,mw[2138,2141,2203,2221].codfw.wmnet,mwdebug2002.codfw.wmnet,scb[2003-2004].codfw.wmnet
15           mc1031.eqiad.wmnet,ms-be1039.eqiad.wmnet,mw[2144,2146].codfw.wmnet
11           elastic2047.codfw.wmnet,mw[2236,2255,2272].codfw.wmnet
2            db1140.eqiad.wmnet,mw[2135,2140,2147,2166,2206].codfw.wmnet,mw[1232,1343].eqiad.wmnet,registry2001.codfw.wmnet,scb1001.eqiad.wmnet
9            ms-be1046.eqiad.wmnet,scb2005.codfw.wmnet
56           etcd1002.eqiad.wmnet
13           weblog1001.eqiad.wmnet,wtp2012.codfw.wmnet
63           conf2003.codfw.wmnet
0            <snip 1289 hosts>

Raw data is in P9863.
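
For the record, grouped output like the table above is what Cumin prints when several hosts return identical results. The exact query is not preserved here, but something along these lines, run from a cluster management host, would produce it (the host prompt, the A:all alias and the grep pattern are illustrative assumptions, not the command we actually ran):

cumin1001:~$ sudo cumin 'A:all' \
    "grep -c 'Dec 12 09:.*omkafka: action will suspended' /var/log/syslog || true"

The "|| true" keeps hosts with zero matches (the 1289-host group at the bottom) from being reported as command failures; the leading number in each group is simply the per-host match count for that hour.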

Event Timeline

jijiki created this task. · Dec 12 2019, 11:38 AM
Restricted Application added a subscriber: Aklapper. · Dec 12 2019, 11:38 AM
akosiaris triaged this task as High priority. · Dec 12 2019, 3:04 PM

I am going to triage this as High because there is a fear that we are currently losing logs because of this. This was uncovered during the investigation of T214734.

This error message might be a false positive. I think librdkafka spits it out (is rsyslog using librdkafka?) when there is no data to send for a period of time.

rsyslog does indeed use librdkafka, so it might be that! Re: losing logs, AFAICT that's not happening at the moment, in the sense that there are no discards reported by rsyslog-exporter (see also the dashboard at https://grafana.wikimedia.org/d/000000596/rsyslog):

mw2233:~$ curl -s localhost:9105/metrics  | grep -i -e discard -e kafka
rsyslog_action_failed{action="action-10-omkafka"} 0
rsyslog_action_failed{action="action-18-omkafka"} 0
rsyslog_action_failed{action="action-9-omkafka"} 0
rsyslog_action_failed{action="mmjsonparse_kafka"} 0
rsyslog_action_failed{action="mmjsonparse_kafka_raw"} 0
rsyslog_action_processed{action="action-10-omkafka"} 25
rsyslog_action_processed{action="action-18-omkafka"} 52
rsyslog_action_processed{action="action-9-omkafka"} 0
rsyslog_action_processed{action="mmjsonparse_kafka"} 25
rsyslog_action_processed{action="mmjsonparse_kafka_raw"} 25
rsyslog_action_resumed{action="action-10-omkafka"} 0
rsyslog_action_resumed{action="action-18-omkafka"} 0
rsyslog_action_resumed{action="action-9-omkafka"} 0
rsyslog_action_resumed{action="mmjsonparse_kafka"} 0
rsyslog_action_resumed{action="mmjsonparse_kafka_raw"} 0
rsyslog_action_suspended{action="action-10-omkafka"} 0
rsyslog_action_suspended{action="action-18-omkafka"} 0
rsyslog_action_suspended{action="action-9-omkafka"} 0
rsyslog_action_suspended{action="mmjsonparse_kafka"} 0
rsyslog_action_suspended{action="mmjsonparse_kafka_raw"} 0
rsyslog_action_suspended_duration{action="action-10-omkafka"} 0
rsyslog_action_suspended_duration{action="action-18-omkafka"} 0
rsyslog_action_suspended_duration{action="action-9-omkafka"} 0
rsyslog_action_suspended_duration{action="mmjsonparse_kafka"} 0
rsyslog_action_suspended_duration{action="mmjsonparse_kafka_raw"} 0
rsyslog_input_submitted{input="omkafka"} 77
# HELP rsyslog_queue_discarded_full messages discarded due to queue being full
# TYPE rsyslog_queue_discarded_full counter
rsyslog_queue_discarded_full{queue="main Q"} 0
rsyslog_queue_discarded_full{queue="udp_localhost_to_kafka"} 0
# HELP rsyslog_queue_discarded_not_full messages discarded when queue not full
# TYPE rsyslog_queue_discarded_not_full counter
rsyslog_queue_discarded_not_full{queue="main Q"} 0
rsyslog_queue_discarded_not_full{queue="udp_localhost_to_kafka"} 0
rsyslog_queue_enqueued{queue="udp_localhost_to_kafka"} 52
rsyslog_queue_full{queue="udp_localhost_to_kafka"} 0
rsyslog_queue_max_size{queue="udp_localhost_to_kafka"} 4
rsyslog_queue_size{queue="udp_localhost_to_kafka"} 0

This is from a first investigation at least; I'll take a closer look too!
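
If the "no data to send for a while" theory is right, the suspend/resume pairs should coincide with quiet periods on the affected host. A rough spot check along these lines could confirm that (the host, the use of /var/log/syslog and the metric selection are arbitrary choices, not an agreed procedure):

mw2233:~$ grep -E 'omkafka.*(suspended|resumed)' /var/log/syslog | tail -n 20
mw2233:~$ curl -s localhost:9105/metrics | grep -e rsyslog_action_processed -e discard

Re-running the curl a few minutes later and comparing: if the omkafka processed counters barely move while suspend/resume lines keep appearing, that supports the idle-connection explanation; if the discard counters start climbing instead, we really are losing logs.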

jcrespo moved this task from Backlog to Acknowledged on the Operations board. · Dec 12 2019, 6:24 PM
akosiaris lowered the priority of this task from High to Medium. · Dec 13 2019, 8:43 AM

Lowering priority per the comments above. Plus, there is a new theory in T214734.