rsyslog output modules (fwd, kafka) failures should not affect each other
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	fgiunchedi
	Jun 27 2019, 11:12 AM

Description

Last message on logstash was at 2019-06-27T01:09:29 and the next one 2019-06-27T06:35:13 (going with @timestamp from logstash, however the dt field for the lagged messages is correct (i.e around 1.10)

Last 5xx:

@timestamp	      	2019-06-27T01:09:41
dt	      	2019-06-27T01:09:29

Then a single message at

@timestamp	      	2019-06-27T02:45:58
 dt	      	2019-06-27T01:09:32

Then resumed at

@timestamp	      	2019-06-27T06:35:13
dt	      	2019-06-27T01:10:22

Other than the fact that rsyslog didn't self recover, there's also these points we'll need to address:

rsyslog omfwd to central syslog failed, but omkafka delivery failed too, and they should fail independently instead
rsyslog omfwd actions should be given an explicit names so we have metrics for them (e.g. action_failed / action_processed)

Starting with https://gerrit.wikimedia.org/r/c/operations/puppet/+/520012

Details

Subject	Repo	Branch	Lines +/-
hieradata: enable rsyslog queues for kafka in codfw	operations/puppet	production	+1 -0
hieradata: enable rsyslog queues for kafka in esams/eqsin/eqiad	operations/puppet	production	+3 -0
hieradata: enable rsyslog kafka queues in ulsfo	operations/puppet	production	+2 -0
profile: look up rsyslog queue_size in scope	operations/puppet	production	+15 -15
profile: remove rollout flag for rsyslog queues	operations/puppet	production	+0 -14
profile: add queues to rsyslog kafka output	operations/puppet	production	+54 -1
base: remove obsolete enable_rsyslog_exporter	operations/puppet	production	+3 -16
hieradata: enable remote syslog queues in codfw	operations/puppet	production	+1 -0
hieradata: enable rsyslog queues in esams/eqiad	operations/puppet	production	+2 -0
hieradata: enable rsyslog queues in ulsfo/eqsin	operations/puppet	production	+3 -0
base: add remote syslog queues	operations/puppet	production	+21 -4
Add rsyslog delivery actions failure alerts	operations/puppet	production	+21 -0
rsyslog: use named actions for central syslog hosts	operations/puppet	production	+16 -10

Related Objects

Mentioned In: T219634: Alert suspended rsyslog actions
Mentioned Here: T199406: rsyslog's in:imtcp thread stuck on recvfrom loop from down/rebooted hosts

Event Timeline

fgiunchedi created this task.Jun 27 2019, 11:12 AM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJun 27 2019, 11:12 AM

fgiunchedi updated the task description. (Show Details)Jun 27 2019, 11:20 AM

So kafkatee-webrequest wrote the logs as expected around that time frame, at least in journald

Jun 27 01:09:45 weblog1001 webrequest[5207]: @cee:
Jun 27 01:09:45 weblog1001 webrequest[5207]: @cee:
Jun 27 01:09:45 weblog1001 webrequest[5207]: @cee:
Jun 27 01:09:45 weblog1001 webrequest[5207]: @cee:
Jun 27 01:09:51 weblog1001 webrequest[5207]: @cee:
Jun 27 01:09:51 weblog1001 webrequest[5207]: @cee:
Jun 27 01:09:51 weblog1001 webrequest[5207]: @cee:
Jun 27 01:09:51 weblog1001 webrequest[5207]: @cee:
Jun 27 01:09:51 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:03 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:03 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:03 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:03 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:03 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:09 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:09 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:09 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:09 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:09 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:14 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:14 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:14 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:14 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:35 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:35 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:35 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:35 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:28 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:28 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:28 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:28 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:28 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:34 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:34 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:34 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:34 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:35 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:35 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:35 weblog1001 webrequest[5207]: @cee:

On the central syslog servers it looks like a reoccurence of T199406: rsyslog's in:imtcp thread stuck on recvfrom loop from down/rebooted hosts which might explain the lagging

Other than the fact that rsyslog didn't self recover, there's also these points we'll need to address:

rsyslog omfwd to central syslog failed, but omkafka delivery failed too, and they should fail independently instead
rsyslog omfwd actions should be given an explicit names so we have metrics for them (e.g. action_failed / action_processed)

herron subscribed.Jun 27 2019, 2:19 PM

Change 520012 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] rsyslog: use named actions for central syslog hosts

https://gerrit.wikimedia.org/r/520012

gerritbot added a project: Patch-For-Review.Jul 1 2019, 1:45 PM

fgiunchedi updated the task description. (Show Details)Jul 2 2019, 12:14 PM

Change 520012 merged by Filippo Giunchedi:
[operations/puppet@production] rsyslog: use named actions for central syslog hosts

https://gerrit.wikimedia.org/r/520012

Maintenance_bot removed a project: Patch-For-Review.Jul 3 2019, 8:10 AM

fgiunchedi updated the task description. (Show Details)Jul 3 2019, 1:24 PM

Change 520709 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] Add rsyslog delivery actions failure alerts

https://gerrit.wikimedia.org/r/520709

gerritbot added a project: Patch-For-Review.Jul 4 2019, 9:09 AM

fgiunchedi moved this task from Inbox to In progress on the observability board.Jul 8 2019, 12:48 PM

Change 520709 merged by Filippo Giunchedi:
[operations/puppet@production] Add rsyslog delivery actions failure alerts

https://gerrit.wikimedia.org/r/520709

Maintenance_bot removed a project: Patch-For-Review.Jul 17 2019, 9:10 AM

fgiunchedi added a project: User-fgiunchedi.Jul 18 2019, 11:09 AM

fgiunchedi mentioned this in T219634: Alert suspended rsyslog actions.Jul 22 2019, 2:22 PM

fgiunchedi moved this task from In progress to Inbox on the observability board.Oct 28 2019, 2:15 PM

fgiunchedi renamed this task from webrequest 5xx data in logstash stopped at ~1:10 2019/06/26 and catched up at ~6:30 2019/06/27 to rsyslog output modules (fwd, kafka) failures should not affect each other.Jul 6 2020, 11:44 AM

fgiunchedi moved this task from Inbox to Backlog on the observability board.

Change 626186 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] base: add remote syslog queues

https://gerrit.wikimedia.org/r/626186

gerritbot added a project: Patch-For-Review.Sep 9 2020, 5:10 PM

Change 626186 merged by Filippo Giunchedi:
[operations/puppet@production] base: add remote syslog queues

https://gerrit.wikimedia.org/r/626186

Change 627232 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable rsyslog queues in ulsfo/eqsin

https://gerrit.wikimedia.org/r/627232

Change 627232 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable rsyslog queues in ulsfo/eqsin

https://gerrit.wikimedia.org/r/627232

fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board.Sep 15 2020, 7:45 AM

Change 627418 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable rsyslog queues in esams/eqiad

https://gerrit.wikimedia.org/r/627418

Change 627418 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable rsyslog queues in esams/eqiad

https://gerrit.wikimedia.org/r/627418

Change 627816 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable remote syslog queues in codfw

https://gerrit.wikimedia.org/r/627816

Change 627821 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] base: remove obsolete enable_rsyslog_exporter

https://gerrit.wikimedia.org/r/627821

Change 627865 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] profile: add queues to rsyslog kafka output

https://gerrit.wikimedia.org/r/627865

Change 627816 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable remote syslog queues in codfw

https://gerrit.wikimedia.org/r/627816

Change 627821 merged by Filippo Giunchedi:
[operations/puppet@production] base: remove obsolete enable_rsyslog_exporter

https://gerrit.wikimedia.org/r/627821

Change 627865 merged by Filippo Giunchedi:
[operations/puppet@production] profile: add queues to rsyslog kafka output

https://gerrit.wikimedia.org/r/627865

fgiunchedi moved this task from Backlog to In progress on the observability board.Sep 29 2020, 12:26 PM

Change 631438 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] profile: remove rollout flag for rsyslog queues

https://gerrit.wikimedia.org/r/631438

Change 631439 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable rsyslog kafka queues in ulsfo

https://gerrit.wikimedia.org/r/631439

Change 631446 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] profile: look up rsyslog queue_size in scope

https://gerrit.wikimedia.org/r/631446

Change 631438 merged by Filippo Giunchedi:
[operations/puppet@production] profile: remove rollout flag for rsyslog queues

https://gerrit.wikimedia.org/r/631438

Change 631446 merged by Filippo Giunchedi:
[operations/puppet@production] profile: look up rsyslog queue_size in scope

https://gerrit.wikimedia.org/r/631446

Change 631439 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable rsyslog kafka queues in ulsfo

https://gerrit.wikimedia.org/r/631439

Change 632512 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable rsyslog queues for kafka in esams/eqsin/eqiad

https://gerrit.wikimedia.org/r/632512

Change 632512 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable rsyslog queues for kafka in esams/eqsin/eqiad

https://gerrit.wikimedia.org/r/632512

Change 632655 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable rsyslog queues for kafka in codfw

https://gerrit.wikimedia.org/r/632655

Change 632655 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable rsyslog queues for kafka in codfw