Page MenuHomePhabricator

rsyslog output modules (fwd, kafka) failures should not affect each other
Open, Needs TriagePublic

Description

Last message on logstash was at 2019-06-27T01:09:29 and the next one 2019-06-27T06:35:13 (going with @timestamp from logstash, however the dt field for the lagged messages is correct (i.e around 1.10)

Last 5xx:

@timestamp	      	2019-06-27T01:09:41
dt	      	2019-06-27T01:09:29

Then a single message at

@timestamp	      	2019-06-27T02:45:58
 dt	      	2019-06-27T01:09:32

Then resumed at

@timestamp	      	2019-06-27T06:35:13
dt	      	2019-06-27T01:10:22

Other than the fact that rsyslog didn't self recover, there's also these points we'll need to address:

  • rsyslog omfwd to central syslog failed, but omkafka delivery failed too, and they should fail independently instead
  • rsyslog omfwd actions should be given an explicit names so we have metrics for them (e.g. action_failed / action_processed)

    Starting with https://gerrit.wikimedia.org/r/c/operations/puppet/+/520012

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJun 27 2019, 11:12 AM
fgiunchedi updated the task description. (Show Details)Jun 27 2019, 11:20 AM

So kafkatee-webrequest wrote the logs as expected around that time frame, at least in journald

Jun 27 01:09:45 weblog1001 webrequest[5207]: @cee:
Jun 27 01:09:45 weblog1001 webrequest[5207]: @cee:
Jun 27 01:09:45 weblog1001 webrequest[5207]: @cee:
Jun 27 01:09:45 weblog1001 webrequest[5207]: @cee:
Jun 27 01:09:51 weblog1001 webrequest[5207]: @cee:
Jun 27 01:09:51 weblog1001 webrequest[5207]: @cee:
Jun 27 01:09:51 weblog1001 webrequest[5207]: @cee:
Jun 27 01:09:51 weblog1001 webrequest[5207]: @cee:
Jun 27 01:09:51 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:03 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:03 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:03 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:03 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:03 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:09 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:09 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:09 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:09 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:09 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:14 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:14 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:14 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:14 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:35 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:35 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:35 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:35 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:28 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:28 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:28 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:28 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:28 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:34 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:34 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:34 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:34 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:35 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:35 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:35 weblog1001 webrequest[5207]: @cee:

On the central syslog servers it looks like a reoccurence of T199406: rsyslog's in:imtcp thread stuck on recvfrom loop from down/rebooted hosts which might explain the lagging

Other than the fact that rsyslog didn't self recover, there's also these points we'll need to address:

  • rsyslog omfwd to central syslog failed, but omkafka delivery failed too, and they should fail independently instead
  • rsyslog omfwd actions should be given an explicit names so we have metrics for them (e.g. action_failed / action_processed)
herron added a subscriber: herron.Jun 27 2019, 2:19 PM

Change 520012 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] rsyslog: use named actions for central syslog hosts

https://gerrit.wikimedia.org/r/520012

fgiunchedi updated the task description. (Show Details)Jul 2 2019, 12:14 PM

Change 520012 merged by Filippo Giunchedi:
[operations/puppet@production] rsyslog: use named actions for central syslog hosts

https://gerrit.wikimedia.org/r/520012

fgiunchedi updated the task description. (Show Details)Jul 3 2019, 1:24 PM

Change 520709 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] Add rsyslog delivery actions failure alerts

https://gerrit.wikimedia.org/r/520709

fgiunchedi moved this task from Inbox to In progress on the observability board.Jul 8 2019, 12:48 PM

Change 520709 merged by Filippo Giunchedi:
[operations/puppet@production] Add rsyslog delivery actions failure alerts

https://gerrit.wikimedia.org/r/520709

fgiunchedi moved this task from In progress to Inbox on the observability board.Oct 28 2019, 2:15 PM
fgiunchedi renamed this task from webrequest 5xx data in logstash stopped at ~1:10 2019/06/26 and catched up at ~6:30 2019/06/27 to rsyslog output modules (fwd, kafka) failures should not affect each other.Jul 6 2020, 11:44 AM
fgiunchedi moved this task from Inbox to Backlog on the observability board.

Change 626186 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] base: add remote syslog queues

https://gerrit.wikimedia.org/r/626186

Change 626186 merged by Filippo Giunchedi:
[operations/puppet@production] base: add remote syslog queues

https://gerrit.wikimedia.org/r/626186

Change 627232 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable rsyslog queues in ulsfo/eqsin

https://gerrit.wikimedia.org/r/627232

Change 627232 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable rsyslog queues in ulsfo/eqsin

https://gerrit.wikimedia.org/r/627232

fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board.Tue, Sep 15, 7:45 AM

Change 627418 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable rsyslog queues in esams/eqiad

https://gerrit.wikimedia.org/r/627418

Change 627418 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable rsyslog queues in esams/eqiad

https://gerrit.wikimedia.org/r/627418

Change 627816 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable remote syslog queues in codfw

https://gerrit.wikimedia.org/r/627816

Change 627821 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] base: remove obsolete enable_rsyslog_exporter

https://gerrit.wikimedia.org/r/627821

Change 627865 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] profile: add queues to rsyslog kafka output

https://gerrit.wikimedia.org/r/627865

Change 627816 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable remote syslog queues in codfw

https://gerrit.wikimedia.org/r/627816

Change 627821 merged by Filippo Giunchedi:
[operations/puppet@production] base: remove obsolete enable_rsyslog_exporter

https://gerrit.wikimedia.org/r/627821