Page MenuHomePhabricator

rsyslog output modules (fwd, kafka) failures should not affect each other
Closed, ResolvedPublic

Description

Last message on logstash was at 2019-06-27T01:09:29 and the next one 2019-06-27T06:35:13 (going with @timestamp from logstash, however the dt field for the lagged messages is correct (i.e around 1.10)

Last 5xx:

@timestamp	      	2019-06-27T01:09:41
dt	      	2019-06-27T01:09:29

Then a single message at

@timestamp	      	2019-06-27T02:45:58
 dt	      	2019-06-27T01:09:32

Then resumed at

@timestamp	      	2019-06-27T06:35:13
dt	      	2019-06-27T01:10:22

Other than the fact that rsyslog didn't self recover, there's also these points we'll need to address:

  • rsyslog omfwd to central syslog failed, but omkafka delivery failed too, and they should fail independently instead
  • rsyslog omfwd actions should be given an explicit names so we have metrics for them (e.g. action_failed / action_processed)

    Starting with https://gerrit.wikimedia.org/r/c/operations/puppet/+/520012

Event Timeline

So kafkatee-webrequest wrote the logs as expected around that time frame, at least in journald

Jun 27 01:09:45 weblog1001 webrequest[5207]: @cee:
Jun 27 01:09:45 weblog1001 webrequest[5207]: @cee:
Jun 27 01:09:45 weblog1001 webrequest[5207]: @cee:
Jun 27 01:09:45 weblog1001 webrequest[5207]: @cee:
Jun 27 01:09:51 weblog1001 webrequest[5207]: @cee:
Jun 27 01:09:51 weblog1001 webrequest[5207]: @cee:
Jun 27 01:09:51 weblog1001 webrequest[5207]: @cee:
Jun 27 01:09:51 weblog1001 webrequest[5207]: @cee:
Jun 27 01:09:51 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:03 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:03 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:03 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:03 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:03 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:09 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:09 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:09 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:09 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:09 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:14 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:14 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:14 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:14 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:35 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:35 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:35 weblog1001 webrequest[5207]: @cee:
Jun 27 01:10:35 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:17 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:28 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:28 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:28 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:28 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:28 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:34 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:34 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:34 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:34 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:35 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:35 weblog1001 webrequest[5207]: @cee:
Jun 27 01:11:35 weblog1001 webrequest[5207]: @cee:

On the central syslog servers it looks like a reoccurence of T199406: rsyslog's in:imtcp thread stuck on recvfrom loop from down/rebooted hosts which might explain the lagging

Other than the fact that rsyslog didn't self recover, there's also these points we'll need to address:

  • rsyslog omfwd to central syslog failed, but omkafka delivery failed too, and they should fail independently instead
  • rsyslog omfwd actions should be given an explicit names so we have metrics for them (e.g. action_failed / action_processed)

Change 520012 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] rsyslog: use named actions for central syslog hosts

https://gerrit.wikimedia.org/r/520012

Change 520012 merged by Filippo Giunchedi:
[operations/puppet@production] rsyslog: use named actions for central syslog hosts

https://gerrit.wikimedia.org/r/520012

Change 520709 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] Add rsyslog delivery actions failure alerts

https://gerrit.wikimedia.org/r/520709

Change 520709 merged by Filippo Giunchedi:
[operations/puppet@production] Add rsyslog delivery actions failure alerts

https://gerrit.wikimedia.org/r/520709

fgiunchedi renamed this task from webrequest 5xx data in logstash stopped at ~1:10 2019/06/26 and catched up at ~6:30 2019/06/27 to rsyslog output modules (fwd, kafka) failures should not affect each other.Jul 6 2020, 11:44 AM
fgiunchedi moved this task from Inbox to Backlog on the observability board.

Change 626186 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] base: add remote syslog queues

https://gerrit.wikimedia.org/r/626186

Change 626186 merged by Filippo Giunchedi:
[operations/puppet@production] base: add remote syslog queues

https://gerrit.wikimedia.org/r/626186

Change 627232 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable rsyslog queues in ulsfo/eqsin

https://gerrit.wikimedia.org/r/627232

Change 627232 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable rsyslog queues in ulsfo/eqsin

https://gerrit.wikimedia.org/r/627232

Change 627418 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable rsyslog queues in esams/eqiad

https://gerrit.wikimedia.org/r/627418

Change 627418 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable rsyslog queues in esams/eqiad

https://gerrit.wikimedia.org/r/627418

Change 627816 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable remote syslog queues in codfw

https://gerrit.wikimedia.org/r/627816

Change 627821 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] base: remove obsolete enable_rsyslog_exporter

https://gerrit.wikimedia.org/r/627821

Change 627865 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] profile: add queues to rsyslog kafka output

https://gerrit.wikimedia.org/r/627865

Change 627816 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable remote syslog queues in codfw

https://gerrit.wikimedia.org/r/627816

Change 627821 merged by Filippo Giunchedi:
[operations/puppet@production] base: remove obsolete enable_rsyslog_exporter

https://gerrit.wikimedia.org/r/627821

Change 627865 merged by Filippo Giunchedi:
[operations/puppet@production] profile: add queues to rsyslog kafka output

https://gerrit.wikimedia.org/r/627865

Change 631438 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] profile: remove rollout flag for rsyslog queues

https://gerrit.wikimedia.org/r/631438

Change 631439 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable rsyslog kafka queues in ulsfo

https://gerrit.wikimedia.org/r/631439

Change 631446 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] profile: look up rsyslog queue_size in scope

https://gerrit.wikimedia.org/r/631446

Change 631438 merged by Filippo Giunchedi:
[operations/puppet@production] profile: remove rollout flag for rsyslog queues

https://gerrit.wikimedia.org/r/631438

Change 631446 merged by Filippo Giunchedi:
[operations/puppet@production] profile: look up rsyslog queue_size in scope

https://gerrit.wikimedia.org/r/631446

Change 631439 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable rsyslog kafka queues in ulsfo

https://gerrit.wikimedia.org/r/631439

Change 632512 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable rsyslog queues for kafka in esams/eqsin/eqiad

https://gerrit.wikimedia.org/r/632512

Change 632512 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable rsyslog queues for kafka in esams/eqsin/eqiad

https://gerrit.wikimedia.org/r/632512

Change 632655 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable rsyslog queues for kafka in codfw

https://gerrit.wikimedia.org/r/632655

Change 632655 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable rsyslog queues for kafka in codfw

https://gerrit.wikimedia.org/r/632655

fgiunchedi claimed this task.

This is complete now, we have local queues for all rsyslog outputs