Investigate log shipping methods and standardize on them (logstash)
Closed, ResolvedPublic

Event Timeline

As part of the logging infrastructure design document I have been looking into rsyslog's ability to write to Kafka via omkafka. The module "works" in jessie for a basic use case, though even a basic stress test is quickly able to uncover some issues. Indeed omkafka received substantial work to make it more reliable and properly use librdkafka, as documented in https://github.com/rsyslog/rsyslog/issues/1052.

Unfortunately said work happened recently in rsyslog (starting in 8.28.0), which isn't available in jessie-backports. A quick attempt to backport a recent version of rsyslog from stretch-backports to jessie-backports didn't yield much joy: there are a lot of dependencies to backport as well.
If we were to write to Kafka from rsyslog it'd mean using rsyslog from stretch-backports on stretch only. In other words leaving out ~300 production hosts as of today (270 jessie + 35 trusty).

To be more specific, building rsyslog 8.37.0 on jessie (with backports enabled) fails with unmet dependencies:

The following packages have unmet dependencies:
 pbuilder-satisfydepends-dummy : Depends: debhelper (>= 10.3) but 10.2.5~bpo8+1 is to be installed.
                                 Depends: libmongoc-dev which is a virtual package.
                                 Depends: libbson-dev which is a virtual package.
                                 Depends: librelp-dev (>= 1.2.16) but it is not going to be installed.
                                 Depends: liblognorm-dev (>= 2.0.3) but it is not going to be installed.
                                 Depends: libfastjson-dev (>= 0.99.8) but it is not going to be installed.
                                 Depends: libczmq-dev (>= 3.0.2) which is a virtual package.
 systemd : Depends: libsystemd0 (= 215-17+deb8u7) but 230-7~bpo8+2 is to be installed.

My first attempt targeted an official jessie-backports upload. The debhelper (>= 10.3) dependency is due to dh_missing --fail-missing invocations; as a test I removed those and loosened the dh dependency.

librelp with the loosened dh dependency fails to build because dh-autoreconf runs twice (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=880840 and https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=887482, the latter "fixed" in dh-autoreconf 16, the former in dh 11).

At this point the rabbit hole seemed deep enough, so I made a second attempt with a non-official backport (i.e. one for WMF only, which lets us drop dependencies we don't use). This looks better: I removed libmongoc-dev, libbson-dev, librelp-dev, libczmq-dev and their respective binary packages.
With libfastjson (0.99.8-1~bpo8+1) and liblognorm (2.0.3-1~bpo8+1) built and made available to cowbuilder, rsyslog (8.37.0-2~bpo8+1) could at least be built.

I've set up a test rig for rsyslog + omkafka on the WMCS instances logging-jessie01 and logging-stretch01, with the former being the producer and the latter running Kafka.

The configuration used resembles what we'd likely be using if we had both unstructured and structured (json) producers writing to /dev/log (IOW using mmjsonparse to parse json in the syslog message payload). Additionally, a test rsyslog_exporter setup is used to gauge/test Prometheus metrics.

module(
  load="impstats"
  interval="10"
  format="json"
  resetCounters="off"
  ruleset="process_stats"
)

module(load="omprog")

ruleset(name="process_stats") {
  action(
    type="omprog"
    name="to_exporter"
    # https://github.com/rsyslog/rsyslog/issues/1858
    binary="/home/filippo/go/bin/rsyslog_exporter -web.listen-address :9104"
    output="/tmp/rsyslog_exporter.out"
  )
}

module(load="omkafka")

module(load="mmjsonparse")

# Starting from rsyslog 8.33 it is easier than below to generate json from templates
# https://github.com/rsyslog/rsyslog/pull/2475
template(name="syslog_json" type="list" option.json="on") {
  constant(value="{")
  constant(value="\"timestamp\":\"")
  property(name="timereported" dateFormat="rfc3339")
  constant(value="\",\"message\":\"")
  property(name="msg")
  constant(value="\",\"host\":\"")
  property(name="hostname")
  constant(value="\",\"severity\":\"")
  property(name="syslogseverity-text")
  constant(value="\",\"facility\":\"")
  property(name="syslogfacility-text")
  constant(value="\",\"syslog-tag\":\"")
  property(name="syslogtag")
  constant(value="\"}")
}

template(name="json_json" type="list") {
  property(name="$!all-json")
}

action(type="mmjsonparse" cookie="" name="mmjsonparse")
if $parsesuccess == "OK" then {
  action(
    broker=["logging-stretch01.logging.eqiad.wmflabs:9092"]
    type="omkafka"
    name="kafka_json"
    topic="logging.rsyslog"
    template="json_json"
  )
} else if $parsesuccess == "FAIL" then {
  action(
    broker=["logging-stretch01.logging.eqiad.wmflabs:9092"]
    type="omkafka"
    name="kafka_syslog"
    topic="logging.rsyslog"
    template="syslog_json"
  )
}
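The syslog_json template above assembles JSON by hand, relying on option.json="on" to escape property values. As a sanity check on the record shape, here is a minimal Python sketch that builds the same shape with json.dumps and round-trips it; all field values are hypothetical samples, not taken from the test rig:

```python
import json

# Hypothetical sample values standing in for the rsyslog properties used by
# the syslog_json template (timereported, msg, hostname, syslogseverity-text,
# syslogfacility-text, syslogtag).
record = {
    "timestamp": "2018-09-14T09:12:12+00:00",
    "message": " payload with \"quotes\" that needs escaping",
    "host": "logging-jessie01",
    "severity": "info",
    "facility": "auth",
    "syslog-tag": "perftest-syslog:",
}

# json.dumps does the escaping that option.json="on" provides in rsyslog.
line = json.dumps(record)

# Round-trip: whatever a downstream consumer parses back must equal what
# was produced.
assert json.loads(line) == record
print(line)
```

If the escaping is wrong (e.g. a raw double quote in the message), the round-trip fails, which is exactly the failure mode to watch for with hand-built JSON templates.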

The smoke test for syslog is crude and aimed at assessing basic reliability and performance, i.e. produce with:
zcat /var/log/auth.log*gz | cut -b 34- | pv -l --rate-limit LIMIT | logger -t perftest-syslog

and consume with:
kafkacat -b localhost -C -t logging.rsyslog -o end -u | pv -l >/dev/null

and finally compare the number of lines produced/consumed by pv.

So far setting LIMIT to 5000 seems to work (i.e. the consumer receives as many messages as the producer sent), with the consumer maxing out at about 2000 messages/s. Bumping the producer to 7500 messages/s results in the consumer getting only about 80% of the messages (146k vs 181k), with rsyslog reporting that the librdkafka queue is full (investigation continues):

Sep 14 09:12:12 logging-jessie01 rsyslogd[1954]: omkafka: Failed to produce to topic 'logging.rsyslog' (rd_kafka_producev)partition 0: -184/Local: Queue full  [v8.37.0 try http://www.rsyslog.com/e/2424 ]
Sep 14 09:12:12 logging-jessie01 rsyslogd[1954]: action 'kafka_syslog' suspended (module 'omkafka'), retry 0. There should be messages before this one giving the reason for suspension. [v8.37.0 try http://www.rsyslog.com/e/2007 ]
Sep 14 09:12:13 logging-jessie01 rsyslogd[1954]: action 'kafka_syslog' suspended (module 'omkafka'), next retry is Fri Sep 14 09:12:43 2018, retry nbr 0. There should be messages before this one giving the reason for suspension. [v8.37.0 try http://www.rsyslog.com/e/2007 ]
Sep 14 09:13:31 logging-jessie01 rsyslogd[1954]: action 'kafka_syslog' resumed (module 'omkafka') [v8.37.0 try http://www.rsyslog.com/e/2359 ]
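The loss figure above is just the ratio of the two pv line counts; a trivial Python sketch with the rounded counts from the 7500 msg/s run:

```python
# Loss arithmetic for the 7500 msg/s run: pv -l reports line counts on
# both ends of the pipeline, and the ratio is all there is to it.
produced = 181_000   # lines sent through logger (producer-side pv)
consumed = 146_000   # lines read back by kafkacat (consumer-side pv)

ratio = consumed / produced
print(f"consumer received {ratio:.0%} of produced messages")
```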

Trying again with the minimal configuration below, which should retry and save failed messages.

module(load="impstats"
       interval="10"
       severity="7"
       log.syslog="off"
       log.file="/tmp/rsyslog.stats")

module(load="omkafka")   # lets you send to Kafka

# Starting from rsyslog 8.33 it is easier to generate json from templates
# https://github.com/rsyslog/rsyslog/pull/2475
template(name="json_lines" type="list" option.json="on") {
  constant(value="{")
  constant(value="\"timestamp\":\"")
  property(name="timereported" dateFormat="rfc3339")
  constant(value="\",\"message\":\"")
  property(name="msg")
  constant(value="\",\"host\":\"")
  property(name="hostname")
  constant(value="\",\"severity\":\"")
  property(name="syslogseverity-text")
  constant(value="\",\"facility\":\"")
  property(name="syslogfacility-text")
  constant(value="\",\"syslog-tag\":\"")
  property(name="syslogtag")
  constant(value="\"}")
}
action(
  broker=["logging-stretch01.logging.eqiad.wmflabs:9092"]
  type="omkafka"
  topic="rsyslog"
  template="json_lines"
  resubmitOnFailure="on"
  KeepFailedMessages="on"
  failedMsgFile="rsyslog_kafka_failed"
)

Out of the 181k messages sent, about 120k made it to the topic, with rsyslog reporting:

Sep 14 15:09:48 logging-jessie01 rsyslogd[7405]: omkafka: Failed to produce to topic 'logging.rsyslog' (rd_kafka_producev)partition 0: -184/Local: Queue full  [v8.37.0 try http://www.rsyslog.com/e/2424 ]
Sep 14 15:09:48 logging-jessie01 rsyslogd[7405]: action 'kafka_syslog' suspended (module 'omkafka'), retry 0. There should be messages before this one giving the reason for suspension. [v8.37.0 try http://www.rsyslog.com/e/2007 ]
Sep 14 15:09:49 logging-jessie01 rsyslogd[7405]: action 'kafka_syslog' suspended (module 'omkafka'), next retry is Fri Sep 14 15:10:19 2018, retry nbr 0. There should be messages before this one giving the reason for suspension. [v8.37.0 try http://www.rsyslog.com/e/2007 ]
Sep 14 15:10:31 logging-jessie01 rsyslogd[7405]: action 'kafka_syslog' resumed (module 'omkafka') [v8.37.0 try http://www.rsyslog.com/e/2359 ]

The internal rsyslog stats report this after sending has finished:

Fri Sep 14 15:10:36 2018: global: origin=dynstats
Fri Sep 14 15:10:36 2018: imuxsock: origin=imuxsock submitted=181982 ratelimit.discarded=0 ratelimit.numratelimiters=0
Fri Sep 14 15:10:36 2018: omkafka: origin=omkafka submitted=120957 maxoutqsize=100000 failures=1 topicdynacache.skipped=0 topicdynacache.miss=0 topicdynacache.evicted=0 acked=120953 failures_msg_too_large=0 failures_unknown_topic=0 failures_queue_full=1 failures_unknown_partition=0 failures_other=0 errors_timed_out=0 errors_transport=0 errors_broker_down=0 errors_auth=0 errors_other=0 rtt_avg_usec=0 throttle_avg_msec=0 int_latency_avg_usec=0
Fri Sep 14 15:10:36 2018: kafka_syslog: origin=core.action processed=181982 failed=61024 suspended=1 suspended.duration=30 resumed=0
Fri Sep 14 15:10:36 2018: resource-usage: origin=impstats utime=3884000 stime=3800000 maxrss=204360 minflt=50540 majflt=0 inblock=0 oublock=48 nvcsw=519140 nivcsw=105804 openfiles=15
Fri Sep 14 15:10:36 2018: main Q: origin=core.queue size=0 enqueued=181982 full=0 discarded.full=0 discarded.nf=0 maxqsize=24106

and indeed failed=61024 for the kafka_syslog action appears to account for the missing messages.
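The impstats counters can be cross-checked directly: what omkafka actually submitted plus the kafka_syslog action's failed count should account for everything imuxsock enqueued. A quick Python check on the numbers above:

```python
# Counters copied from the impstats output above.
imuxsock_submitted = 181_982   # imuxsock: submitted=181982
omkafka_submitted = 120_957    # omkafka: submitted=120957
action_failed = 61_024         # kafka_syslog: failed=61024

# The sums line up to within one message (plausibly one in flight when
# the stats snapshot was taken), supporting the reading that failed=61024
# is the missing messages.
gap = imuxsock_submitted - (omkafka_submitted + action_failed)
assert abs(gap) <= 1
print(f"unaccounted messages: {gap}")
```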

I've experimented with librdkafka settings: with queue.buffering.max.ms=50 and batch.num.messages=1000 I've been able to reach produce speeds high enough that omkafka is no longer the bottleneck.
Producing directly to rsyslogd (i.e. logger --socket /run/systemd/journal/syslog) reaches >= 60k messages/s, whereas producing via systemd-journald (i.e. /dev/log) maxes out at about 20k messages/s on a WMCS medium instance.
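One way to read those settings (this interpretation is mine, not stated above): in librdkafka a batch is dispatched when it is full or when the linger timer (queue.buffering.max.ms) expires, whichever comes first. At the observed rates, batches fill well before the 50 ms timer, so the timer stops being the limiting factor:

```python
# Back-of-the-envelope check with the numbers from the test.
rate_msgs_per_s = 60_000   # observed when producing straight to rsyslogd
batch_size = 1_000         # batch.num.messages
linger_ms = 50             # queue.buffering.max.ms

# Time to fill one batch at the observed rate.
fill_time_ms = batch_size / rate_msgs_per_s * 1000
assert fill_time_ms < linger_ms   # batches ship full, not on the timer
print(f"batch fills in {fill_time_ms:.1f} ms (linger is {linger_ms} ms)")
```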

fgiunchedi claimed this task.

The architecture and gaps review has been carried out as part of the logging infrastructure design document; implementation next steps will be carried out in Q2.