
Ship host syslogs to ELK
Open, Medium, Public

Description

Having host syslogs indexed and searchable in ELK would be useful for general troubleshooting, reporting, etc. across the fleet.

We have a logstash syslog listener and filters in place today, along with rsyslog configurations applied to certain host classes which are sending specific log types to logstash via syslog. Network device syslogs are being indexed as well.

It would be great to broaden this configuration, ideally to include the host syslogs that are being aggregated by syslog.(codfw|eqiad).wmnet.

A few things to figure out (for starters):

  • Capacity - How much headroom is there for new log sources in the current ELK infrastructure?
  • Retention - How long before these indices should be deleted?
  • Transport - It would be ideal to ship over TLS. Also, basic queueing/redelivery would be nice to have, as something to help prevent gaps.

Event Timeline

herron triaged this task as Medium priority. May 3 2018, 5:14 PM
herron created this task.

Thanks for kickstarting this! +1, having syslogs in ELK would be very useful indeed. Some partial answers to the things to figure out:

  • Capacity - I chatted with @Gehel at the last ops friday hangout about ELK and friends; it would be nice to get our feet wet with multiple indices instead of one single index. Syslog might be a good occasion for that. In this context I'm saying "one index", but it is really one index per day, prefixed e.g. with syslog.
  • Retention - max 90d as per privacy policy. I believe the max now on ELK is 30d. Another advantage of syslog in a different index is being able to apply retention to that index only for cases like this.
  • Transport - looks like logstash can receive syslog-tls already, according to https://discuss.elastic.co/t/logstash-input-tcp-with-tls-and-handshake/70126/2, and I believe rsyslog can provide on-disk spooling (disk-assisted queues, as rsyslog calls them).
  • Capacity - I chatted with @Gehel at the last ops friday hangout about ELK and friends; it would be nice to get our feet wet with multiple indices instead of one single index. Syslog might be a good occasion for that. In this context I'm saying "one index", but it is really one index per day, prefixed e.g. with syslog.

Sounds good to me! Something like logstash-syslog-DATE should help keep things expecting logstash-* working as-is.
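For illustration (not the exact production config), an elasticsearch output along these lines would write date-suffixed indices that still match logstash-*; the hosts value here is just a placeholder:

output {
        elasticsearch {
                # hypothetical host; anything prefixed logstash- still matches the logstash-* pattern
                hosts => ["localhost:9200"]
                index => "logstash-syslog-%{+YYYY.MM.dd}"
        }
}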

I spent some time testing this out in a sandbox and indeed am able to send logs from rsyslog to logstash over TLS using the TCP plugin. FWIW the config that worked in testing (with self-signed certs) was:

input {
        tcp {
                port => 16514
                mode => "server"
                ssl_enable => true
                ssl_cert => "/etc/logstash/cert.pem"
                ssl_key => "/etc/logstash/key.pem"
                ssl_verify => false
                add_field => {
                        "transport" => "tcp-tls"
                }
        }
}
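For reference, the rsyslog client side might look roughly like the sketch below. This is not the exact config used in the sandbox; the target host, CA path and queue parameters are placeholders, and it also includes the disk-assisted queue settings mentioned earlier for basic local buffering:

# hypothetical rsyslog client config for forwarding syslog over TCP/TLS
# (target host, CA path and queue parameters are placeholders)

# use the GnuTLS netstream driver for outbound TLS
global(
        DefaultNetstreamDriver="gtls"
        DefaultNetstreamDriverCAFile="/etc/rsyslog/ca.pem"
)

# forward everything over TLS; the disk-assisted queue spools messages
# to disk and retries if the collector is unreachable
action(
        type="omfwd"
        target="logstash1007.eqiad.wmnet"
        port="16514"
        protocol="tcp"
        StreamDriver="gtls"
        StreamDriverMode="1"
        StreamDriverAuthMode="anon"
        queue.type="LinkedList"
        queue.filename="fwd_logstash"
        queue.saveOnShutdown="on"
        action.resumeRetryCount="-1"
)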

Also had luck with RELP over TLS. We have options!

input {
        relp {
                port => 16515
                ssl_enable => true
                ssl_cert => "/etc/logstash/cert.pem"
                ssl_key => "/etc/logstash/key.pem"
                ssl_verify => false
                add_field => {
                        "transport" => "relp-tls"
                }
        }
}
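And a rough sketch of the rsyslog client side for RELP over TLS (again not the exact sandbox config; cert paths and the target are placeholders, and omrelp comes from the rsyslog-relp package):

# hypothetical rsyslog client config for RELP over TLS
# (requires the rsyslog-relp package for omrelp; paths and target are placeholders)
module(load="omrelp")

action(
        type="omrelp"
        target="logstash1007.eqiad.wmnet"
        port="16515"
        tls="on"
        tls.caCert="/etc/rsyslog/ca.pem"
        tls.myCert="/etc/rsyslog/cert.pem"
        tls.myPrivKey="/etc/rsyslog/key.pem"
)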

According to the logstash RELP input module doc: "This protocol implements application-level acknowledgements to help protect against message loss. Message acks only function as far as messages being put into the queue for filters; anything lost after that point will not be retransmitted"

In theory this has some benefit over plain TCP and seems worth a shot. Of course there are more dependencies as well: it requires a logstash input plugin on the server, and the rsyslog-relp package installed on clients.

Doubtful, but I wonder if RELP would make any difference in T136312?

  • Capacity - I chatted with @Gehel at the last ops friday hangout about ELK and friends; it would be nice to get our feet wet with multiple indices instead of one single index. Syslog might be a good occasion for that. In this context I'm saying "one index", but it is really one index per day, prefixed e.g. with syslog.

Sounds good to me! Something like logstash-syslog-DATE should help keep things expecting logstash-* working as-is.

I initially thought of going for a prefix other than logstash, but indeed using logstash-syslog-DATE will already DTRT. I can't see any obvious disadvantage to that at the moment, so +1

I spent some time testing this out in a sandbox and indeed am able to send logs from rsyslog to logstash over TLS using the TCP plugin. FWIW the config that worked in testing (with self-signed certs) was:

Good to know that's a viable option!

Also had luck with RELP over TLS. We have options!

input {
        relp {
                port => 16515
                ssl_enable => true
                ssl_cert => "/etc/logstash/cert.pem"
                ssl_key => "/etc/logstash/key.pem"
                ssl_verify => false
                add_field => {
                        "transport" => "relp-tls"
                }
        }
}

According to the logstash RELP input module doc: "This protocol implements application-level acknowledgements to help protect against message loss. Message acks only function as far as messages being put into the queue for filters; anything lost after that point will not be retransmitted"

In theory this has some benefit over plain TCP and seems worth a shot. Of course there are more dependencies as well: it requires a logstash input plugin on the server, and the rsyslog-relp package installed on clients.

Agreed that'd be nice to have and we should try it out eventually. Given our requirements (i.e. missed syslogs are not a critical event) and the additional dependencies/complexity on both server and clients, I believe syslog-tls is a better bet to get started with, at least.

Doubtful, but I wonder if RELP would make any difference in T136312?

I haven't checked, but yeah, it is doubtful it'll help (upstream also apparently fixed the bug in rsyslog's next release)

Given our requirements (i.e. missed syslogs are not a critical event) and the additional dependencies/complexity on both server and clients, I believe syslog-tls is a better bet to get started with, at least.

Sounds like a plan! I'll get started on patches for this.

Change 431830 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: add tcp tls input for syslogs

https://gerrit.wikimedia.org/r/431830

Change 431860 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] ELK: change elasticsearch index prefix to logstash-syslog for syslog type

https://gerrit.wikimedia.org/r/431860

Great to see this moving! Starting experimenting with multiple indices is a great idea! And syslog is probably sufficiently simple to be a good starting point.

A few questions / comments:

  • We already have syslog messages coming in. We should probably ensure that moving those to different indices is as transparent as possible. My understanding is that kibana will completely abstract this, but I have not tested it.
  • Having overlapping prefixes (logstash- and logstash-syslog-) will cause some issues, at least with the curator scripts, which rely on a prefix to identify the indices to work on. We could use more complex selection in curator, or ensure we don't have overlapping prefixes (I tend to prefer the no-overlap version; it's much easier to read a prefix than a regex).
  • We already have syslog messages coming in. We should probably ensure that moving those to different indices is as transparent as possible. My understanding is that kibana will completely abstract this, but I have not tested it.

Afaict yes, as long as we continue matching the currently configured index pattern of logstash-* it will be transparent. It may increase search time slightly since Kibana will have more indices to search across for a given time period, and would add some overhead to ES in terms of state, replication, etc.

  • Having overlapping prefixes (logstash- and logstash-syslog-) will cause some issues, at least with the curator scripts, which rely on a prefix to identify the indices to work on. We could use more complex selection in curator, or ensure we don't have overlapping prefixes (I tend to prefer the no-overlap version; it's much easier to read a prefix than a regex).

The approach in https://gerrit.wikimedia.org/r/#/c/431860/3/modules/role/manifests/logstash/collector.pp should address this by requiring just one logstash::output::elasticsearch and handling index name selection in the filter section. This would avoid deploying multiple curator instances with overlapping prefixes.
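Conceptually, the filter-based selection looks something like the sketch below. This is not the literal patch; the metadata field name is made up for illustration and the hosts value is a placeholder:

filter {
        # route syslog events to their own index prefix, everything else to the default
        if [type] == "syslog" {
                mutate { add_field => { "[@metadata][index_prefix]" => "logstash-syslog" } }
        } else {
                mutate { add_field => { "[@metadata][index_prefix]" => "logstash" } }
        }
}

output {
        # a single elasticsearch output; the index name comes from the metadata field
        elasticsearch {
                hosts => ["localhost:9200"]
                index => "%{[@metadata][index_prefix]}-%{+YYYY.MM.dd}"
        }
}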

We could in addition change the default index name prefix from logstash- to something like logstash-misc- to avoid overlap altogether. This might also be helpful down the road if we wanted to apply different curator policies by pattern. In that case we would probably also need a toggle for automatic creation of curator configs in logstash::output::elasticsearch.
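If we went down that road, per-pattern retention could be expressed in curator with plain prefix filters. A generic sketch (not our puppetized curator config; curator 5 action-file format assumed, and the 90-day value is just the privacy-policy maximum mentioned earlier):

actions:
  1:
    action: delete_indices
    description: "prune logstash-syslog- indices older than 90 days"
    options:
      ignore_empty_list: True
    filters:
      - filtertype: pattern
        kind: prefix
        value: logstash-syslog-
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 90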

Change 434719 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] puppet-agent: remove --show_diff from scheduled puppet-run script

https://gerrit.wikimedia.org/r/434719

Change 431860 merged by Herron:
[operations/puppet@production] ELK: change elasticsearch index prefix to logstash-syslog for syslog type

https://gerrit.wikimedia.org/r/431860

The updated logstash-syslog prefix is looking good. Beginning to see results in Kibana with the new prefix:

Screen Shot 2018-05-30 at 10.27.00 AM.png (778×584 px, 66 KB)

Change 431830 merged by Herron:
[operations/puppet@production] logstash: add tcp tls input for syslogs

https://gerrit.wikimedia.org/r/431830

Change 436821 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: set exposed puppet cert ownerships to logstash:logstash

https://gerrit.wikimedia.org/r/436821

Change 436821 merged by Herron:
[operations/puppet@production] logstash: set exposed puppet cert ownerships to logstash:logstash

https://gerrit.wikimedia.org/r/436821

Mentioned in SAL (#wikimedia-operations) [2018-06-01T16:13:18Z] <herron> enabled new logstash tcp input with TLS enabled for syslogs on port 16514 T193766

For some reason the new icinga check for this, called "Logstash syslog TLS listener on port 16514", is erroring with:

$ /usr/lib/nagios/plugins/check_ssl -H logstash1007.eqiad.wmnet -p 16514
SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed because of handshake problems

But testing with openssl s_client seems to work:

$ openssl s_client -connect logstash1007.eqiad.wmnet:16514 -tls1_2
CONNECTED(00000003)
depth=1 CN = Puppet CA: palladium.eqiad.wmnet
verify return:1
depth=0 CN = logstash1007.eqiad.wmnet
verify return:1
140376343107216:error:1409E0E5:SSL routines:ssl3_write_bytes:ssl handshake failure:s3_pkt.c:659:

I'll disable these checks for now while troubleshooting.

Change 436837 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] icinga: remove Logstash syslog TLS listener check for troubleshooting

https://gerrit.wikimedia.org/r/436837

Change 436837 merged by Herron:
[operations/puppet@production] icinga: remove Logstash syslog TLS listener check for troubleshooting

https://gerrit.wikimedia.org/r/436837

Change #434719 abandoned by Herron:

[operations/puppet@production] puppet-agent: remove --show_diff from scheduled puppet-run script

Reason:

spring cleaning -- stale patch

https://gerrit.wikimedia.org/r/434719