
Document path forward and Retire remaining non-Kafka Logstash inputs
Closed, ResolvedPublic

Event Timeline

According to the logstash input type distributions graph, we're down to elasticsearch via gelf for non-kafka inputs.

Screen Shot 2021-09-08 at 12.50.16 PM.png (377×793 px, 33 KB)

There is some background and discussion about this migration in T225125. The TL;DR, as I understand it, is that JSON-formatted elasticsearch logging over syslog has been prepared but is currently switched off, because a dependency is blocked on the upgrade to ES7.

I'm tempted to try shimming these using an rsyslog listener that emulates gelf and routes these logs to the kafka logging pipeline until the longer-term/upgraded elastic config is in place.

Spent a chunk of time experimenting with this yesterday in deployment-prep, and unfortunately I don't think rsyslog specifically will do the trick.

The current logging configuration in elasticsearch is using logstash-gelf[1], and ships gelf formatted logs over udp directly to the logstash lvs. Part of the gelf protocol is compression and chunking of logs, and while I was able to ingest the gelf udp via rsyslog I have not had any success decompressing/parsing them. In theory logstash-gelf supports tcp transport which disables compression (and newer logstash-gelf versions even support kafka) but in my testing switching to tcp resulted in no logs arriving at all.
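
For reference, here is a minimal Python sketch of the decompression and chunk reassembly that a GELF UDP receiver has to do. It follows the published GELF framing as I understand it (a chunk header of magic bytes 0x1e 0x0f, an 8-byte message id, a sequence number and a sequence count, with the reassembled payload optionally gzip or zlib compressed). The class and function names are hypothetical, and this is not what rsyslog or logstash run internally.

```python
from __future__ import annotations

import gzip
import json
import zlib
from typing import Optional

CHUNK_MAGIC = b"\x1e\x0f"  # marks a chunked GELF datagram
GZIP_MAGIC = b"\x1f\x8b"   # first two bytes of a gzip stream
CHUNK_HEADER_LEN = 12      # 2 magic + 8 message id + 1 seq number + 1 seq count


def decompress(payload: bytes) -> bytes:
    """Undo GELF compression; uncompressed payloads pass through unchanged."""
    if payload[:2] == GZIP_MAGIC:
        return gzip.decompress(payload)
    if payload[:1] == b"\x78":  # usual first byte of a zlib stream
        return zlib.decompress(payload)
    return payload


class GelfReassembler:
    """Hypothetical helper: collects chunks per message id until complete."""

    def __init__(self) -> None:
        self._pending: dict[bytes, dict[int, bytes]] = {}

    def feed(self, datagram: bytes) -> Optional[dict]:
        """Return the decoded event, or None while chunks are still missing."""
        if datagram[:2] != CHUNK_MAGIC:
            # Unchunked message: the whole datagram is the payload.
            return json.loads(decompress(datagram))
        msg_id = datagram[2:10]
        seq, total = datagram[10], datagram[11]
        parts = self._pending.setdefault(msg_id, {})
        parts[seq] = datagram[CHUNK_HEADER_LEN:]
        if len(parts) < total:
            return None
        del self._pending[msg_id]
        # Compression applies to the whole message, so join before inflating.
        payload = b"".join(parts[i] for i in range(total))
        return json.loads(decompress(payload))
```

A real receiver would also need to expire incomplete messages and reject malformed sequence numbers, which this sketch leaves out.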

So I think it's time to look for alternatives. One alternative that looks promising at first glance is logagent. This is an apache 2 licensed log shipping agent from sematext, which supports GELF input[2] and output to kafka[3] (among others). In theory we could run this as a daemon on the elastic hosts, and configure elasticsearch to output udp gelf to the logagent on localhost, which would relay directly to kafka or the local syslog. This still needs testing and validation, but it appears to be an option so far.

[1] https://github.com/mp911de/logstash-gelf
[2] https://sematext.com/docs/logagent/input-plugin-gelf/
[3] https://sematext.com/docs/logagent/output-plugin-kafka/

Change 720110 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] wip: logagent: puppet module sketch

https://gerrit.wikimedia.org/r/720110

So far so good testing logagent. Confirmed that it can indeed ingest/parse udp GELF logs from our elasticsearch logstash-gelf config and output them json formatted to stdout. By wrapping this config in a systemd unit we should be able to pick up these logs with rsyslog and send them onward to kafka logging.

I've put together an initial sketch of the puppetization, part of which raises the question -- what is the appropriate way to install npm packages like logagent and graygelf on production hosts?

After exploring the NPM approach a bit on https://gerrit.wikimedia.org/r/c/operations/puppet/+/720110/ it's clear that we would be better off looking for an alternate tool written in another language, with less convoluted dependencies, which is easier to audit and maintain in the long term.

An alternate approach that comes to mind is deploying logstash instances to GELF shipping hosts locally in a minimal configuration as a GELF to syslog agent. It sounds odd, because at face value transitioning from a central logstash cluster supporting GELF input to local logstash instances with GELF input doesn't seem like much of an improvement. But architecturally it would benefit us significantly in that we could retire the non-kafka logstash cluster, retire the associated LVS balancers, retire the elk5 configs and stop sending udp logs across the network. I'll try prototyping a minimal logstash agent config and see what that could look like.
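
To illustrate the shape of such a local relay, here is a rough Python sketch of the idea only: listen for GELF on localhost UDP, decode each datagram, and forward it as a JSON line over UDP to a local listener. This is not the logstash configuration itself. The function names and the destination port 11514 are assumptions made for illustration, 12201 is the conventional GELF port, and chunked messages are deliberately not handled.

```python
from __future__ import annotations

import json
import socket
import zlib


def gelf_to_json(datagram: bytes) -> bytes:
    """Decode one unchunked GELF datagram into a newline-terminated JSON line."""
    if datagram[:2] == b"\x1e\x0f":
        raise ValueError("chunked GELF is not handled in this sketch")
    if datagram[:1] in (b"\x1f", b"\x78"):
        # wbits=47 (32 + 15) lets zlib auto-detect a gzip or zlib header.
        datagram = zlib.decompress(datagram, wbits=47)
    event = json.loads(datagram)
    return json.dumps(event, sort_keys=True).encode() + b"\n"


def serve(listen_port: int = 12201,
          dest: tuple[str, int] = ("127.0.0.1", 11514)) -> None:
    """Relay loop; the destination address is a hypothetical local listener."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    listener.bind(("127.0.0.1", listen_port))  # localhost only, no cross-network udp
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        datagram, _ = listener.recvfrom(65535)
        try:
            sender.sendto(gelf_to_json(datagram), dest)
        except ValueError:
            continue  # drop anything that does not decode (zlib.error is not caught here)
```

Binding to 127.0.0.1 is the point of the design: the GELF traffic never leaves the host, and only the relayed JSON enters the existing pipeline.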

Change 721345 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] profile::logstash::gelf_relay: ingest GELF logs and output as JSON over UDP

https://gerrit.wikimedia.org/r/721345

herron renamed this task from Document path forward for how to Retire all non-Kafka Logstash inputs to Document path forward and Retire remaining non-Kafka Logstash inputs. Oct 1 2021, 4:05 PM
herron moved this task from FY2021/2022-Q1 to FY2021/2022-Q2 on the SRE Observability board.
herron changed the task status from Open to In Progress. Oct 7 2021, 5:19 PM
herron triaged this task as Medium priority.
herron moved this task from Inbox to In progress on the SRE Observability (FY2021/2022-Q2) board.

Change 721345 merged by Herron:

[operations/puppet@production] profile::logstash::gelf_relay: ingest GELF logs and output as JSON over UDP

https://gerrit.wikimedia.org/r/721345

Change 721364 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] add logstash gelf relay to elastic1049

https://gerrit.wikimedia.org/r/721364

Change 721364 merged by Herron:

[operations/puppet@production] add logstash gelf relay to elastic1049

https://gerrit.wikimedia.org/r/721364

Mentioned in SAL (#wikimedia-operations) [2021-11-04T17:47:29Z] <ryankemper> T288620 [Elastic] Rebooting elastic1049.eqiad.wmnet to uptake new gelf settings change

Change 736859 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] role::elasticsearch::cirrus: ship ES logs via gelf_relay

https://gerrit.wikimedia.org/r/736859

Change 736865 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash: switch monitoring API to port 9675

https://gerrit.wikimedia.org/r/736865

Change 736865 merged by Herron:

[operations/puppet@production] logstash: switch monitoring API to port 9675

https://gerrit.wikimedia.org/r/736865

Change 736872 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash_exporter: add service notify to defaults file

https://gerrit.wikimedia.org/r/736872

Change 736872 merged by Herron:

[operations/puppet@production] logstash_exporter: add service notify to defaults file

https://gerrit.wikimedia.org/r/736872

Change 736859 merged by Herron:

[operations/puppet@production] role::elasticsearch::cirrus: ship ES logs via gelf_relay

https://gerrit.wikimedia.org/r/736859

Change 739324 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] role::elasticsearch::cloudelastic: ship ES logs via gelf_relay

https://gerrit.wikimedia.org/r/739324

Change 739325 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] role::elasticsearch::relforge: ship ES logs via gelf_relay

https://gerrit.wikimedia.org/r/739325

Change 740191 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash::gelf::input: remove hardcoded tags

https://gerrit.wikimedia.org/r/740191

Change 739324 merged by Herron:

[operations/puppet@production] role::elasticsearch::cloudelastic: ship ES logs via gelf_relay

https://gerrit.wikimedia.org/r/739324

Change 739325 merged by Herron:

[operations/puppet@production] role::elasticsearch::relforge: ship ES logs via gelf_relay

https://gerrit.wikimedia.org/r/739325

Change 740191 merged by Herron:

[operations/puppet@production] logstash::input::gelf: remove hardcoded tags

https://gerrit.wikimedia.org/r/740191

lmata raised the priority of this task from Medium to High. Dec 2 2021, 4:21 PM

Change 743257 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] striker: send logs to logstash pipeline via local rsyslog

https://gerrit.wikimedia.org/r/743257

Change 743257 abandoned by Herron:

[operations/puppet@production] striker: send logs to logstash pipeline via local rsyslog

Reason:

not necessary

https://gerrit.wikimedia.org/r/743257

Change 743261 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] striker: switch cloudweb dev to cee logging handler

https://gerrit.wikimedia.org/r/743261

Change 743261 merged by Herron:

[operations/puppet@production] striker: switch cloudweb dev to cee logging handler

https://gerrit.wikimedia.org/r/743261

No logs have arrived over deprecated logstash inputs in the past 4 days. Boldly resolving this!

Change 720110 abandoned by Herron:

[operations/puppet@production] wip: logagent: puppet module sketch

Reason:

with another route, see bug

https://gerrit.wikimedia.org/r/720110