Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally.
Closed, ResolvedPublic

Description

We should totally look into getting Elasticsearch logs into Logstash for persistence: you can't analyze a failed or failing node's logs after it has failed. This matters especially because we run RAID0 on the elastic nodes in production, which makes losing this data not just possible but likely.

Event Timeline

chasemp raised the priority of this task to Medium.
chasemp updated the task description.

Are we talking only about the search ES cluster, and not the ES cluster backing Logstash? I would not want a feedback loop auto-generating traffic on ES.

It seems that we have some integration with Logstash directly at the logging framework level. I wonder if this is a good idea for a heavy-load service. Asynchronous logging in log4j is possible, but it will add memory load if throughput to Logstash is maxed out, and we will probably want to lose messages under load rather than slow ES down. I have not seen lumberjack or similar in use yet (but have not been looking very hard); it might be a better idea to queue logging messages on disk. Has there already been similar reflection or experience at WMF?
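To make the "lose messages under load instead of slowing ES down" trade-off concrete, here is a minimal sketch using Python's standard logging module rather than log4j. The class name `DroppingQueueHandler` and the capacity of 2 are made up for illustration; this only shows the idea of a bounded, non-blocking queue, not how log4j's asynchronous appender is implemented.

```python
import logging
import queue


class DroppingQueueHandler(logging.Handler):
    """Hypothetical handler: enqueue records without blocking, drop when full."""

    def __init__(self, capacity):
        super().__init__()
        # A bounded queue caps the memory used while the shipper is slow.
        self.records = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def emit(self, record):
        try:
            self.records.put_nowait(record)
        except queue.Full:
            # The consumer is not keeping up: lose the message rather than
            # block the application thread or grow memory without bound.
            self.dropped += 1


logger = logging.getLogger("dropping-demo")
logger.setLevel(logging.INFO)
logger.propagate = False
handler = DroppingQueueHandler(capacity=2)
logger.addHandler(handler)

# Nothing drains the queue here, so only the first two records are kept.
for i in range(5):
    logger.info("message %d", i)
```

A separate thread would normally drain `handler.records` and ship the records; counting the drops at least makes the loss visible.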

MediaWiki uses the syslog protocol over UDP port 10514 to ship log events to Logstash to avoid blocking. Most node services use GELF over UDP port 12201. In general I would recommend UDP at the transport layer for sending anything to Logstash.
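For readers unfamiliar with GELF, a rough sketch of what "GELF over UDP" amounts to, in Python. The helper names `build_gelf_packet` and `send_gelf` are invented for this example, the field set is a minimal GELF 1.1 message as I understand the format, and chunking of large messages is not handled.

```python
import json
import socket
import time
import zlib


def build_gelf_packet(host, short_message, facility, level=6):
    """Return a zlib-compressed GELF 1.1 payload as bytes (illustrative helper)."""
    message = {
        "version": "1.1",
        "host": host,
        "short_message": short_message,
        "timestamp": time.time(),
        "level": level,  # syslog severity, 6 = informational
        "facility": facility,  # assumption: still sent as a top-level field
    }
    return zlib.compress(json.dumps(message).encode("utf-8"))


def send_gelf(packet, address):
    """Fire and forget: a UDP send does not wait for the receiver."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, address)


# Usage, with a hypothetical Logstash host:
# send_gelf(build_gelf_packet("elastic1001", "node started", "elasticsearch"),
#           ("logstash.example.org", 12201))
```

Because UDP has no acknowledgement, a slow or unavailable Logstash costs the sender nothing beyond the lost datagram, which is the non-blocking property described above.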

I was looking for other Java apps at WMF and found Cassandra, which seems to use the logback LogstashSocketAppender. I see there is a GELF logger for log4j, so that might be usable. I have no experience with it and would need to test it.

Elastic main logs are not very verbose:

-rw-r--r-- 1 elasticsearch elasticsearch 604K Feb  6 13:10 production-search-eqiad.log
-rw-r--r-- 1 elasticsearch elasticsearch  60M Feb  6 06:25 production-search-eqiad.log.1
-rw-r--r-- 1 elasticsearch elasticsearch 3.0M Jan 10 06:25 production-search-eqiad.log.2
-rw-r--r-- 1 elasticsearch elasticsearch 7.1M Feb  5 06:25 production-search-eqiad.log.2.gz
-rw-r--r-- 1 elasticsearch elasticsearch 6.0M Jan  9 06:25 production-search-eqiad.log.3
-rw-r--r-- 1 elasticsearch elasticsearch 5.3M Feb  4 06:26 production-search-eqiad.log.3.gz
-rw-r--r-- 1 elasticsearch elasticsearch 5.2M Jan  8 06:25 production-search-eqiad.log.4
-rw-r--r-- 1 elasticsearch elasticsearch 227K Feb  3 06:26 production-search-eqiad.log.4.gz
-rw-r--r-- 1 elasticsearch elasticsearch 3.3M Jan  7 06:25 production-search-eqiad.log.5
-rw-r--r-- 1 elasticsearch elasticsearch  52K Feb  2 06:26 production-search-eqiad.log.5.gz
-rw-r--r-- 1 elasticsearch elasticsearch 3.4M Jan  6 06:25 production-search-eqiad.log.6
-rw-r--r-- 1 elasticsearch elasticsearch  40K Feb  1 06:25 production-search-eqiad.log.6.gz
-rw-r--r-- 1 elasticsearch elasticsearch 6.7M Jan  5 06:26 production-search-eqiad.log.7
-rw-r--r-- 1 elasticsearch elasticsearch  72K Jan 31 06:25 production-search-eqiad.log.7.gz

(looks like we need to delete these old uncompressed logs?)

I don't have a strong opinion on the method. Ideally I'd prefer not to include extra deps, but I'm OK with it if it's the preferred method.
In order to ship data to logstash do we have to add logstash support at the service level, or can we use the regular log files and ship them to logstash with tools like filebeat?

Using an external shipper is reasonable. There's an aging task to select a standard shipper (T97297: Select a standard log shipping solution to use with applications that cannot be configured to send log events directly to Logstash and/or fluorine) that could use investigation and discussion with TechOps.

I'm not a big fan of serializing logs to disk to re-parse them right away. I'd much prefer to send logs directly from log4j to logstash. But that would require an additional external dependency.

I would use the following criteria (in order of priority):

  1. isolation of the application from the logging backend: there should be no impact on Elasticsearch if logstash slows down, is unavailable, ...
  2. all structured logging information sent to logstash in a structured format: this includes things like MDC
  3. changes to information available in logs should not require reconfiguration of logstash shipping

x) extra dependencies should not be added: not sure where to put that in terms of priority; personally I don't have much of an issue with adding dependencies as long as we have a good way to manage them

Change 269100 had a related patch set uploaded (by Gehel):
Ship Elasticsearch logs to logstash

https://gerrit.wikimedia.org/r/269100

In order to ship data to logstash do we have to add logstash support at the service level or can we use the regular log files and ship them to logstash with tools like filebeat?

Looking through operations/puppet repository, I see no reference to filebeat. Seems that we need to resolve T97297 first if we want to use an external log shipper (which might not be a bad idea).

filebeat replaces logstash-forwarder (mentioned in T97297); unfortunately it also uses the Go runtime...
So I'm not sure what to suggest...
You could maybe continue with the gelf4j approach?

Continuing with the gelfj approach.

I learned a few things from discussing this with @Ottomata:

  • the repository for jar packages is Archiva
  • .jar files are usually deployed via git-fat / trebuchet
  • in the context of gelfj and this task, it might make sense to create a .deb package instead (not clear to me yet)
  • Elasticsearch logging will be configured from puppet, so management of dependencies (gelfj.jar) should be done in the same place

Oh! If you are looking specifically for logstash +gelf, we have this already.

http://apt.wikimedia.org/wikimedia/pool/main/l/logstash-gelf/

The logstash-gelf package exists and plays the same role as gelfj, so let's use it. A package for Trusty does not seem to exist, so the next task is to build it.

Change 269656 had a related patch set uploaded (by Gehel):
Archiva now uses HTTPS

https://gerrit.wikimedia.org/r/269656

Change 269656 merged by Ottomata:
Rebuild logstash-gelf for Ubuntu Trusty

https://gerrit.wikimedia.org/r/269656

This is deployed on labs (deployment-elastic0[5678].deployment-prep.eqiad.wmflabs). Results are visible on logstash-beta (look for type:gelf).

I'd still like to have feedback from @bd808 (or anyone else who knows what they are doing) before deploying to prod, to make sure logs are categorized as expected (is there any info that we really want to have? Or really do not want to have?).

deployment-elastic06 seems to tag its events with type "gelf" while deployment-elastic0[78] use type "logstash-gelf". Ideally we would configure the Elasticsearch side of this communication to consistently tag with a type of "elasticsearch" or something similar for ease of grouping. The 'type' in Logstash/Kibana is taken from the 'facility' in the original GELF packet. We can always add rules to filter-gelf.conf on the Logstash side to fix up various things as well.

So the Gotcha section on the logstash page no longer applies? I'll check why the values are not aligned (there is no reason that they should be different) and update it with "elasticsearch".

On the application side you should set 'facility' rather than 'type' or '_type'. The Logstash rules will copy that value over the 'type' value before storing the records in Elasticsearch. I'll put some clarification on that page.
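As a rough model of that rule, written in Python rather than Logstash configuration: the function name `apply_type_from_facility` is made up, and this is not the content of filter-gelf.conf, only the behaviour described above (a 'facility' value, when present, replaces whatever 'type' the event carried).

```python
def apply_type_from_facility(event):
    """Return a copy of the event with 'type' overwritten by 'facility'.

    Hypothetical model of the Logstash-side rule; events without a
    'facility' field are returned unchanged.
    """
    result = dict(event)
    if result.get("facility"):
        result["type"] = result["facility"]
    return result


# The sender sets 'facility'; any 'type' it set is overwritten on ingestion.
incoming = {"type": "gelf", "facility": "elasticsearch", "message": "started"}
stored = apply_type_from_facility(incoming)
```

This is why setting 'type' or '_type' on the application side has no lasting effect: only 'facility' survives into the stored record's 'type'.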

Code is ready to merge and deploy, but we'll wait until T109101 to do only one cluster restart.

Change 269100 merged by Gehel:
Ship Elasticsearch logs to logstash

https://gerrit.wikimedia.org/r/269100

Stashbot subscribed.

Mentioned in SAL [2016-02-29T12:16:43Z] <gehel> elastic2001.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-02-29T14:16:40Z] <gehel> elastic2001.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-02-29T18:01:34Z] <gehel> elastic2002.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-02-29T19:22:20Z] <gehel> elastic2003.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-02-29T20:21:34Z] <gehel> elastic2004.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-03-01T09:46:29Z] <gehel> elastic2016.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-03-01T10:42:38Z] <gehel> elastic2017.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-03-01T11:40:13Z] <gehel> elastic2018.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-03-01T12:37:30Z] <gehel> elastic2019.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-03-01T13:53:43Z] <gehel> elastic2020.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-03-01T14:34:03Z] <gehel> elastic2021.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-03-01T15:45:35Z] <gehel> elastic2022.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-03-01T16:40:43Z] <gehel> elastic2023.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-03-01T17:52:14Z] <gehel> elastic2024.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-03-01T21:29:36Z] <gehel> elastic1001.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-03-02T08:48:30Z] <gehel> elastic1003.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-03-02T10:16:00Z] <gehel> elastic1004.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-03-02T13:23:33Z] <gehel> elastic1005.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-03-02T14:32:25Z] <gehel> elastic1006.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-03-02T15:15:34Z] <gehel> elastic1007.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-03-02T16:34:08Z] <gehel> elastic1008.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-03-02T17:20:58Z] <gehel> elastic1009.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-03-02T18:58:46Z] <gehel> elastic1010.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-03-02T20:35:46Z] <gehel> elastic1011.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-03-03T10:07:48Z] <gehel> elastic1022.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-03-03T12:02:06Z] <gehel> elastic1023.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-03-03T12:48:56Z] <gehel> elastic1024.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-03-03T13:47:18Z] <gehel> elastic1025.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-03-03T14:34:38Z] <gehel> elastic1026.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-03-03T15:42:43Z] <gehel> elastic1027.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-03-03T16:25:50Z] <gehel> elastic1028.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-03-03T17:09:55Z] <gehel> elastic1029.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-03-03T18:41:56Z] <gehel> elastic1030.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)

Mentioned in SAL [2016-03-03T19:44:23Z] <gehel> elastic1031.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)