We should look into shipping Elasticsearch logs to Logstash for persistence (you can't analyze a failed or failing node's logs after the node is gone). This is especially important because our production Elasticsearch nodes use RAID0, which makes losing this data not just possible but likely.
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Gehel | T109089 EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade)
Resolved | | Gehel | T109101 Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally.
Event Timeline
Are we talking only about the search ES cluster, not the ES cluster backing Logstash? I would not want a feedback loop auto-generating traffic on ES.
It seems we have some integration with Logstash directly at the logging framework level. I wonder whether that is a good idea for a heavily loaded service. Asynchronous logging in log4j is possible, but it will add memory pressure if throughput to Logstash is maxed out, and we will probably want to lose messages under load rather than slow ES down. I have not yet seen Lumberjack or similar in use (though I have not looked very hard), but it might be a better idea to queue log messages on disk. Has there already been similar reflection / experience at WMF?
MediaWiki uses the syslog protocol over UDP port 10514 to ship log events to Logstash to avoid blocking. Most node services use GELF over UDP port 12201. In general I would recommend UDP at the transport layer for sending anything to Logstash.
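To illustrate the fire-and-forget transport being recommended here, a minimal sketch of building and sending a GELF 1.1 message over UDP. The field names (`version`, `host`, `short_message`, `level`, underscore-prefixed extras) follow the GELF spec; the hostname, port, and helper names are illustrative, not anything deployed at WMF:

```python
import gzip
import json
import socket

def make_gelf_payload(host, short_message, level=6, **extra):
    """Build a gzip-compressed GELF 1.1 payload.

    Per the GELF spec, additional fields must be prefixed with '_'.
    level uses syslog severities (6 = informational).
    """
    record = {
        "version": "1.1",
        "host": host,
        "short_message": short_message,
        "level": level,
    }
    record.update({"_" + key: value for key, value in extra.items()})
    return gzip.compress(json.dumps(record).encode("utf-8"))

def send_gelf(payload, address=("logstash.example.org", 12201)):
    """Send over UDP: if Logstash is slow or down, the sender never blocks."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, address)
    finally:
        sock.close()

payload = make_gelf_payload("elastic1001", "shard recovery started",
                            facility="elasticsearch")
```

The UDP datagram can be silently dropped under load, which is exactly the trade-off discussed above: losing a log line is preferable to back-pressure on the service.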
I was looking at other Java apps at WMF and found Cassandra, which seems to use logback's LogstashSocketAppender. I see there is a GELF logger for log4j, so that might be usable. I have no experience with it and need to test.
Elasticsearch's main logs are not very verbose:

```
-rw-r--r-- 1 elasticsearch elasticsearch 604K Feb  6 13:10 production-search-eqiad.log
-rw-r--r-- 1 elasticsearch elasticsearch  60M Feb  6 06:25 production-search-eqiad.log.1
-rw-r--r-- 1 elasticsearch elasticsearch 3.0M Jan 10 06:25 production-search-eqiad.log.2
-rw-r--r-- 1 elasticsearch elasticsearch 7.1M Feb  5 06:25 production-search-eqiad.log.2.gz
-rw-r--r-- 1 elasticsearch elasticsearch 6.0M Jan  9 06:25 production-search-eqiad.log.3
-rw-r--r-- 1 elasticsearch elasticsearch 5.3M Feb  4 06:26 production-search-eqiad.log.3.gz
-rw-r--r-- 1 elasticsearch elasticsearch 5.2M Jan  8 06:25 production-search-eqiad.log.4
-rw-r--r-- 1 elasticsearch elasticsearch 227K Feb  3 06:26 production-search-eqiad.log.4.gz
-rw-r--r-- 1 elasticsearch elasticsearch 3.3M Jan  7 06:25 production-search-eqiad.log.5
-rw-r--r-- 1 elasticsearch elasticsearch  52K Feb  2 06:26 production-search-eqiad.log.5.gz
-rw-r--r-- 1 elasticsearch elasticsearch 3.4M Jan  6 06:25 production-search-eqiad.log.6
-rw-r--r-- 1 elasticsearch elasticsearch  40K Feb  1 06:25 production-search-eqiad.log.6.gz
-rw-r--r-- 1 elasticsearch elasticsearch 6.7M Jan  5 06:26 production-search-eqiad.log.7
-rw-r--r-- 1 elasticsearch elasticsearch  72K Jan 31 06:25 production-search-eqiad.log.7.gz
```
(Looks like we need to delete these old uncompressed logs?)
I don't have a strong opinion on the method. Ideally I'd prefer not to add extra dependencies, but I'm OK with it if that's the preferred approach.
In order to ship data to Logstash, do we have to add Logstash support at the service level, or can we use the regular log files and ship them to Logstash with a tool like Filebeat?
Using an external shipper is reasonable. There's an aging task to select a standard shipper (T97297: Select a standard log shipping solution to use with applications that cannot be configured to send log events directly to Logstash and/or fluorine) that could use investigation and discussion with TechOps.
I'm not a big fan of serializing logs to disk only to re-parse them right away. I'd much prefer to send logs directly from log4j to Logstash, but that would require an additional external dependency.
I would use the following criteria (in order of priority):
- isolation of the application from the logging backend: there should be no impact on Elasticsearch if Logstash slows down, is unavailable, etc.
- all structured logging information sent to Logstash in a structured format: this includes things like the MDC
- changes to the information available in logs should not require reconfiguration of the Logstash shipping
- no extra dependencies: I'm not sure where to place this in terms of priority; personally I don't have much of an issue with adding dependencies as long as we have a good way to manage them
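For reference, a direct log4j-to-Logstash setup along the lines discussed (the gelfj-style appender) might look roughly like the sketch below. The class and property names follow gelfj's documented log4j appender; the host, port, and facility values are illustrative only:

```
log4j.rootLogger=INFO, gelf

log4j.appender.gelf=org.graylog2.log.GelfAppender
log4j.appender.gelf.graylogHost=logstash.example.org
log4j.appender.gelf.graylogPort=12201
log4j.appender.gelf.facility=elasticsearch
log4j.appender.gelf.extractStacktrace=true
log4j.appender.gelf.addExtendedInformation=true
```

Because the appender sends GELF over UDP, it satisfies the first criterion above: a slow or unavailable Logstash drops packets instead of blocking Elasticsearch.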
Change 269100 had a related patch set uploaded (by Gehel):
Ship Elasticsearch logs to logstash
Looking through the operations/puppet repository, I see no reference to Filebeat. It seems we need to resolve T97297 first if we want to use an external log shipper (which might not be a bad idea).
Filebeat replaces logstash-forwarder (mentioned in T97297); unfortunately it also uses the Go runtime...
So I'm not sure what to suggest...
You could maybe continue with the gelf4j approach?
Continuing with the gelfj approach.
I understood a few things while discussing with @Ottomata:
- the repository for jar packages is Archiva
- .jar files are usually deployed via git-fat / trebuchet
- in the context of gelfj and this task, it might make sense to create a .deb package instead (not yet clear to me)
- Elasticsearch logging will be configured from puppet, so management of dependencies (gelfj.jar) should be done in the same place
Oh! If you are looking specifically for Logstash + GELF, we have this already:
http://apt.wikimedia.org/wikimedia/pool/main/l/logstash-gelf/
The logstash-gelf package exists and plays the same role as gelfj, so let's use it. A package for trusty does not seem to exist, so the next task is to build one.
This is deployed on labs (deployment-elastic0[5678].deployment-prep.eqiad.wmflabs). Results are visible on logstash-beta (look for type:gelf).
I'd still like to have feedback from @bd808 (or anyone else who knows what they are doing) before deploying to prod, to make sure logs are categorized as expected (is there any info that we really want to have? Or really do not want to have?).
deployment-elastic06 seems to tag its events with type "gelf", while deployment-elastic0[78] use type "logstash-gelf". Ideally we would configure the Elasticsearch side of this communication to consistently tag with a type of "elasticsearch" or something similar, for ease of grouping. The 'type' in Logstash/Kibana is taken from the 'facility' in the original GELF packet. We can always add rules to filter-gelf.conf on the Logstash side to fix up various things as well.
So the Gotcha section on the Logstash page no longer applies? I'll check why the values are not aligned (there is no reason for them to be different) and update it to "elasticsearch".
On the application side you should set 'facility' rather than 'type' or '_type'. The Logstash rules will copy that value over the 'type' value before storing the records in Elasticsearch. I'll put some clarification on that page.
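A sketch of what that facility-to-type copy might look like as a Logstash filter rule (illustrative only; the actual rule lives in filter-gelf.conf and may differ):

```
filter {
  if [type] == "gelf" and [facility] {
    mutate {
      # Store the application-provided facility as the event type,
      # so Kibana groups events by "elasticsearch" rather than "gelf".
      replace => { "type" => "%{facility}" }
    }
  }
}
```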
Code is ready to merge and deploy, but we'll wait for T109101 so that we only do one cluster restart.
Mentioned in SAL [2016-02-29T12:16:43Z] <gehel> elastic2001.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-02-29T14:16:40Z] <gehel> elastic2001.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-02-29T18:01:34Z] <gehel> elastic2002.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-02-29T19:22:20Z] <gehel> elastic2003.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-02-29T20:21:34Z] <gehel> elastic2004.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-03-01T09:46:29Z] <gehel> elastic2016.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-03-01T10:42:38Z] <gehel> elastic2017.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-03-01T11:40:13Z] <gehel> elastic2018.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-03-01T12:37:30Z] <gehel> elastic2019.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-03-01T13:53:43Z] <gehel> elastic2020.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-03-01T14:34:03Z] <gehel> elastic2021.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-03-01T15:45:35Z] <gehel> elastic2022.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-03-01T16:40:43Z] <gehel> elastic2023.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-03-01T17:52:14Z] <gehel> elastic2024.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-03-01T21:29:36Z] <gehel> elastic1001.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-03-02T08:48:30Z] <gehel> elastic1003.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-03-02T10:16:00Z] <gehel> elastic1004.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-03-02T13:23:33Z] <gehel> elastic1005.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-03-02T14:32:25Z] <gehel> elastic1006.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-03-02T15:15:34Z] <gehel> elastic1007.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-03-02T16:34:08Z] <gehel> elastic1008.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-03-02T17:20:58Z] <gehel> elastic1009.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-03-02T18:58:46Z] <gehel> elastic1010.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-03-02T20:35:46Z] <gehel> elastic1011.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-03-03T10:07:48Z] <gehel> elastic1022.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-03-03T12:02:06Z] <gehel> elastic1023.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-03-03T12:48:56Z] <gehel> elastic1024.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-03-03T13:47:18Z] <gehel> elastic1025.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-03-03T14:34:38Z] <gehel> elastic1026.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-03-03T15:42:43Z] <gehel> elastic1027.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-03-03T16:25:50Z] <gehel> elastic1028.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-03-03T17:09:55Z] <gehel> elastic1029.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-03-03T18:41:56Z] <gehel> elastic1030.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
Mentioned in SAL [2016-03-03T19:44:23Z] <gehel> elastic1031.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)