Page MenuHomePhabricator

Remove logstash from the Search Elasticsearch servers
Closed, DeclinedPublic5 Estimated Story Points

Description

A local logstash instance was added to the Elasticsearch servers to manage log ingestion. Now that we have upgraded to Elasticsearch 7.x, we should have a proper json logger available and should be able to directly ship logs to central Logstash without a local logstash instance.

AC:

  • Elasticsearch logs are properly ingested by our central Logstash cluster
  • logstash is removed from all Elasticsearch servers

Event Timeline

MPhamWMF triaged this task as Medium priority.Dec 5 2022, 4:30 PM
MPhamWMF moved this task from needs triage to Ops / SRE on the Discovery-Search board.

Change 885438 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic: add udp_json_logback_compat_profile

https://gerrit.wikimedia.org/r/885438

Change 885439 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic: add ESJsonLayout log config

https://gerrit.wikimedia.org/r/885439

Change 885438 merged by Bking:

[operations/puppet@production] elastic: add udp_json_logback_compat_profile

https://gerrit.wikimedia.org/r/885438

bking updated Other Assignee, added: RKemper.

Change 885439 merged by Bking:

[operations/puppet@production] elastic relforge: add ESJsonLayout log config

https://gerrit.wikimedia.org/r/885439

Change 888078 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic relforge: update logstash transport

https://gerrit.wikimedia.org/r/888078

Change 888078 merged by Bking:

[operations/puppet@production] elastic relforge: update logstash transport

https://gerrit.wikimedia.org/r/888078

Mentioned in SAL (#wikimedia-operations) [2023-02-09T20:27:14Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge logging config change - bking@cumin1001 - T324335

Mentioned in SAL (#wikimedia-operations) [2023-02-09T20:31:01Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge logging config change - bking@cumin1001 - T324335

Mentioned in SAL (#wikimedia-operations) [2023-02-09T20:47:01Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge logging config change - bking@cumin1001 - T324335

Mentioned in SAL (#wikimedia-operations) [2023-02-09T20:50:52Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge logging config change - bking@cumin1001 - T324335

We've deployed configuration that looks very similar to what is used by OpenSearch, but messages don't make it to Logstash. I'm not entirely sure where those are dropped (elasticsearch, syslog, kafka, logstash). Some help from observability to track this down would be welcomed.

There are a couple issues.

One is in rsyslog where 50-udp-json-logback-compat.conf loads mmjsonparse and omkafka in conflict with 30-output-kafka.conf.

Feb 16 22:08:12 relforge1003 rsyslogd[2102764]: module 'mmjsonparse' already in this config, cannot be added  [v8.2102.0 try https://www.rsyslog.com/e/2221 ]
Feb 16 22:08:12 relforge1003 rsyslogd[2102764]: module 'omkafka' already in this config, cannot be added  [v8.2102.0 try https://www.rsyslog.com/e/2221 ]

I presume using profile::rsyslog::udp_localhost_compat is the right answer, but I don't know the major differences between udp_json_logback_compat and udp_localhost_compat. @herron could maybe weigh in on this?

The second is in ES7 and log4j.properties preventing sending the logs to rsyslog:

Feb 16 21:40:45 relforge1003 elasticsearch[2096645]: 2023-02-16 21:40:45,763 main ERROR Could not create plugin of type class org.apache.logging.log4j.core.appender.SyslogAppender for element syslog: java.lang.IllegalArgumentException: A>
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.net.DatagramSocketManager.getSocketManager(DatagramSocketManager.java:66)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.appender.SocketAppender.createSocketManager(SocketAppender.java:438)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.appender.SyslogAppender$Builder.build(SyslogAppender.java:134)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.appender.SyslogAppender$Builder.build(SyslogAppender.java:49)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.plugins.util.PluginBuilder.build(PluginBuilder.java:123)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.AbstractConfiguration.createPluginObject(AbstractConfiguration.java:959)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.AbstractConfiguration.createConfiguration(AbstractConfiguration.java:899)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.AbstractConfiguration.createConfiguration(AbstractConfiguration.java:891)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.AbstractConfiguration.doConfigure(AbstractConfiguration.java:514)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.AbstractConfiguration.initialize(AbstractConfiguration.java:238)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.AbstractConfiguration.start(AbstractConfiguration.java:250)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.LoggerContext.setConfiguration(LoggerContext.java:547)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.LoggerContext.start(LoggerContext.java:263)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.common.logging.LogConfigurator.configure(LogConfigurator.java:236)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.common.logging.LogConfigurator.configure(LogConfigurator.java:129)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:354)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:170)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:161)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:127)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.cli.Command.main(Command.java:90)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:126)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]: 2023-02-16 21:40:45,764 main ERROR Unable to invoke factory method in class org.apache.logging.log4j.core.appender.SyslogAppender for element syslog: java.lang.IllegalStateException: N>
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.plugins.util.PluginBuilder.findFactoryMethod(PluginBuilder.java:235)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.plugins.util.PluginBuilder.build(PluginBuilder.java:135)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.AbstractConfiguration.createPluginObject(AbstractConfiguration.java:959)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.AbstractConfiguration.createConfiguration(AbstractConfiguration.java:899)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.AbstractConfiguration.createConfiguration(AbstractConfiguration.java:891)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:170)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:161)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:127)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.cli.Command.main(Command.java:90)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:126)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]: 2023-02-16 21:40:45,764 main ERROR Unable to invoke factory method in class org.apache.logging.log4j.core.appender.SyslogAppender for element syslog: java.lang.IllegalStateException: N>
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.plugins.util.PluginBuilder.findFactoryMethod(PluginBuilder.java:235)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.plugins.util.PluginBuilder.build(PluginBuilder.java:135)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.AbstractConfiguration.createPluginObject(AbstractConfiguration.java:959)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.AbstractConfiguration.createConfiguration(AbstractConfiguration.java:899)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.AbstractConfiguration.createConfiguration(AbstractConfiguration.java:891)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.AbstractConfiguration.doConfigure(AbstractConfiguration.java:514)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.AbstractConfiguration.initialize(AbstractConfiguration.java:238)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.AbstractConfiguration.start(AbstractConfiguration.java:250)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.LoggerContext.setConfiguration(LoggerContext.java:547)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.LoggerContext.start(LoggerContext.java:263)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.common.logging.LogConfigurator.configure(LogConfigurator.java:236)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.common.logging.LogConfigurator.configure(LogConfigurator.java:129)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:354)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:170)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:161)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:127)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.cli.Command.main(Command.java:90)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:126)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]: 2023-02-16 21:40:45,765 main ERROR Null object returned for syslog in Appenders.
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]: 2023-02-16 21:40:45,766 main ERROR Unable to locate appender "ship_to_logstash" for logger config "root"
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]: 2023-02-16 21:40:45,766 main ERROR Unable to locate appender "ship_to_logstash" for logger config "index.indexing.slowlog.index"
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]: 2023-02-16 21:40:45,766 main ERROR Unable to locate appender "ship_to_logstash" for logger config "index.search.slowlog"

There are a couple issues.

One is in rsyslog where 50-udp-json-logback-compat.conf loads mmjsonparse and omkafka in conflict with 30-output-kafka.conf.

Feb 16 22:08:12 relforge1003 rsyslogd[2102764]: module 'mmjsonparse' already in this config, cannot be added  [v8.2102.0 try https://www.rsyslog.com/e/2221 ]
Feb 16 22:08:12 relforge1003 rsyslogd[2102764]: module 'omkafka' already in this config, cannot be added  [v8.2102.0 try https://www.rsyslog.com/e/2221 ]

Fwiw T292175 is a fairly long standing bug about this

I presume using profile::rsyslog::udp_localhost_compat is the right answer, but I don't know the major differences between udp_json_logback_compat and udp_localhost_compat. @herron could maybe weigh in on this?

Sure, the gist as I recall is udp_localhost_compat expects syslog cee formatted json, while udp_json_logback_compat expects raw json. The naming could definitely be improved. Hopefully the rsyslog templates help clarify further

ruleset(name="ruleset_udp_json_logback_compat" queue.type="LinkedList") {
...           template="template_udp_json_logback_compat"

template(name="template_udp_json_logback_compat" type="list") {
  property(name="$!all-json")
}
ruleset(name="udp_localhost_to_kafka" queue.type="LinkedList") {
...         template="syslog_cee"

template(name="syslog_cee" type="list") {
  constant(value="{")
  property(name="timereported" outname="timestamp" format="jsonf" dateformat="rfc3339")
  constant(value=", ")
  property(name="hostname" outname="logsource" format="jsonf")
  constant(value=", ")
  property(name="hostname" outname="host" format="jsonf")
  constant(value=", ")
  property(name="programname" outname="program" format="jsonf")
  constant(value=", ")
  property(name="syslogseverity-text" outname="severity" format="jsonf")
  constant(value=", ")
  property(name="syslogfacility-text" outname="facility" format="jsonf")
  constant(value=", ")
  property(name="timereported" outname="rsyslog.timereported" format="jsonf" dateformat="rfc3339")
  constant(value=", ")
  property(name="hostname" outname="rsyslog.hostname" format="jsonf")
  constant(value=", ")
  property(name="programname" outname="rsyslog.programname" format="jsonf")
  constant(value=", ")
  property(name="syslogseverity-text" outname="rsyslog.severity" format="jsonf")
  constant(value=", ")
  property(name="syslogfacility-text" outname="rsyslog.facility" format="jsonf")
  constant(value=", ")
  # Skip leading curly brace
  property(name="$!all-json" position.from="2")
}

We talked about this in the Observability meeting today and determined rsyslog is not blocking the messages.

After investigation, configuring log4j to talk directly to syslog is adding too much complexity related to the Java Security Manager. We will keep logstash to do log forwarding for now.

After investigation, configuring log4j to talk directly to syslog is adding too much complexity related to the Java Security Manager. We will keep logstash to do log forwarding for now.

And for posterity's sake, here's an example Security Manager related log message we were seeing while still working on this ticket:

Feb 21 20:34:45 relforge1003 elasticsearch[3475059]: 2023-02-21 20:34:45,584 main ERROR An exception occurred processing Appender ship_to_logstash java.security.AccessControlException: access denied ("java.net.SocketPermission" "127.0.0.1:12201" "connect,resolve")

(That log line is from when we tried it with gelf's port instead of the syslog one to see if that would fix things, so the port number is different from what it would normally show but it illustrates the error)

Mentioned in SAL (#wikimedia-operations) [2023-11-16T15:17:08Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: revert logstash changes - bking@cumin2002 - T324335

Mentioned in SAL (#wikimedia-operations) [2023-11-16T15:21:55Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: revert logstash changes - bking@cumin2002 - T324335