Page MenuHomePhabricator

Remove logstash from the CirrusSearch servers
Open, MediumPublic5 Estimated Story Points

Description

Reopening as of 10 Apr 2026.

Cirrussearch hosts run a local logstash to improve log formatting before piping to the Observability logstash infra. We tried (and failed) to remove the local logstash in 2023. In the 2 1/2 years since then, we've migrated from Elastic->OpenSearch, so now is a good time to revisit. Removing a dependency on Puppet plans we don't own (logstash) should improve our lives and the lives of Observability (who do own the code).

AC:

  • Remove logstash from cirrus Puppet plans
  • Find a better way to format logs and apply it.

Prior AC:

A local logstash instance was added to the Elasticsearch servers to manage log ingestion. Now that we have upgraded to Elasticsearch 7.x, we should have a proper json logger available and should be able to directly ship logs to central Logstash without a local logstash instance.

AC:

  • Elasticsearch logs are properly ingested by our central Logstash cluster
  • logstash is removed from all Elasticsearch servers

Event Timeline

MPhamWMF moved this task from needs triage to Ops / SRE on the Discovery-Search board.

Change 885438 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic: add udp_json_logback_compat_profile

https://gerrit.wikimedia.org/r/885438

Change 885439 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic: add ESJsonLayout log config

https://gerrit.wikimedia.org/r/885439

Change 885438 merged by Bking:

[operations/puppet@production] elastic: add udp_json_logback_compat_profile

https://gerrit.wikimedia.org/r/885438

bking updated Other Assignee, added: RKemper.

Change 885439 merged by Bking:

[operations/puppet@production] elastic relforge: add ESJsonLayout log config

https://gerrit.wikimedia.org/r/885439

Change 888078 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic relforge: update logstash transport

https://gerrit.wikimedia.org/r/888078

Change 888078 merged by Bking:

[operations/puppet@production] elastic relforge: update logstash transport

https://gerrit.wikimedia.org/r/888078

Mentioned in SAL (#wikimedia-operations) [2023-02-09T20:27:14Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge logging config change - bking@cumin1001 - T324335

Mentioned in SAL (#wikimedia-operations) [2023-02-09T20:31:01Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge logging config change - bking@cumin1001 - T324335

Mentioned in SAL (#wikimedia-operations) [2023-02-09T20:47:01Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge logging config change - bking@cumin1001 - T324335

Mentioned in SAL (#wikimedia-operations) [2023-02-09T20:50:52Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge logging config change - bking@cumin1001 - T324335

We've deployed configuration that looks very similar to what is used by OpenSearch, but messages don't make it to Logstash. I'm not entirely sure where those are dropped (elasticsearch, syslog, kafka, logstash). Some help from observability to track this down would be welcomed.

There are a couple issues.

One is in rsyslog where 50-udp-json-logback-compat.conf loads mmjsonparse and omkafka in conflict with 30-output-kafka.conf.

Feb 16 22:08:12 relforge1003 rsyslogd[2102764]: module 'mmjsonparse' already in this config, cannot be added  [v8.2102.0 try https://www.rsyslog.com/e/2221 ]
Feb 16 22:08:12 relforge1003 rsyslogd[2102764]: module 'omkafka' already in this config, cannot be added  [v8.2102.0 try https://www.rsyslog.com/e/2221 ]

I presume using profile::rsyslog::udp_localhost_compat is the right answer, but I don't know the major differences between udp_json_logback_compat and udp_localhost_compat. @herron could maybe weigh in on this?

The second is in ES7 and log4j.properties preventing sending the logs to rsyslog:

Feb 16 21:40:45 relforge1003 elasticsearch[2096645]: 2023-02-16 21:40:45,763 main ERROR Could not create plugin of type class org.apache.logging.log4j.core.appender.SyslogAppender for element syslog: java.lang.IllegalArgumentException: A>
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.net.DatagramSocketManager.getSocketManager(DatagramSocketManager.java:66)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.appender.SocketAppender.createSocketManager(SocketAppender.java:438)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.appender.SyslogAppender$Builder.build(SyslogAppender.java:134)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.appender.SyslogAppender$Builder.build(SyslogAppender.java:49)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.plugins.util.PluginBuilder.build(PluginBuilder.java:123)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.AbstractConfiguration.createPluginObject(AbstractConfiguration.java:959)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.AbstractConfiguration.createConfiguration(AbstractConfiguration.java:899)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.AbstractConfiguration.createConfiguration(AbstractConfiguration.java:891)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.AbstractConfiguration.doConfigure(AbstractConfiguration.java:514)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.AbstractConfiguration.initialize(AbstractConfiguration.java:238)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.AbstractConfiguration.start(AbstractConfiguration.java:250)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.LoggerContext.setConfiguration(LoggerContext.java:547)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.LoggerContext.start(LoggerContext.java:263)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.common.logging.LogConfigurator.configure(LogConfigurator.java:236)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.common.logging.LogConfigurator.configure(LogConfigurator.java:129)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:354)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:170)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:161)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:127)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.cli.Command.main(Command.java:90)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:126)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]: 2023-02-16 21:40:45,764 main ERROR Unable to invoke factory method in class org.apache.logging.log4j.core.appender.SyslogAppender for element syslog: java.lang.IllegalStateException: N>
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.plugins.util.PluginBuilder.findFactoryMethod(PluginBuilder.java:235)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.plugins.util.PluginBuilder.build(PluginBuilder.java:135)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.AbstractConfiguration.createPluginObject(AbstractConfiguration.java:959)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.AbstractConfiguration.createConfiguration(AbstractConfiguration.java:899)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.AbstractConfiguration.createConfiguration(AbstractConfiguration.java:891)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:170)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:161)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:127)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.cli.Command.main(Command.java:90)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:126)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]: 2023-02-16 21:40:45,764 main ERROR Unable to invoke factory method in class org.apache.logging.log4j.core.appender.SyslogAppender for element syslog: java.lang.IllegalStateException: N>
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.plugins.util.PluginBuilder.findFactoryMethod(PluginBuilder.java:235)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.plugins.util.PluginBuilder.build(PluginBuilder.java:135)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.AbstractConfiguration.createPluginObject(AbstractConfiguration.java:959)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.AbstractConfiguration.createConfiguration(AbstractConfiguration.java:899)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.AbstractConfiguration.createConfiguration(AbstractConfiguration.java:891)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.AbstractConfiguration.doConfigure(AbstractConfiguration.java:514)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.AbstractConfiguration.initialize(AbstractConfiguration.java:238)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.config.AbstractConfiguration.start(AbstractConfiguration.java:250)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.LoggerContext.setConfiguration(LoggerContext.java:547)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.apache.logging.log4j.core.LoggerContext.start(LoggerContext.java:263)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.common.logging.LogConfigurator.configure(LogConfigurator.java:236)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.common.logging.LogConfigurator.configure(LogConfigurator.java:129)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:354)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:170)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:161)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:127)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.cli.Command.main(Command.java:90)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:126)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]:         at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92)
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]: 2023-02-16 21:40:45,765 main ERROR Null object returned for syslog in Appenders.
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]: 2023-02-16 21:40:45,766 main ERROR Unable to locate appender "ship_to_logstash" for logger config "root"
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]: 2023-02-16 21:40:45,766 main ERROR Unable to locate appender "ship_to_logstash" for logger config "index.indexing.slowlog.index"
Feb 16 21:40:45 relforge1003 elasticsearch[2096645]: 2023-02-16 21:40:45,766 main ERROR Unable to locate appender "ship_to_logstash" for logger config "index.search.slowlog"

There are a couple issues.

One is in rsyslog where 50-udp-json-logback-compat.conf loads mmjsonparse and omkafka in conflict with 30-output-kafka.conf.

Feb 16 22:08:12 relforge1003 rsyslogd[2102764]: module 'mmjsonparse' already in this config, cannot be added  [v8.2102.0 try https://www.rsyslog.com/e/2221 ]
Feb 16 22:08:12 relforge1003 rsyslogd[2102764]: module 'omkafka' already in this config, cannot be added  [v8.2102.0 try https://www.rsyslog.com/e/2221 ]

Fwiw T292175 is a fairly long standing bug about this

I presume using profile::rsyslog::udp_localhost_compat is the right answer, but I don't know the major differences between udp_json_logback_compat and udp_localhost_compat. @herron could maybe weigh in on this?

Sure, the gist as I recall is udp_localhost_compat expects syslog cee formatted json, while udp_json_logback_compat expects raw json. The naming could definitely be improved. Hopefully the rsyslog templates help clarify further

ruleset(name="ruleset_udp_json_logback_compat" queue.type="LinkedList") {
...           template="template_udp_json_logback_compat"

template(name="template_udp_json_logback_compat" type="list") {
  property(name="$!all-json")
}
ruleset(name="udp_localhost_to_kafka" queue.type="LinkedList") {
...         template="syslog_cee"

template(name="syslog_cee" type="list") {
  constant(value="{")
  property(name="timereported" outname="timestamp" format="jsonf" dateformat="rfc3339")
  constant(value=", ")
  property(name="hostname" outname="logsource" format="jsonf")
  constant(value=", ")
  property(name="hostname" outname="host" format="jsonf")
  constant(value=", ")
  property(name="programname" outname="program" format="jsonf")
  constant(value=", ")
  property(name="syslogseverity-text" outname="severity" format="jsonf")
  constant(value=", ")
  property(name="syslogfacility-text" outname="facility" format="jsonf")
  constant(value=", ")
  property(name="timereported" outname="rsyslog.timereported" format="jsonf" dateformat="rfc3339")
  constant(value=", ")
  property(name="hostname" outname="rsyslog.hostname" format="jsonf")
  constant(value=", ")
  property(name="programname" outname="rsyslog.programname" format="jsonf")
  constant(value=", ")
  property(name="syslogseverity-text" outname="rsyslog.severity" format="jsonf")
  constant(value=", ")
  property(name="syslogfacility-text" outname="rsyslog.facility" format="jsonf")
  constant(value=", ")
  # Skip leading curly brace
  property(name="$!all-json" position.from="2")
}

We talked about this in the Observability meeting today and determined rsyslog is not blocking the messages.

After investigation, configuring log4j to talk directly to syslog is adding too much complexity related to the Java Security Manager. We will keep logstash to do log forwarding for now.

After investigation, configuring log4j to talk directly to syslog is adding too much complexity related to the Java Security Manager. We will keep logstash to do log forwarding for now.

And for posterity's sake, here's an example Security Manager related log message we were seeing while still working on this ticket:

Feb 21 20:34:45 relforge1003 elasticsearch[3475059]: 2023-02-21 20:34:45,584 main ERROR An exception occurred processing Appender ship_to_logstash java.security.AccessControlException: access denied ("java.net.SocketPermission" "127.0.0.1:12201" "connect,resolve")

(That log line is from when we tried it with gelf's port instead of the syslog one to see if that would fix things, so the port number is different from what it would normally show but it illustrates the error)

Mentioned in SAL (#wikimedia-operations) [2023-11-16T15:17:08Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: revert logstash changes - bking@cumin2002 - T324335

Mentioned in SAL (#wikimedia-operations) [2023-11-16T15:21:55Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: revert logstash changes - bking@cumin2002 - T324335

bking renamed this task from Remove logstash from the Search Elasticsearch servers to Remove logstash from the CirrusSearch servers.Fri, Apr 10, 6:56 PM
bking reopened this task as Open.
bking updated the task description. (Show Details)

This thread from OpenSearch Slack might be useful. Specifically, Michael Froh's example log4j config:

appender.console.layout.alwaysWriteExceptions = false
appender.console.layout.pattern = {"level":"%p","timestamp":"%d{ISO8601}","thread":"%t","file":"%F","line":"%L","message":"%notEmpty{%enc{%marker}{JSON}}%enc{%m}{JSON}"%notEmpty{,"stacktrace":["%throwable{separator(",")}"]}}%n

Another user expressed the following concerns:

Formatting logs into JSON by a PatternLayout can break by "JSON injection". Depending on the log message, the data may be corrupted. There are several options to let log4j write JSON in a safe manner. Log4j2 has its own JSONLayout for that use-case. Have you tried out something like that?

Change #1270082 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: remove logstash profile

https://gerrit.wikimedia.org/r/1270082

Change #1270082 merged by Bking:

[operations/puppet@production] cloudelastic: remove logstash profile

https://gerrit.wikimedia.org/r/1270082

Icinga downtime and Alertmanager silence (ID=ca31fb96-2b51-4966-8a46-aba99844d64d) set by bking@cumin2002 for 5 days, 0:00:00 on 1 host(s) and their services with reason: still fixing Puppet

cloudelastic1012.eqiad.wmnet