
Look into shoving gerrit logs into logstash
Open, NormalPublic

Description

What it says on the tin. Should be doable either by configuring log4j to use a socket appender to talk to Logstash directly, or by shipping the json-format logs over to Logstash and massaging them in.

Would be nice not only for my own sanity, but also makes it easier for others to debug when things aren't going right.

Event Timeline


Even with Logstash down, Gerrit still works.

Change 330420 had a related patch set (by Paladox) published:
Gerrit: Enable port 4560 so it can connect to prod logstash

https://gerrit.wikimedia.org/r/330420

Change 330420 abandoned by Paladox:
Gerrit: Enable port 4560 so it can connect to prod logstash

https://gerrit.wikimedia.org/r/330420

Change 326177 merged by Dzahn:
Gerrit: Enable logstash in gerrit

https://gerrit.wikimedia.org/r/326177

Change 330832 had a related patch set uploaded (by Paladox):
Gerrit: Enable logstash in gerrit

https://gerrit.wikimedia.org/r/330832

Change 330832 merged by Dzahn:
Gerrit: Add support for logstash in gerrit

https://gerrit.wikimedia.org/r/330832

Change 332531 had a related patch set (by Paladox) published:
Gerrit: Enable logstash by default for prod gerrit

https://gerrit.wikimedia.org/r/332531

@demon should we try merging the patch for logstash again?

https://gerrit.wikimedia.org/r/332531

demon added a comment.Sep 8 2017, 8:24 PM

I guess. Or just abandon the project. I haven't had time to care enough and nobody really looks at the logs except me anyway.

Dzahn added a subscriber: Dzahn.Sep 8 2017, 9:04 PM

Checked the firewall setup to make sure that wasn't our blocker (anymore) before we try it again. It looks like it should just work. I can connect just fine with telnet and see iptables rules that cover source cobalt, both v4 and v6.

cobalt.wikimedia.org has address 208.80.154.81
cobalt.wikimedia.org has IPv6 address 2620:0:861:3:208:80:154:81

covered by:

ACCEPT tcp -- 208.80.154.0/26 anywhere tcp dpt:4560
ACCEPT tcp 2620:0:861:3::/64 anywhere tcp dpt:4560

[cobalt:~] $ telnet logstash.svc.eqiad.wmnet 4560
Trying 10.2.2.36...
Connected to logstash.svc.eqiad.wmnet.
Escape character is '^]'.
Dzahn added a comment.Sep 8 2017, 9:05 PM

Also works from gerrit2001, fwiw; we could try from there first, as paladox suggested.

[gerrit2001:~] $  telnet logstash.svc.eqiad.wmnet 4560
Trying 10.2.2.36...
Connected to logstash.svc.eqiad.wmnet.
Escape character is '^]'.
Dzahn added a comment.Sep 8 2017, 10:26 PM

Also tried the log4j.properties config change on gerrit2001 and restarted the service. While I could not see actual log entries in Kibana yet, at least nothing crashed and the Gerrit service came back up just fine, unlike last time.

I tested on my test instance using mediawiki-vagrant, which had Kibana and Logstash. Using log4j would not send anything, even though it connected.

Also, the log4j Logstash plugin is deprecated and will be removed later. They recommend moving to Filebeat, which would let us choose which files it reads (e.g. error_log) instead of relying on the log4j logger, which does not seem to be working.

Change 377058 had a related patch set uploaded (by Paladox; owner: Paladox):
[operations/puppet@production] Logstash: Add beats support

https://gerrit.wikimedia.org/r/377058

Paladox added a comment.EditedSep 10 2017, 6:40 PM

I'm wondering, could we upload the Filebeat apt package from Elastic to apt.wikimedia.org, please?

https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-installation.html

The package is not in packages.debian.org.

Change 377060 had a related patch set uploaded (by Paladox; owner: Paladox):
[operations/puppet@production] Gerrit: Install and configure filebeats

https://gerrit.wikimedia.org/r/377060

Gehel added a subscriber: Gehel.EditedSep 11 2017, 7:28 AM

We already have a number of solutions to ship logs from log4j to logstash. A number of applications use the gelf input (12201/UDP): elasticsearch, hadoop, ...

A log4j socket input is also available on 4560/TCP (as this is TCP, asynchronicity has to be taken care of on the log4j side with an AsyncAppender).

I'm not very keen on adding plugins to logstash. I just managed to get rid of all the plugins we had (T174933) which greatly simplifies the logstash upgrades.

Side note: I'm surprised that gerrit is still using log4j and has not migrated to logback yet. For logback, the logstash-logback-encoder is available.

Paladox added a comment.EditedSep 11 2017, 9:19 AM

@Gehel oh, I wasn’t aware there was gelf4j. I’m wondering, could you help with that, please?

Looking at https://github.com/realityforge/gelf4j, the example shows the host name as a Graylog host.

Do we use the Logstash host name there?

Gehel added a comment.Sep 11 2017, 9:31 AM

As hostname, you want to use the LVS service name (logstash.svc.eqiad.wmnet) and the 12201/UDP port. You can have a look at the elasticsearch configuration as an example.

Thanks.

Side note: I filed upstream https://bugs.chromium.org/p/gerrit/issues/detail?id=7187 for logback support.

Change 377060 abandoned by Paladox:
Gerrit: Install and configure filebeats

https://gerrit.wikimedia.org/r/377060

Change 377058 abandoned by Paladox:
Logstash: Add beats support

https://gerrit.wikimedia.org/r/377058

Change 392079 had a related patch set (by Paladox) published:
[operations/puppet@production] Gerrit: Fix up logstash configuration

https://gerrit.wikimedia.org/r/392079

Change 392083 had a related patch set (by Paladox) published:
[operations/puppet@production] Gerrit: Enable logstash for gerrit

https://gerrit.wikimedia.org/r/392083

Mentioned in SAL (#wikimedia-operations) [2017-11-21T22:12:38Z] <mutante> gerrit - temp disable puppet on cobalt (prod gerrit), test switching gerrit logging to logstash on gerrit2001 - gerrit:392079 gerrit:392083 T141324

Change 392079 merged by Dzahn:
[operations/puppet@production] Gerrit: Fix up logstash configuration

https://gerrit.wikimedia.org/r/392079

Change 392083 merged by Dzahn:
[operations/puppet@production] Gerrit: Enable logstash for gerrit

https://gerrit.wikimedia.org/r/392083

Mentioned in SAL (#wikimedia-operations) [2017-11-22T00:40:24Z] <mutante> gerrit - re-enabling puppet to apply logstash change on cobalt, gerrit restart incoming (T141324)

I applied the changes first on gerrit2001 and later on cobalt. There seemed to be no issues. I can confirm on cobalt with tcpdump there is traffic to logstash.svc now:

00:54:08.411205 IP cobalt.wikimedia.org.50226 > logstash.svc.eqiad.wmnet.4560: ...
00:54:08.411875 IP logstash.svc.eqiad.wmnet.4560 > cobalt.wikimedia.org.50226:
...

That being said, I can't confirm anything on the actual logstash.wikimedia.org web UI default dashboard, or when searching for "type:log4j" or "cobalt"/"gerrit".

@Gehel hi, we are wondering, could you help us figure out why Logstash is not showing the logs, please?

@Dzahn did a tcpdump and found outgoing traffic to Logstash, but the UI is not showing the logs. It works locally for me on my install, though.

Thanks to @EBernhardson we figured out our problem. The events need an "es" tag, and the log4j filter that adds it was missing, which I've added in https://gerrit.wikimedia.org/r/#/c/392897/. Once deployed it should start working, and we can then search by host with host:"gerrit".
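For reference, the kind of filter change described would look roughly like this sketch (a hypothetical fragment in the same style as the logstash input config quoted below; the actual patch may differ):

```
filter {
  # Tag log4j events so the elasticsearch output picks them up
  if [type] == "log4j" {
    mutate { add_tag => [ "es" ] }
  }
}
```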

https://gerrit.wikimedia.org/r/#/c/392943/ adds a reconnection param to make sure Gerrit reconnects to Logstash after Logstash is restarted.

https://logstash.wikimedia.org/ now shows the first log lines from cobalt :))

Chad started working on a dashboard for it at https://logstash.wikimedia.org/goto/8e6b60e186e4409e3a1b0d02cc96fda3

You can see them by searching for "type:log4j" in kibana.

Dzahn closed this task as Resolved.Nov 22 2017, 10:22 PM
Dzahn claimed this task.
Dzahn reassigned this task from Dzahn to Paladox.
Dzahn awarded a token.

Change 332531 abandoned by Paladox:
Gerrit: Enable logstash by default for prod gerrit

Reason:
Ah yeh. Thanks.

https://gerrit.wikimedia.org/r/332531

herron reopened this task as Open.Nov 26 2018, 9:14 PM
herron added a subscriber: herron.

Reopening this because I'm not seeing any logs for host:cobalt in logstash currently, and we're in the process of moving log producers over to the new rsyslog -> kafka -> logstash pipeline in T205852

Change 475840 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] rsyslog:input:file add multiline handling and ship gerrit logs to ELK

https://gerrit.wikimedia.org/r/475840

Reopening this because I'm not seeing any logs for host:cobalt in logstash currently, and we're in the process of moving log producers over to the new rsyslog -> kafka -> logstash pipeline in T205852

I've noticed that the only way I am able to find gerrit logs in logstash is via the query type:log4j. The host for those is listed as 208.80.154.81 with some high port number (e.g., host:208.80.154.81:44588). It would be nice to get this corrected.

Reopening this because I'm not seeing any logs for host:cobalt in logstash currently, and we're in the process of moving log producers over to the new rsyslog -> kafka -> logstash pipeline in T205852

I've noticed that the only way I am able to find gerrit logs in logstash is via the query type:log4j. The host for those is listed as 208.80.154.81 with some high port number (e.g., host:208.80.154.81:44588). It would be nice to get this corrected.

Thanks! I see what you mean. But for some reason I'm still not seeing gerrit error logs in logstash.

We could address the host field and error logs by channeling gerrit logs through rsyslog (and kafka). https://gerrit.wikimedia.org/r/475840 is an approach to do that by tailing the gerrit log files.

hashar added a subscriber: hashar.Nov 27 2018, 9:26 AM

One can find the logs by searching for type:log4j.

log4j logs are emitted with org.apache.log4j.net.SocketAppender and sent directly to the Logstash collector. The Logstash configuration uses the log4j plugin:

input {
  # Log4j socket-based logger
  log4j {
    type => "log4j"
    port => "<%= @port %>"
    mode => "<%= @mode %>"
    host => "<%= @host %>"
<%- unless @plugin_id.empty? -%>
    id   => "<%= @plugin_id %>"
<%- end -%>
  }
}

I felt it was a good thing to do since log4j and the input plugin provide some potentially useful fields, such as:

logger_name  org.apache.sshd.server.session.ServerConnectionService
thread       sshd-SshServer[167bae0b]-nio2-thread-3
class        ?
file         ?

Then Elastic has a blog post about the future of the log4j input plugin, which emphasizes the plugin's deprecation, with removal planned for Logstash 6.0.

They recommend using Filebeat instead (whatever that may be):
https://www.elastic.co/guide/en/logstash/current/plugins-inputs-log4j.html#_deprecation_notice

Essentially that means having log4j configured to write to a log file on disk which is then consumed by filebeat. So I guess we can roll with our own rsyslog/kafka transport.

An interesting bit of config:

log4j.appender.daily.layout = org.apache.log4j.PatternLayout
log4j.appender.daily.layout.ConversionPattern=%d{YYYY-MM-dd HH:mm:ss,SSSZ} %p %c{1}:%L - %m%n

log4j PatternLayout formats are documented at https://logging.apache.org/log4j/2.x/manual/layouts.html#PatternLayout. There is support for various serialization formats (GELF, enhanced syslog RFC 5424, ...) for which there might already be a built-in Logstash plugin.
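As an illustration, the error_log pattern used elsewhere in this task, `[%d] [%t] %-5p %c %x: %m%n`, can be emulated like this (a Python sketch; the %d rendering and empty-%x handling are assumptions, not log4j's actual code):

```python
from datetime import datetime

def format_line(ts, thread, level, logger, ndc, msg):
    """Emulate log4j's "[%d] [%t] %-5p %c %x: %m%n" ConversionPattern."""
    # %d default format: 2018-12-12 08:24:12,907 (milliseconds after a comma)
    stamp = ts.strftime('%Y-%m-%d %H:%M:%S') + ',%03d' % (ts.microsecond // 1000)
    # %-5p left-pads the level to 5 chars; %x (the NDC) is often empty
    return '[%s] [%s] %-5s %s %s: %s\n' % (stamp, thread, level, logger, ndc, msg)
```

This reproduces the shape of the error_log lines quoted in this task, which is what any downstream parser has to undo.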

We have other stacks using log4j (elasticsearch, confluent, cdh, druid, zookeeper). We most probably want to align all of them to use the same mechanism. The Search Platform team might be able to help on that front since they use java on a daily basis and are familiar with Log4j / Elasticsearch.

Change 475840 merged by Herron:
[operations/puppet@production] rsyslog: input::file add multiline handling & ship gerrit logs to ELK

https://gerrit.wikimedia.org/r/475840

Change 476592 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] rsyslog: input::file add multiline handling & ship gerrit logs to ELK

https://gerrit.wikimedia.org/r/476592

Change 476592 merged by Herron:
[operations/puppet@production] rsyslog: input::file add multiline handling & ship gerrit logs to ELK

https://gerrit.wikimedia.org/r/476592

Gerrit logs are being shipped now via a rsyslog imfile to the kafka -> logstash pipeline. There is some multi-line parsing enabled for these, but if desirable we can further tune the field parsing on these events. FWIW the pre-existing log4j method is still enabled, but when ready we could look at turning that down.

Example logs can be found with a search of host:cobalt. Please have a look at the most recent logs (past hour or so), as parsing/formatting has improved since initially enabling this source via the new pipeline.

Paladox closed this task as Resolved.Dec 4 2018, 6:42 PM
hashar reopened this task as Open.Dec 12 2018, 8:52 AM

I was at an offsite last week and could not verify the logs. I can confirm there are logs with host:cobalt. However, we lose information when processing the logs coming from rsyslog. Notably the severity (info, error) is always set to notice.

An example:

[2018-12-12 08:24:12,907] [Index-Batch-1] ERROR com.google.gerrit.index.Schema : error getting field file of ChangeData{Change{415384 (Ic5fa53348577756c0700df717f6b06b9d8de9cde), dest=operations\/puppet,refs\/heads\/production, status=n}}
com.google.gwtorm.server.OrmException: org.eclipse.jgit.errors.MissingObjectException: Missing unknown 1173c3f63c65f33ac0d063cc22c572b5f95d8244

Which in logstash has:

Field      Value
_type      syslog
facility   local0
host       cobalt
level      NOTICE
logsource  cobalt
program    input-file-gerrit-multiline
tags       rsyslog-shipper, kafka, syslog, es, rsyslog_multiline, normalized_message_trimmed
type       syslog

Seems we need a slightly smarter parser to extract the severity. It would also be interesting to extract the Gerrit log bucket (e.g. Index-Batch-1).
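To illustrate, a smarter extraction could look like this sketch (Python; the regex and field names are assumptions based on the error_log format shown above, not the actual rsyslog/logstash rules):

```python
import re

# Matches error_log-style lines such as:
#   [2018-12-12 08:24:12,907] [Index-Batch-1] ERROR com.google.gerrit.index.Schema : message
LINE_RE = re.compile(
    r'^\[(?P<ts>[\d\- :,+]+)\]\s+'      # timestamp, optionally with a +0000 zone
    r'\[(?P<thread>[^\]]+)\]\s+'        # Gerrit log bucket / thread, e.g. Index-Batch-1
    r'(?P<level>TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\s+'
    r'(?P<logger>\S+)\s*:\s*'
    r'(?P<msg>.*)$'
)

def parse_gerrit_line(line):
    """Return a dict of fields, or None for lines in another format (e.g. sshd_log)."""
    m = LINE_RE.match(line)
    return m.groupdict() if m else None
```

Lines in other layouts (like the sshd LOGOUT example below) would not match, which is exactly the normalization problem this comment describes.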

I also note some messages emitted by Gerrit have different timestamp formats. The login ones have the timezone:

[2018-12-12 08:25:03,907 +0000] 6285825c JohnDoe a\/1234 LOGOUT
[2018-12-12 08:25:00,954] [e20df2f0] Push to gerrit2@gerrit2001.wikimedia.org...
hashar added a comment.EditedDec 12 2018, 9:05 AM

If I got it right, we are using log files as input via:

# Ship gerrit logs to ELK using startmsg_regex pattern to join multi-line events based on datestamp
# example: [2018-11-28 21:53:07,446]
rsyslog::input::file { 'gerrit-multiline':
    path           => '/var/log/gerrit/*_log',
    startmsg_regex => '^\\\\[[0-9,-\\\\ \\\\:]+\\\\]',
}
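In effect, that startmsg_regex starts a new event at each datestamp-prefixed line and folds everything else (e.g. stack traces) into the previous event. A sketch of that joining behavior (Python, with a simplified version of the regex; the helper name is made up):

```python
import re

# Simplified startmsg_regex: a new event begins whenever a line starts
# with a "[2018-11-28 21:53:07,446]"-style datestamp.
START_RE = re.compile(r'^\[[0-9,\- :+]+\]')

def join_multiline(lines):
    """Join continuation lines (stack traces etc.) into the preceding event."""
    events, current = [], []
    for line in lines:
        if START_RE.match(line) and current:
            events.append('\n'.join(current))
            current = []
        current.append(line)
    if current:
        events.append('\n'.join(current))
    return events
```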

Those files are generated by log4j and their formats are defined with, e.g.:

modules/gerrit/templates/log4j.xml.erb
<appender name="replication_log" class="org.apache.log4j.DailyRollingFileAppender">
    <param name="encoding" value="UTF-8"/>
    <param name="File" value="/var/log/gerrit/replication_log"/>
    <!--<param name="fileAppend" value="true"/>-->
    <param name="immediateFlush" value="true"/>
    <param name="threshold" value="INFO"/>
    <layout class="org.apache.log4j.PatternLayout">
        <param name="ConversionPattern" value="[%d] [%t] %-5p %x: %m%n"/>
    </layout>
</appender>

But some do not have a value for ConversionPattern, for example the sshd_log file (which emits the LOGOUT message from my previous example):

<appender name="sshd_log" class="org.apache.log4j.DailyRollingFileAppender">
    <param name="encoding" value="UTF-8"/>
    <param name="File" value="/var/log/gerrit/sshd_log"/>
    <!--<param name="fileAppend" value="true"/>-->
    <param name="immediateFlush" value="true"/>
    <param name="threshold" value="INFO"/>
    <layout class="com.google.gerrit.sshd.SshLogLayout"/>
</appender>

But maybe com.google.gerrit.sshd.SshLogLayout is different from org.apache.log4j.PatternLayout.

Would be nice to have them all normalized in a way that is easy to process in the gerrit-multiline input.

The other thing I have noticed is that Gerrit's log4j seems to send root logging to logstash.svc.eqiad.wmnet:4560. I cannot tell:

  • whether it sends everything (catch-all) or whether that is the fallback appender for events not processed by the other appenders;
  • what is actually being sent directly to logstash.svc.eqiad.wmnet:4560.

The relevant log4j configuration bits:

/var/lib/gerrit2/review_site/etc/log4j.xml
<log4j:configuration>
    <appender name="error_log" class="org.apache.log4j.DailyRollingFileAppender">
        <param name="encoding" value="UTF-8"/>
        <param name="File" value="/var/log/gerrit/error_log"/>
        <!--<param name="fileAppend" value="true"/>-->
        <param name="immediateFlush" value="true"/>
        <param name="threshold" value="INFO"/>
        <layout class="org.apache.log4j.PatternLayout">
            <param name="ConversionPattern" value="[%d] [%t] %-5p %c %x: %m%n"/>
        </layout>
    </appender>

    <!-- Logstash -->
    <appender name="socket" class="org.apache.log4j.net.SocketAppender">
        <param name="port" value="4560"/>
        <param name="remoteHost" value="logstash.svc.eqiad.wmnet"/>
        <param name="ReconnectionDelay" value="30000"/>
    </appender>
    <appender name="async" class="org.apache.log4j.AsyncAppender">
        <param name="Blocking" value="false"/>
        <param name="BufferSize" value="64"/>
        <appender-ref ref="socket"/>
    </appender>


    <root>
        <level value="INFO"/>
        <appender-ref ref="error_log"/>
        <appender-ref ref="async"/>
    </root>

</log4j:configuration>
Paladox closed this task as Resolved.Feb 12 2019, 5:27 PM

@herron did this. We can remove the socket part from log4j now.

Closing as resolved as the logs should be in logstash now.

Paladox reopened this task as Open.Feb 12 2019, 5:28 PM

Oh, I just saw hashar's last comments.

Gehel added a comment.Apr 18 2019, 2:39 PM

I had a conversation with @hashar about this topic. So here are a few ideas:

  • treating log messages as text only will lose a ton of useful information (or require significant effort to re-parse them)
  • structured logging from log4j can be exposed in a number of ways. The easiest is probably to continue logging to file, use syslog for transport, but have messages in a structured format. A json layout could be used for that (here is one from Jetbrains, which I haven't tested, but I tend to trust Jetbrains). This would allow outputting the full logging context and not just a few basic fields.
  • log4j (which is used by gerrit) is an outdated logging framework; better solutions are log4j2 (the "2" is actually a completely different version, quite far from log4j v1) or logback. But we'll probably wait for upstream to upgrade their logging.

I have a patch upstream to add support for log4j2, but since then upstream has a new logging frontend called flogger, which calls into log4j as the backend. log4j2 support will need to be added to flogger so that upstream can support log4j2.

See https://gerrit-review.googlesource.com/c/gerrit/+/142811

  • structured logging from log4j can be exposed in a number of ways. The easier is probably to continue logging to file, use syslog for transport, but have messages in a structured format. A json layout could be used for that (here is one from Jetbrains, which I haven't tested, but I tend to trust Jetbrains). This would allow to output all logging context and not just a few basic information.

This sounds like a solid option taking into account the ongoing migration towards shipping to logstash via the local rsyslogd, which produces to the Kafka logging pipeline. Can this output json formatted messages to syslog prefixed by an @cee cookie?
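For illustration, the @cee convention just prefixes a JSON payload with a cookie that rsyslog's mmjsonparse module recognizes. A minimal sketch (Python; the field names in the example record are made up):

```python
import json

CEE_COOKIE = '@cee: '

def cee_line(record):
    """Render a structured record as an @cee-cookie syslog payload."""
    return CEE_COOKIE + json.dumps(record)

# e.g. cee_line({"level": "ERROR", "logger": "com.google.gerrit.index.Schema"})
```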

mmodell added a parent task: Restricted Task.May 3 2019, 3:34 PM
mmodell added a subscriber: mmodell.EditedMay 3 2019, 3:39 PM

flogger is google's fluent logger framework for java: https://google.github.io/flogger/

Their docs include Anatomy of a logging API which is a really good read about the benefits of the fluent api which I would imagine could apply to languages even beyond the jvm.

  • structured logging from log4j can be exposed in a number of ways. The easier is probably to continue logging to file, use syslog for transport, but have messages in a structured format. A json layout could be used for that (here is one from Jetbrains, which I haven't tested, but I tend to trust Jetbrains). This would allow to output all logging context and not just a few basic information.

This sounds like a solid option taking into account the ongoing migration towards shipping to logstash via the local rsyslogd, which produces to the Kafka logging pipeline. Can this output json formatted messages to syslog prefixed by an @cee cookie?

@Paladox points out that gerrit already supports some json log formatting: https://gerrit-review.googlesource.com/Documentation/config-gerrit.html#log.jsonLogging

It appears to use net.logstash.log4j.JSONEventLayoutV1 if that config is set, which seems promising: https://github.com/GerritCodeReview/gerrit/blob/482595dbafe10f78014a79c41c0a656fb6ef0a2d/java/com/google/gerrit/pgm/util/ErrorLogFile.java#L84-L89

Change 508391 had a related patch set uploaded (by Paladox; owner: Paladox):
[operations/puppet@production] Gerrit: Configure logging in json to error_log.json

https://gerrit.wikimedia.org/r/508391

Change 508391 merged by Dzahn:
[operations/puppet@production] Gerrit: Configure logging in json to gerrit.json

https://gerrit.wikimedia.org/r/508391

Dzahn added a comment.May 8 2019, 6:16 PM

Deployed the change above and restarted Gerrit. The new file /var/log/gerrit/gerrit.json has been created now and logs are written to it. Works! Thanks for the change, Paladox.

Deployed the change above and restarted Gerrit. The new file /var/log/gerrit/gerrit.json has been created now and logs are written to it. Works! Thanks for the change, Paladox.

nice :)

@herron I assume there are additional steps that we need to take in puppet to get these json logs to be consumed by rsyslog so they end up as structured logs in logstash?

Change 509172 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] gerrit: only ship gerrit.json and sshd_log to logstash, not *_log

https://gerrit.wikimedia.org/r/509172

Change 509172 merged by Dzahn:
[operations/puppet@production] gerrit: only ship gerrit.json to logstash, not *_log

https://gerrit.wikimedia.org/r/509172

...
Messages all look usable (apart from the msg field, which is the raw json that gets parsed into the individual fields). I do see a lot of "No cached mapping for this field" with a slightly scary-looking ! icon. I'm not clear what I can do about that :\

The tooltip also says to refresh the fields list from Management > Index patterns. Did that, and it brought in a lot of new fields as well as some type conflicts on:

  • request_time
  • response_size
  • upstream_time

Might be related to https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/538082/

There are some oddities such as:

level  WARNING, %{SEVERITY_LABEL}

It would be great to have:

  • type to be set to gerrit instead of syslog
  • channel to be set to the value of logger_name

I have no idea where that can be tweaked though.

hashar added a comment.Thu, Oct 3, 8:17 AM

It seems the message parsing has been enhanced recently. At least on Sept 23rd I am pretty sure messages were not entirely parsed. Notably the thread name / file are now available.

The exception structure is correctly split into several fields, so that when the Gerrit json has:

gerrit.json
{
...
   "exception" : {
      "exception_class" : "java.util.concurrent.ExecutionException",
      "exception_message" : "org.eclipse.jgit.errors.MissingObjectException: Missing unknown c011f9d17223c4bb08ba472c6544ba12009a49d9",
      "stacktrace" : "java.util.concurrent.ExecutionException: org.eclipse.jgit.errors.MissingObjectException: Missing unknown c011f9d17223c4bb08ba472c6544ba12009a49d9
     <snip for clarity>"
   },

We have:

exception.exception_class    java.util.concurrent.ExecutionException
exception.exception_message  org.eclipse.jgit.errors.MissingObjectException: Missing unknown c011f9d17223c4bb08ba472c6544ba12009a49d9
exception.stacktrace         java.util.concurrent.ExecutionException: org.eclipse.jgit.errors.MissingObjectException: Missing unknown c011f9d17223c4bb08ba472c6544ba12009a49d9 <snip for clarity>

We still need a few things.

The level / severity mapping is faulty:

gerrit.json
{
   "level" : "WARN"
}

Yields:

severity  notice
level     WARNING, %{SEVERITY_LABEL}

And:

  • type to be set to gerrit instead of syslog
  • channel to be set to the value of logger_name

And whenever the json has been parsed and used to populate fields, we should probably drop the msg field, which is just the raw json message :]
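The faulty mapping above suggests the pipeline lacks a translation from log4j level names to syslog severity labels. A sketch of the table that is needed (Python; the exact mapping is an assumption, and the fallback to notice mirrors the behavior observed above):

```python
# Assumed mapping from log4j level names to syslog severity labels;
# unknown levels fall back to "notice", matching what we currently see.
LOG4J_TO_SYSLOG = {
    "TRACE": "debug",
    "DEBUG": "debug",
    "INFO":  "info",
    "WARN":  "warning",
    "ERROR": "err",
    "FATAL": "crit",
}

def syslog_severity(log4j_level):
    return LOG4J_TO_SYSLOG.get(log4j_level.upper(), "notice")
```

With something like this in the pipeline, a gerrit.json record with "level": "WARN" would yield severity warning rather than the literal %{SEVERITY_LABEL} placeholder.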