
Look into shoving gerrit logs into logstash
Closed, Resolved · Public

Description

What it says on the tin. Should be doable either by configuring log4j to use a socket appender to talk to logstash directly, or by somehow shipping the json-format logs over to logstash and massaging them in.

Would be nice not only for my own sanity, but also makes it easier for others to debug when things aren't going right.

Event Timeline


Change 377060 had a related patch set uploaded (by Paladox; owner: Paladox):
[operations/puppet@production] Gerrit: Install and configure filebeats

https://gerrit.wikimedia.org/r/377060

We already have a number of solutions to ship logs from log4j to logstash. A number of applications use the gelf input (12201/UDP): elasticsearch, hadoop, ...

A log4j socket input is also available on 4560/TCP (as this is TCP, asynchronicity has to be taken care of on the log4j side, via AsyncAppender).

I'm not very keen on adding plugins to logstash. I just managed to get rid of all the plugins we had (T174933) which greatly simplifies the logstash upgrades.

Side note: I'm surprised that gerrit is still using log4j and has not migrated to logback yet. For logback, the logstash-logback-encoder is available.

@Gehel oh, I wasn’t aware there was gelf4j. I’m wondering could you help with that please?

Looking at https://github.com/realityforge/gelf4j it shows the host name as using graylog

Do we use logstash host name there?

Please.

As hostname, you want to use the LVS service name (logstash.svc.eqiad.wmnet) and the 12201/UDP port. You can have a look at the elasticsearch configuration as an example.
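For reference, a gelf appender stanza in Gerrit's log4j.xml might look roughly like the following. This is an untested sketch in the style of the other appenders in this task: the appender class and parameter names are assumptions to be verified against the gelf4j README, and this was never deployed here.

```xml
<!-- Sketch only: class/param names must be checked against the gelf4j README -->
<appender name="gelf" class="gelf4j.log4j.GelfAppender">
    <param name="host" value="logstash.svc.eqiad.wmnet"/>
    <param name="port" value="12201"/>
</appender>
```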

Change 377060 abandoned by Paladox:
Gerrit: Install and configure filebeats

https://gerrit.wikimedia.org/r/377060

Change 377058 abandoned by Paladox:
Logstash: Add beats support

https://gerrit.wikimedia.org/r/377058

Change 392079 had a related patch set (by Paladox) published:
[operations/puppet@production] Gerrit: Fix up logstash configuration

https://gerrit.wikimedia.org/r/392079

Change 392083 had a related patch set (by Paladox) published:
[operations/puppet@production] Gerrit: Enable logstash for gerrit

https://gerrit.wikimedia.org/r/392083

Mentioned in SAL (#wikimedia-operations) [2017-11-21T22:12:38Z] <mutante> gerrit - temp disable puppet on cobalt (prod gerrit), test switching gerrit logging to logstash on gerrit2001 - gerrit:392079 gerrit:392083 T141324

Change 392079 merged by Dzahn:
[operations/puppet@production] Gerrit: Fix up logstash configuration

https://gerrit.wikimedia.org/r/392079

Change 392083 merged by Dzahn:
[operations/puppet@production] Gerrit: Enable logstash for gerrit

https://gerrit.wikimedia.org/r/392083

Mentioned in SAL (#wikimedia-operations) [2017-11-22T00:40:24Z] <mutante> gerrit - re-enabling puppet to apply logstash change on cobalt, gerrit restart incoming (T141324)

I applied the changes first on gerrit2001 and later on cobalt. There seemed to be no issues. I can confirm on cobalt with tcpdump that there is traffic to logstash.svc now:

00:54:08.411205 IP cobalt.wikimedia.org.50226 > logstash.svc.eqiad.wmnet.4560: ...
00:54:08.411875 IP logstash.svc.eqiad.wmnet.4560 > cobalt.wikimedia.org.50226:
...

That being said, I can't confirm anything on the actual logstash.wikimedia.org web UI default dashboard, or by searching for "type:log4j" or "cobalt"/"gerrit".

@Gehel hi, could you help us figure out why logstash is not showing the logs, please?

@Dzahn did a tcpdump and found outgoing traffic to logstash but the ui is not showing the logs. But it works locally for me on my install.

Thanks to @EBernhardson we figured out our problem: the log4j filter that adds the "es" tag was missing. I've added it in https://gerrit.wikimedia.org/r/#/c/392897/ Once deployed it should start working, and we can then search by host with host:"gerrit"
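For illustration, the missing filter was of this general shape. This is a sketch in the style of the logstash configuration quoted elsewhere in this task, not the exact patch; see the linked change for the real one.

```
filter {
  if [type] == "log4j" {
    mutate {
      # the "es" tag tells the output stage to index the event into Elasticsearch
      add_tag => [ "es" ]
    }
  }
}
```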

https://gerrit.wikimedia.org/r/#/c/392943/ adds a reconnection param to make sure gerrit reconnects to logstash after logstash is restarted.

https://logstash.wikimedia.org/ now shows the first log lines from cobalt :))

Chad started working on a dashboard for it at https://logstash.wikimedia.org/goto/8e6b60e186e4409e3a1b0d02cc96fda3

You can see them by searching for "type:log4j" in kibana.

Dzahn claimed this task.
Dzahn reassigned this task from Dzahn to Paladox.
Dzahn awarded a token.

Change 332531 abandoned by Paladox:
Gerrit: Enable logstash by default for prod gerrit

Reason:
Ah yeh. Thanks.

https://gerrit.wikimedia.org/r/332531

herron subscribed.

Reopening this because I'm not seeing any logs for host:cobalt in logstash currently, and we're in the process of moving log producers over to the new rsyslog -> kafka -> logstash pipeline in T205852

Change 475840 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] rsyslog:input:file add multiline handling and ship gerrit logs to ELK

https://gerrit.wikimedia.org/r/475840


I've noticed that the only way I am able to find gerrit logs in logstash is via the query type:log4j. The host for those is listed as 208.80.154.81 with some high port number (e.g., host:208.80.154.81:44588). It would be nice to get this corrected.


Thanks! I see what you mean. But for some reason I'm still not seeing gerrit error logs in logstash.

We could address the host field and error logs by channeling gerrit logs through rsyslog (and kafka). https://gerrit.wikimedia.org/r/475840 is an approach to do that by tailing the gerrit log files.

One can find the logs by searching for type:log4j.

log4j logs are emitted with org.apache.log4j.net.SocketAppender and sent directly to the Logstash collector. The Logstash configuration uses the log4j input plugin:

input {
  # Log4j socket-based logger
  log4j {
    type => "log4j"
    port => "<%= @port %>"
    mode => "<%= @mode %>"
    host => "<%= @host %>"
<%- unless @plugin_id.empty? -%>
    id   => "<%= @plugin_id %>"
<%- end -%>
  }
}

I felt like it is a good thing to do since log4j and the input plugin have some potentially useful fields such as:

logger_name: org.apache.sshd.server.session.ServerConnectionService
thread: sshd-SshServer[167bae0b]-nio2-thread-3
class: ?
file: ?

Elastic has a blog post about the future of the log4j input plugin, which emphasizes the plugin's deprecation, with removal in Logstash 6.0.

They recommend using filebeat instead (whatever that entails):
https://www.elastic.co/guide/en/logstash/current/plugins-inputs-log4j.html#_deprecation_notice

Essentially that means having log4j configured to write to a log file on disk which is then consumed by filebeat. So I guess we can roll with our own rsyslog/kafka transport.

An interesting bit of config:

log4j.appender.daily.layout = org.apache.log4j.PatternLayout
log4j.appender.daily.layout.ConversionPattern=%d{YYYY-MM-dd HH:mm:ss,SSSZ} %p %c{1}:%L - %m%n

https://logging.apache.org/log4j/2.x/manual/layouts.html#PatternLayout documents the log4j PatternLayout formats. Log4j supports various serialization formats (GELF, enhanced syslog RFC 5424, ...) for which there might already be a built-in Logstash plugin.

We have other stacks using log4j (elasticsearch, confluent, cdh, druid, zookeeper). We most probably want to align all of them to use the same mechanism. The Search Platform team might be able to help on that front since they use java on a daily basis and are familiar with Log4j / Elasticsearch.

Change 475840 merged by Herron:
[operations/puppet@production] rsyslog: input::file add multiline handling & ship gerrit logs to ELK

https://gerrit.wikimedia.org/r/475840

Change 476592 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] rsyslog: input::file add multiline handling & ship gerrit logs to ELK

https://gerrit.wikimedia.org/r/476592

Change 476592 merged by Herron:
[operations/puppet@production] rsyslog: input::file add multiline handling & ship gerrit logs to ELK

https://gerrit.wikimedia.org/r/476592

Gerrit logs are being shipped now via a rsyslog imfile to the kafka -> logstash pipeline. There is some multi-line parsing enabled for these, but if desirable we can further tune the field parsing on these events. FWIW the pre-existing log4j method is still enabled, but when ready we could look at turning that down.

Example logs can be found with a search of host:cobalt. Please have a look at the most recent logs (past hour or so), as parsing/formatting has improved since initially enabling this source via the new pipeline.

I was at an offsite last week and could not verify the logs. I can confirm there are logs with host:cobalt. However, we lose information when processing the logs coming from rsyslog; notably the severity (info, error) is always set to notice.

An example:

[2018-12-12 08:24:12,907] [Index-Batch-1] ERROR com.google.gerrit.index.Schema : error getting field file of ChangeData{Change{415384 (Ic5fa53348577756c0700df717f6b06b9d8de9cde), dest=operations\/puppet,refs\/heads\/production, status=n}}
com.google.gwtorm.server.OrmException: org.eclipse.jgit.errors.MissingObjectException: Missing unknown 1173c3f63c65f33ac0d063cc22c572b5f95d8244

Which in logstash has:

_type: syslog
facility: local0
host: cobalt
level: NOTICE
logsource: cobalt
program: input-file-gerrit-multiline
tags: rsyslog-shipper, kafka, syslog, es, rsyslog_multiline, normalized_message_trimmed
type: syslog

It seems we need a slightly smarter parser to extract the severity. It would also be interesting to extract the Gerrit log bucket (e.g. Index-Batch-1).
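As a rough illustration of what such a parser would do, here is a Python sketch that pulls the severity, thread, and logger out of the PatternLayout-style lines shown above. The regex is an assumption matched against the example line, not the deployed rsyslog/logstash configuration.

```python
import re

# Matches log4j PatternLayout "[%d] [%t] %-5p %c %x: %m" as used by
# Gerrit's error_log (a sketch; real lines may need a looser pattern).
LOG_RE = re.compile(
    r'^\[(?P<timestamp>[^\]]+)\]\s+'   # [2018-12-12 08:24:12,907]
    r'\[(?P<thread>[^\]]+)\]\s+'       # [Index-Batch-1]
    r'(?P<level>[A-Z]+)\s+'            # ERROR (padded by %-5p)
    r'(?P<logger>\S+)\s*:\s*'          # com.google.gerrit.index.Schema :
    r'(?P<message>.*)$'
)

def parse_gerrit_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

line = ('[2018-12-12 08:24:12,907] [Index-Batch-1] ERROR '
        'com.google.gerrit.index.Schema : error getting field file')
event = parse_gerrit_log_line(line)
```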

I also note that some messages emitted by Gerrit have different timestamp formats. The login ones have the timezone:

[2018-12-12 08:25:03,907 +0000] 6285825c JohnDoe a\/1234 LOGOUT
[2018-12-12 08:25:00,954] [e20df2f0] Push to gerrit2@gerrit2001.wikimedia.org...

If I got it right, we are using log files as input via:

# Ship gerrit logs to ELK using startmsg_regex pattern to join multi-line events based on datestamp
# example: [2018-11-28 21:53:07,446]
rsyslog::input::file { 'gerrit-multiline':
    path           => '/var/log/gerrit/*_log',
    startmsg_regex => '^\\\\[[0-9,-\\\\ \\\\:]+\\\\]',
}
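A rough Python equivalent of what that startmsg_regex does (purely illustrative, not the rsyslog implementation): lines that do not start with a datestamp, such as stack-trace continuations, are appended to the previous event.

```python
import re

# A new event starts at a line beginning with a datestamp like
# [2018-11-28 21:53:07,446]; everything else continues the previous event.
START_RE = re.compile(r'^\[[0-9,\- :]+\]')

def join_multiline(lines):
    """Group raw log lines into events, joining continuation lines."""
    events = []
    for line in lines:
        if START_RE.match(line) or not events:
            events.append(line)
        else:
            events[-1] += '\n' + line
    return events

raw = [
    '[2018-12-12 08:24:12,907] [Index-Batch-1] ERROR com.google.gerrit.index.Schema : error',
    'com.google.gwtorm.server.OrmException: ...',
    '[2018-12-12 08:25:00,954] [e20df2f0] Push to gerrit2@gerrit2001.wikimedia.org...',
]
events = join_multiline(raw)
```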

Those files are generated by log4j, and their formats are defined with, e.g.:

modules/gerrit/templates/log4j.xml.erb
<appender name="replication_log" class="org.apache.log4j.DailyRollingFileAppender">
    <param name="encoding" value="UTF-8"/>
    <param name="File" value="/var/log/gerrit/replication_log"/>
    <!--<param name="fileAppend" value="true"/>-->
    <param name="immediateFlush" value="true"/>
    <param name="threshold" value="INFO"/>
    <layout class="org.apache.log4j.PatternLayout">
        <param name="ConversionPattern" value="[%d] [%t] %-5p %x: %m%n"/>
    </layout>
</appender>

But some do not have a value for ConversionPattern, for example the sshd_log file (which emits the LOGOUT message from my previous example):

<appender name="sshd_log" class="org.apache.log4j.DailyRollingFileAppender">
    <param name="encoding" value="UTF-8"/>
    <param name="File" value="/var/log/gerrit/sshd_log"/>
    <!--<param name="fileAppend" value="true"/>-->
    <param name="immediateFlush" value="true"/>
    <param name="threshold" value="INFO"/>
    <layout class="com.google.gerrit.sshd.SshLogLayout"/>
</appender>

But maybe com.google.gerrit.sshd.SshLogLayout is different from org.apache.log4j.PatternLayout.

Would be nice to have them all normalized in a way that is easy to process in the gerrit-multiline input.

The other thing I have noticed is that Gerrit's log4j seems to send root logging to logstash.svc.eqiad.wmnet:4560. I cannot tell:

  • whether it sends everything (catch-all) or is only the fallback appender for events not handled by the other appenders;
  • what exactly is being sent directly to logstash.svc.eqiad.wmnet:4560.

The relevant log4j configuration bits:

/var/lib/gerrit2/review_site/etc/log4j.xml
<log4j:configuration>
    <appender name="error_log" class="org.apache.log4j.DailyRollingFileAppender">
        <param name="encoding" value="UTF-8"/>
        <param name="File" value="/var/log/gerrit/error_log"/>
        <!--<param name="fileAppend" value="true"/>-->
        <param name="immediateFlush" value="true"/>
        <param name="threshold" value="INFO"/>
        <layout class="org.apache.log4j.PatternLayout">
            <param name="ConversionPattern" value="[%d] [%t] %-5p %c %x: %m%n"/>
        </layout>
    </appender>

    <!-- Logstash -->
    <appender name="socket" class="org.apache.log4j.net.SocketAppender">
        <param name="port" value="4560"/>
        <param name="remoteHost" value="logstash.svc.eqiad.wmnet"/>
        <param name="ReconnectionDelay" value="30000"/>
    </appender>
    <appender name="async" class="org.apache.log4j.AsyncAppender">
        <param name="Blocking" value="false"/>
        <param name="BufferSize" value="64"/>
        <appender-ref ref="socket"/>
    </appender>


    <root>
        <level value="INFO"/>
        <appender-ref ref="error_log"/>
        <appender-ref ref="async"/>
    </root>

</log4j:configuration>

@herron did this. We can remove the socket part from log4j now.

Closing as resolved as the logs should be in logstash now.

Oh, I saw hashar's last comments.

I had a conversation with @hashar about this topic. So here are a few ideas:

  • treating log messages as text only will lose a ton of useful information (or require significant effort to re-parse them)
  • structured logging from log4j can be exposed in a number of ways. The easiest is probably to continue logging to file, use syslog for transport, but have messages in a structured format. A JSON layout could be used for that (here is one from JetBrains, which I haven't tested, but I tend to trust JetBrains). This would allow outputting the full logging context and not just a few basic fields.
  • log4j (which is used by gerrit) is an outdated logging framework; better options are log4j2 (the "2" is actually a completely different project, quite far from log4j v1) or logback. But we'll probably wait for upstream to upgrade their logging.

I have a patch upstream to add support for log4j2, but since then upstream has adopted a new logging frontend called flogger, which calls log4j as its backend. log4j2 support will need to be added to flogger so that upstream can use log4j2.

See https://gerrit-review.googlesource.com/c/gerrit/+/142811

  • structured logging from log4j can be exposed in a number of ways. The easiest is probably to continue logging to file, use syslog for transport, but have messages in a structured format. A JSON layout could be used for that (here is one from JetBrains, which I haven't tested, but I tend to trust JetBrains). This would allow outputting the full logging context and not just a few basic fields.

This sounds like a solid option taking into account the ongoing migration towards shipping to logstash via the local rsyslogd, which produces to the Kafka logging pipeline. Can this output json formatted messages to syslog prefixed by an @cee cookie?
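The @cee convention just means prefixing a JSON payload with the literal cookie @cee:, which rsyslog's mmjsonparse module recognizes. A minimal Python sketch of a formatter doing this, illustrative only; Gerrit's log4j side would need an equivalent layout:

```python
import json
import logging

# Sketch (not the deployed config): render records as JSON behind an
# "@cee:" cookie so rsyslog's mmjsonparse can lift the fields out.
class CeeJsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            'level': record.levelname,
            'logger_name': record.name,
            'message': record.getMessage(),
        }
        return '@cee: ' + json.dumps(payload)

formatter = CeeJsonFormatter()
rec = logging.LogRecord('gerrit.sshd', logging.WARNING, __file__, 1,
                        'auth failure for %s', ('jdoe',), None)
line = formatter.format(rec)
```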

mmodell added a parent task: Restricted Task. May 3 2019, 3:34 PM

flogger is google's fluent logger framework for java: https://google.github.io/flogger/

Their docs include Anatomy of a logging API which is a really good read about the benefits of the fluent api which I would imagine could apply to languages even beyond the jvm.


@Paladox points out that gerrit already supports some json log formatting: https://gerrit-review.googlesource.com/Documentation/config-gerrit.html#log.jsonLogging

It appears to use net.logstash.log4j.JSONEventLayoutV1 if that config is set, which seems promising: https://github.com/GerritCodeReview/gerrit/blob/482595dbafe10f78014a79c41c0a656fb6ef0a2d/java/com/google/gerrit/pgm/util/ErrorLogFile.java#L84-L89

Change 508391 had a related patch set uploaded (by Paladox; owner: Paladox):
[operations/puppet@production] Gerrit: Configure logging in json to error_log.json

https://gerrit.wikimedia.org/r/508391

Change 508391 merged by Dzahn:
[operations/puppet@production] Gerrit: Configure logging in json to gerrit.json

https://gerrit.wikimedia.org/r/508391

Deployed the change above and restarted Gerrit. The new file /var/log/gerrit/gerrit.json has been created now and logs are written to it. Works! Thanks for the change, Paladox.


nice :)

@herron I assume there are additional steps that we need to take in puppet to get these json logs to be consumed by rsyslog so they end up as structured logs in logstash?

Change 509172 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] gerrit: only ship gerrit.json and sshd_log to logstash, not *_log

https://gerrit.wikimedia.org/r/509172

Change 509172 merged by Dzahn:
[operations/puppet@production] gerrit: only ship gerrit.json to logstash, not *_log

https://gerrit.wikimedia.org/r/509172

...
Messages all look usable (apart from the msg field, which is the raw json that gets parsed into the individual fields). I do see a lot of "No cached mapping for this field" with a slightly scary-looking ! icon. I'm not clear what I can do about that :\

The tooltip also says to refresh the fields list from Management > Index patterns. I did that, and it brought in a lot of new fields as well as some type conflicts on:

  • request_time
  • response_size
  • upstream_time

Might be related to https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/538082/

There are some oddities such as:

level: WARNING, %{SEVERITY_LABEL}

It would be great to have:

  • type to be set to gerrit instead of syslog
  • channel to be set to the value of logger_name

I have no idea where that can be tweaked though.

It seems the message parsing has been enhanced recently; at least on Sept 23rd I am pretty sure the messages were not entirely parsed. Notably the thread name / file are now available.

The exception structure is correctly split into several fields, so that when the Gerrit json has:

gerrit.json
{
...
   "exception" : {
      "exception_class" : "java.util.concurrent.ExecutionException",
      "exception_message" : "org.eclipse.jgit.errors.MissingObjectException: Missing unknown c011f9d17223c4bb08ba472c6544ba12009a49d9",
      "stacktrace" : "java.util.concurrent.ExecutionException: org.eclipse.jgit.errors.MissingObjectException: Missing unknown c011f9d17223c4bb08ba472c6544ba12009a49d9
     <snip for clarity>"
   },

We have:

exception.exception_class: java.util.concurrent.ExecutionException
exception.exception_message: org.eclipse.jgit.errors.MissingObjectException: Missing unknown c011f9d17223c4bb08ba472c6544ba12009a49d9
exception.stacktrace: java.util.concurrent.ExecutionException: org.eclipse.jgit.errors.MissingObjectException: Missing unknown c011f9d17223c4bb08ba472c6544ba12009a49d9 <snip for clarity>

We still need a few fixes.

The level / severity mapping is faulty:

gerrit.json
{
   "level" : "WARN"
}

Yields:

severity: notice
level: WARNING, %{SEVERITY_LABEL}

And:

  • type to be set to gerrit instead of syslog
  • channel to be set to the value of logger_name

And whenever the json has been parsed and used to populate fields, we should probably drop the msg field, which is the raw json message :]
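Put together, those tweaks would presumably live in the logstash filter stage. A hypothetical sketch in the style of the logstash configuration quoted earlier; the field and program names are taken from the events shown above, not from the deployed config:

```
filter {
  if [program] == "input-file-gerrit-json" {
    # Parse the raw json out of msg, then drop the duplicated raw field
    json   { source => "msg" }
    mutate {
      replace      => { "type" => "gerrit" }
      rename       => { "logger_name" => "channel" }
      remove_field => [ "msg" ]
    }
  }
}
```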

The current state is that logs are being added to logstash. Originally that was done by having log4j write directly to logstash; it then got changed to write to a json file that our logging infrastructure reads.

I have never used the logs in Logstash for the reasons in my previous comment (lack of proper type, no channel, the level field is off).

The json file is generated using a specific log4j layout, net.logstash.log4j.JSONEventLayoutV1, which is not part of Gerrit itself. I can't really find a way to make it available to the JVM when logging is initialized, and currently we need to rebuild gerrit.war to include the extra lib. The intent is to remove that (T268020), given Gerrit has its own json logging format.

My idea is to drop our custom log4j config and instead use Gerrit json logging format. That simplifies the system.

It seems the messages are processed by input-file-gerrit-json, which might need some adjustment to match the Gerrit format (https://gerrit.wikimedia.org/r/Documentation/logs.html). It seems our system will be happy with any json payload.

Change 660030 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] gerrit: drop log4j custom config

https://gerrit.wikimedia.org/r/660030

Change 660030 merged by Dzahn:
[operations/puppet@production] gerrit: drop log4j custom config

https://gerrit.wikimedia.org/r/660030

hashar claimed this task.

This was done twice. The logs have been available in logstash for a while now. I have filed a fresh task to polish up the logging format, since my previous comments got missed and it seems better to have a fresh task that is not encumbered with all the history of this one. Follow-up: T274661

\o/

The remaining issues have been addressed via a sub task. The summary is T274661#6877951

Outcome: we now have a proper Kibana dashboard: https://logstash.wikimedia.org/app/dashboards#/view/AW1f-0k0ZKA


This link has stopped working, but it seems a similar dashboard was re-created at https://logstash.wikimedia.org/app/dashboards#/view/AW1f-0k0ZKA7RpirlnKV. If the link stops working again, search for "Gerrit" in the dashboard list.