Page MenuHomePhabricator

Make Thumbor logs available in ELK
Closed, ResolvedPublic

Description

Maybe not the very chatty 404 log, but at least the error log.

Event Timeline

@bd808 any tips or examples on how to set that up for a new service?

Generally you need to look at the logging stack that exists for the app and figure out how to ship a copy of the logs to a logstash ingestion node. For a Python app you might want to look at Striker's use of python-logstash. Once the logs are getting sent over to logstash you may need to adjust filters on the particular input channel you use to mutate the log events and tag them for saving in the Elasticsearch backend.

This all looks very straightforward. I just need to make a Debian package for python-logstash first, since there isn't one at the moment.

Debian package ready at https://github.com/gi11es/thumbor-debian/tree/master/python-logstash @fgiunchedi please review and add to our production apt repos when you can.

"Likely" wasn't so likely, anyways I've uploaded the package to our jessie-wikimedia repo!

@Gilles FYI I've imported the repo into operations/debs/python-logstash internally

Change 364954 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/vagrant@master] Send Thumbor logs to logstash

https://gerrit.wikimedia.org/r/364954

Change 364954 merged by jenkins-bot:
[mediawiki/vagrant@master] Send Thumbor logs to logstash

https://gerrit.wikimedia.org/r/364954

Change 365619 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Send Thumbor error log to logstash

https://gerrit.wikimedia.org/r/365619

Change 372163 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/vagrant@master] Filter Thumbor logstash entries by program, not type

https://gerrit.wikimedia.org/r/372163

Change 372163 merged by jenkins-bot:
[mediawiki/vagrant@master] Filter Thumbor logstash entries by program, not type

https://gerrit.wikimedia.org/r/372163

Change 375859 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Send Thumbor error log to logstash

https://gerrit.wikimedia.org/r/375859

Change 365619 merged by Filippo Giunchedi:
[operations/puppet@production] Add logback filter for Thumbor

https://gerrit.wikimedia.org/r/365619

Change 375859 merged by Filippo Giunchedi:
[operations/puppet@production] Send Thumbor error log to logstash

https://gerrit.wikimedia.org/r/375859

All patches merged, though this isn't going to work yet:

  • The sent lines might exceed MTU and are not natively split or trimmed
  • The format of lines sent by python-logstash is json for udp and "json_lines" (as logstash calls it) though the logstash udp endpoint we're sending to expects syslog format

I think the endpoint is expecting the right format:

logstash::input::udp { 'logback':
    port  => 11514,
    codec => 'json',
}

Indeed the logstash configuration is right, I misread 11514 and 10514. The udp packets are making it to e.g. logstash1003 fine but for some reason are not showing up in kibana. I can't find anything in logstash logs themselves on the filesystem

Change 375993 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/vagrant@master] Remove unneeded Thumbor logstash filter

https://gerrit.wikimedia.org/r/375993

Change 375993 merged by jenkins-bot:
[mediawiki/vagrant@master] Remove unneeded Thumbor logstash filter

https://gerrit.wikimedia.org/r/375993

Indeed, even after removing the filter the situation didn't change. I can see packets e.g. from thumbor1001 making it to logstash1003 but nothing in kibana and no errors logged by logstash either. Maybe @Gehel you've come across this before?

Usually logstash will log a save failure if Elasticsearch is rejecting the events for some reason, but it might be worth double checking the Elasticsearch logs on logstash100[456] for errors about conflicting data types in the mappings (T150106).

Thanks @bd808 ! Sadly I can't find any smoking gun on /var/log/elasticsearch/production-logstash-eqiad.log about indexing problems and conflicting data types which would have solved the mystery :(

Change 377220 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thumbor: add logstash config for codfw

https://gerrit.wikimedia.org/r/377220

Change 377220 merged by Filippo Giunchedi:
[operations/puppet@production] thumbor: add logstash config for codfw

https://gerrit.wikimedia.org/r/377220

Change 377731 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thumbor: send errors to logstash in beta too

https://gerrit.wikimedia.org/r/377731

Change 377731 merged by Filippo Giunchedi:
[operations/puppet@production] thumbor: send errors to logstash in beta too

https://gerrit.wikimedia.org/r/377731

I poked at this some more this week but go nowhere either in beta or production, to recap:

  • udp messages make it from thumbor machines to logstash frontends
  • said messages are not terminated with newlines and thus to be used with codec json, not json_lines and match logstash configuration for 11514/udp
  • the thumbor messages never make it to kibana, similar messages (e.g. wdqs) do however
  • I couldn't find any errors from elasticsearch indicating a mapping mismatch (or from logstash for that matter)

Change 380942 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Thumbor: fix logstash message type

https://gerrit.wikimedia.org/r/380942

Change 380943 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Thumbor: don’t rewrite host value in logstash messages

https://gerrit.wikimedia.org/r/380943

Change 380942 merged by Filippo Giunchedi:
[operations/puppet@production] Thumbor: fix logstash message type

https://gerrit.wikimedia.org/r/380942

@fgiunchedi it's logstash that needs to be restarted, the change is on the consuming side

Mentioned in SAL (#wikimedia-operations) [2017-10-12T13:12:59Z] <moritzm> rolling restart of logstash to pick up thumbor logs (T150734)

https://gerrit.wikimedia.org/r/#/c/380943/ has actually not been merged. I have not followed all the details, but since the logs are flowing, should we abandon this second change?

The logs are flowing, but the host information is missing. It needs to be merged and deployed.

Change 380943 merged by Filippo Giunchedi:
[operations/puppet@production] Thumbor: don't rewrite host value in logstash messages

https://gerrit.wikimedia.org/r/380943

Last change merged and deployed, thumbor hostname is now in logstash \o/