It should be simple for service owners and operations to extract metrics from logs, for example the number of errors, bugs, or failures. From those metrics it is then possible to build dashboards and alerts.
One use case we've been working on is gauging how many times the OOM killer has been invoked for thumbor (e.g. https://gerrit.wikimedia.org/r/#/c/315272/).
The tool of choice has been mtail (https://github.com/google/mtail), as it is already packaged for Debian and reportedly light on resources. Interoperability is also good, since mtail supports Graphite/StatsD/Prometheus out of the box.
We could also consider Logstash 'standalone' in the form of 'beats' (https://www.elastic.co/products/beats), or anything similar that analyzes logs and reports metrics on matched patterns.
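The core idea behind all of these tools is the same: scan log lines against a set of patterns and keep running counters that a metrics backend can scrape. A minimal Python sketch of that idea (the patterns here are illustrative stand-ins, not the production ones):

```python
import re
from collections import Counter

# Illustrative patterns only -- in production these live in mtail programs.
PATTERNS = {
    "oom_kill": re.compile(r"invoked oom-killer"),
    "segfault": re.compile(r"segfault at"),
}

def count_metrics(lines):
    """Count how many log lines match each pattern."""
    counts = Counter()
    for line in lines:
        for metric, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[metric] += 1
    return counts
```

A real deployment would tail the log file continuously and export the counters over HTTP or push them to Graphite, which is exactly what mtail handles for us.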
Description
Details
Related Objects
- Mentioned In
- T176293: fundraising postfix metrics into prometheus
Event Timeline
I'll be following up with puppetization for mtail. In the meantime, this is what I was able to get from the wezen syslog central server with mtail as a test (shown in Prometheus format, though the same metrics will be pushed to Graphite too):
# TYPE mce_events counter
mce_events{prog="kernel.mtail",instance="wezen"} 1108
# TYPE cpu_throttled counter
cpu_throttled{prog="kernel.mtail",instance="wezen"} 9554
# TYPE edac_events counter
# TYPE upstart_respawn counter
upstart_respawn{service="nova-spiceproxy",prog="kernel.mtail",instance="wezen"} 5302
upstart_respawn{service="nova-novncproxy",prog="kernel.mtail",instance="wezen"} 5078
upstart_respawn{service="ganglia-monitor",prog="kernel.mtail",instance="wezen"} 560
# TYPE oom_kill counter
oom_kill{binary="convert",prog="kernel.mtail",instance="wezen"} 91
oom_kill{binary="thumbor",prog="kernel.mtail",instance="wezen"} 12
oom_kill{binary="gs",prog="kernel.mtail",instance="wezen"} 2
oom_kill{binary="rsvg-convert",prog="kernel.mtail",instance="wezen"} 6
# TYPE segfault counter
segfault{binary="rsvg-convert",prog="kernel.mtail",instance="wezen"} 641
segfault{binary="check_sslxNN",prog="kernel.mtail",instance="wezen"} 1
segfault{binary="gs",prog="kernel.mtail",instance="wezen"} 6
With the following mtail program for kernel messages:
# Syslog decorator
def syslog {
  /(?P<date>(?P<legacy_date>\w+\s+\d+\s+\d+:\d+:\d+)|(?P<rfc3339_date>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d+[+-]\d{2}:\d{2}))/ +
  /\s+(?:\w+@)?(?P<hostname>[\w\.-]+)\s+(?P<application>[\w\.-]+)(?:\[(?P<pid>\d+)\])?:\s+(?P<message>.*)/ {
    len($legacy_date) > 0 {
      strptime($2, "Jan _2 15:04:05")
    }
    len($rfc3339_date) > 0 {
      strptime($rfc3339_date, "2006-01-02T03:04:05-0700")
    }
    next
  }
}

@syslog {
  counter mce_events
  counter cpu_throttled
  counter edac_events
  counter upstart_respawn by service
  counter oom_kill by binary
  counter segfault by binary

  /kernel: \[[\d.]+\] mce: .*Machine check events logged/ {
    mce_events++
  }
  /kernel: \[[\d.]+\] .* above threshold, cpu clock throttled/ {
    cpu_throttled++
  }
  /kernel: \[[\d.]+\] EDAC .*/ {
    edac_events++
  }
  /kernel: \[[\d.]+\] init: (?P<service>[\S]+) .*process ended, respawning/ {
    upstart_respawn[$service]++
  }
  /kernel: \[[\d.]+\] (?P<binary>[\S]+) invoked oom-killer:.*/ {
    oom_kill[$binary]++
  }
  /kernel: \[[\d.]+\] (?P<binary>[^\[]+)\[\d+\]: segfault at.*/ {
    segfault[$binary]++
  }
}
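Since mtail patterns are ordinary RE2-style regexes, they can be exercised outside mtail. A quick Python check of the oom_kill pattern above, extracting the binary label from a sample kernel line (the sample line is made up for illustration):

```python
import re

# The oom_kill pattern from the mtail program above, as a Python regex.
OOM_RE = re.compile(r"kernel: \[[\d.]+\] (?P<binary>\S+) invoked oom-killer:.*")

def oom_binary(line):
    """Return the binary that invoked the OOM killer, or None if no match."""
    m = OOM_RE.search(line)
    return m.group("binary") if m else None
```

This kind of offline check is handy before deploying a new pattern, since (as noted below) mtail crashes on programs with parse errors.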
Change 316543 had a related patch set uploaded (by Filippo Giunchedi):
Introduce mtail module
Change 316544 had a related patch set uploaded (by Filippo Giunchedi):
centralserver: add mtail for kernel messages
Change 316544 merged by Filippo Giunchedi:
centralserver: add mtail for kernel messages
Change 319138 had a related patch set uploaded (by Filippo Giunchedi):
mtail: introduce systemd unit
This is now done and metrics from our syslog central servers are being pushed to Graphite under the mtail hierarchy. I'm stalling this since we're relying on a patch I've submitted upstream (https://github.com/google/mtail/pull/45) that hasn't been accepted or rejected yet.
The upstream patch has been merged, though there's another outstanding issue (https://github.com/google/mtail/issues/56) which makes mtail crash on parse errors. We've also seen mtail stop sending metrics after a while, tracked at https://github.com/google/mtail/issues/66.
Change 447597 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] mtail: gather metrics on systemd respawns
Change 447597 merged by Filippo Giunchedi:
[operations/puppet@production] mtail: gather metrics on systemd respawns
Change 447598 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] syslog: add systemd.mtail
Change 447598 merged by Filippo Giunchedi:
[operations/puppet@production] syslog: add systemd.mtail
Going to resolve this generic task, as we have mtail deployed in production with multiple users.