Page MenuHomePhabricator

Extract metrics from logs
Closed, ResolvedPublic

Description

It should be possible and simple for service owners and operations to extract metrics from logs. From those metrics then it is possible to build dashboard and alerts. For example the number of errors, or bugs, or failures.
One use case we've been working on is gauge how many times the OOM killer has been invoked for thumbor (e.g. https://gerrit.wikimedia.org/r/#/c/315272/).
The tool of choice has been mtail (https://github.com/google/mtail) as it is already packaged for Debian and supposedly efficient on resources. Also interoperability works well since mtail supports graphite/statsd/prometheus out of the box.
We could consider also logstash 'standalone' in the form of 'beats' (https://www.elastic.co/products/beats) or anything similar that does analyze logs and report metrics on found patterns.

Event Timeline

I'll be following up with puppetization for mtail, though this is what I was able to get from wezen syslog central server with mtail as a test: (shown in prometheus format, though the same metrics will be pushed to graphite too)

# TYPE mce_events counter                                                                                  mce_events{prog="kernel.mtail",instance="wezen"} 1108
# TYPE cpu_throttled counter
cpu_throttled{prog="kernel.mtail",instance="wezen"} 9554
# TYPE edac_events counter
# TYPE upstart_respawn counter
upstart_respawn{service="nova-spiceproxy",prog="kernel.mtail",instance="wezen"} 5302
upstart_respawn{service="nova-novncproxy",prog="kernel.mtail",instance="wezen"} 5078
upstart_respawn{service="ganglia-monitor",prog="kernel.mtail",instance="wezen"} 560
# TYPE oom_kill counter
oom_kill{binary="convert",prog="kernel.mtail",instance="wezen"} 91
oom_kill{binary="thumbor",prog="kernel.mtail",instance="wezen"} 12
oom_kill{binary="gs",prog="kernel.mtail",instance="wezen"} 2
oom_kill{binary="rsvg-convert",prog="kernel.mtail",instance="wezen"} 6
# TYPE segfault counter
segfault{binary="rsvg-convert",prog="kernel.mtail",instance="wezen"} 641
segfault{binary="check_sslxNN",prog="kernel.mtail",instance="wezen"} 1
segfault{binary="gs",prog="kernel.mtail",instance="wezen"} 6

With the following mtail program for kernel messages:

# Syslog decorator
def syslog {/(?P<date>(?P<legacy_date>\w+\s+\d+\s+\d+:\d+:\d+)|(?P<rfc3339_date>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d+[+-]\d{2}:\d{2}))/ +
    /\s+(?:\w+@)?(?P<hostname>[\w\.-]+)\s+(?P<application>[\w\.-]+)(?:\[(?P<pid>\d+)\])?:\s+(?P<message>.*)/ {
        len($legacy_date) > 0 {
            strptime($2, "Jan _2 15:04:05")
        }
        len($rfc3339_date) > 0 {
            strptime($rfc3339_date, "2006-01-02T03:04:05-0700")
        }
        next
    }
}

@syslog {
counter mce_events
counter cpu_throttled
counter edac_events
counter upstart_respawn by service
counter oom_kill by binary
counter segfault by binary

/kernel: \[[\d.]+\] mce: .*Machine check events logged/ {
  mce_events++
}

/kernel: \[[\d.]+\] .* above threshold, cpu clock throttled/ {
  cpu_throttled++
}

/kernel: \[[\d.]+\] EDAC .*/ {
  edac_events++
}

/kernel: \[[\d.]+\] init: (?P<service>[\S]+) .*process ended, respawning/ {
  upstart_respawn[$service]++
}

/kernel: \[[\d.]+\] (?P<binary>[\S]+) invoked oom-killer:.*/ {
  oom_kill[$binary]++
}

/kernel: \[[\d.]+\] (?P<binary>[^\[]+)\[\d+\]: segfault at.*/ {
  segfault[$binary]++
}

}

Change 316543 had a related patch set uploaded (by Filippo Giunchedi):
Introduce mtail module

https://gerrit.wikimedia.org/r/316543

Change 316544 had a related patch set uploaded (by Filippo Giunchedi):
centralserver: add mtail for kernel messages

https://gerrit.wikimedia.org/r/316544

elukey triaged this task as Medium priority.Oct 19 2016, 12:35 PM
elukey awarded a token.

Change 316543 merged by Filippo Giunchedi:
Introduce mtail module

https://gerrit.wikimedia.org/r/316543

Change 316544 merged by Filippo Giunchedi:
centralserver: add mtail for kernel messages

https://gerrit.wikimedia.org/r/316544

Change 319138 had a related patch set uploaded (by Filippo Giunchedi):
mtail: introduce systemd unit

https://gerrit.wikimedia.org/r/319138

Change 319138 merged by Filippo Giunchedi:
mtail: introduce systemd unit

https://gerrit.wikimedia.org/r/319138

fgiunchedi changed the task status from Open to Stalled.Nov 1 2016, 10:03 PM

This is now done and metrics from our syslog central servers are being pushed to graphite under mtail hierarchy. I'm stalling this since we're using a patch I've submitted upstream https://github.com/google/mtail/pull/45 but hasn't been accepted/rejected yet.

fgiunchedi changed the task status from Stalled to Open.EditedMar 16 2017, 9:54 AM
fgiunchedi added a project: User-fgiunchedi.

Upstream patch has been merged, though there's another outstanding issue https://github.com/google/mtail/issues/56 which makes mtail crash on parse errors. We've also seen mtail stopping sending metrics after a while, tracked at https://github.com/google/mtail/issues/66

Update: with latest upstream git of mtail things seem stable so far.

Change 447597 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] mtail: gather metrics on systemd respawns

https://gerrit.wikimedia.org/r/447597

Change 447597 merged by Filippo Giunchedi:
[operations/puppet@production] mtail: gather metrics on systemd respawns

https://gerrit.wikimedia.org/r/447597

Change 447598 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] syslog: add systemd.mtail

https://gerrit.wikimedia.org/r/447598

Change 447598 merged by Filippo Giunchedi:
[operations/puppet@production] syslog: add systemd.mtail

https://gerrit.wikimedia.org/r/447598

fgiunchedi claimed this task.

Going to resolve this generic task as we have mtail deployed in production and multiple users