It should be simple for service owners and operations to extract metrics from logs, for example the number of errors, bugs, or failures. From those metrics it is then possible to build dashboards and alerts.
One use case we've been working on is gauging how many times the OOM killer has been invoked for thumbor (e.g. https://gerrit.wikimedia.org/r/#/c/315272/).
The tool of choice has been mtail (https://github.com/google/mtail), as it is already packaged for Debian and reportedly light on resources. Interoperability is also good, since mtail supports Graphite/StatsD/Prometheus out of the box.
We could also consider Logstash 'standalone' in the form of 'beats' (https://www.elastic.co/products/beats), or anything similar that analyzes logs and reports metrics on matched patterns.
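The core idea behind all of these tools is the same: scan log lines against a set of patterns and keep running counters that a metrics backend can scrape. A minimal Python sketch of that idea (the patterns here are illustrative stand-ins, not the production ones):

```python
import re
from collections import Counter

# Illustrative patterns only -- in production these live in mtail programs.
PATTERNS = {
    "oom_kill": re.compile(r"invoked oom-killer"),
    "segfault": re.compile(r"segfault at"),
}

def count_metrics(lines):
    """Count how many log lines match each pattern."""
    counts = Counter()
    for line in lines:
        for metric, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[metric] += 1
    return counts
```

A real deployment would tail the log file continuously and export the counters over HTTP or push them to Graphite, which is exactly what mtail handles for us.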
Description
Details
Related Objects
- Mentioned In
- T176293: fundraising postfix metrics into prometheus
Event Timeline
I'll be following up with puppetization for mtail. In the meantime, this is what I was able to get from the wezen syslog central server with mtail as a test (shown in Prometheus format, though the same metrics will be pushed to Graphite too):
# TYPE mce_events counter
mce_events{prog="kernel.mtail",instance="wezen"} 1108
# TYPE cpu_throttled counter
cpu_throttled{prog="kernel.mtail",instance="wezen"} 9554
# TYPE edac_events counter
# TYPE upstart_respawn counter
upstart_respawn{service="nova-spiceproxy",prog="kernel.mtail",instance="wezen"} 5302
upstart_respawn{service="nova-novncproxy",prog="kernel.mtail",instance="wezen"} 5078
upstart_respawn{service="ganglia-monitor",prog="kernel.mtail",instance="wezen"} 560
# TYPE oom_kill counter
oom_kill{binary="convert",prog="kernel.mtail",instance="wezen"} 91
oom_kill{binary="thumbor",prog="kernel.mtail",instance="wezen"} 12
oom_kill{binary="gs",prog="kernel.mtail",instance="wezen"} 2
oom_kill{binary="rsvg-convert",prog="kernel.mtail",instance="wezen"} 6
# TYPE segfault counter
segfault{binary="rsvg-convert",prog="kernel.mtail",instance="wezen"} 641
segfault{binary="check_sslxNN",prog="kernel.mtail",instance="wezen"} 1
segfault{binary="gs",prog="kernel.mtail",instance="wezen"} 6
With the following mtail program for kernel messages:
# Syslog decorator
def syslog {
  /(?P<date>(?P<legacy_date>\w+\s+\d+\s+\d+:\d+:\d+)|(?P<rfc3339_date>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d+[+-]\d{2}:\d{2}))/ +
  /\s+(?:\w+@)?(?P<hostname>[\w\.-]+)\s+(?P<application>[\w\.-]+)(?:\[(?P<pid>\d+)\])?:\s+(?P<message>.*)/ {
    len($legacy_date) > 0 {
      strptime($2, "Jan _2 15:04:05")
    }
    len($rfc3339_date) > 0 {
      strptime($rfc3339_date, "2006-01-02T03:04:05-0700")
    }
    next
  }
}

@syslog {
  counter mce_events
  counter cpu_throttled
  counter edac_events
  counter upstart_respawn by service
  counter oom_kill by binary
  counter segfault by binary

  /kernel: \[[\d.]+\] mce: .*Machine check events logged/ {
    mce_events++
  }
  /kernel: \[[\d.]+\] .* above threshold, cpu clock throttled/ {
    cpu_throttled++
  }
  /kernel: \[[\d.]+\] EDAC .*/ {
    edac_events++
  }
  /kernel: \[[\d.]+\] init: (?P<service>[\S]+) .*process ended, respawning/ {
    upstart_respawn[$service]++
  }
  /kernel: \[[\d.]+\] (?P<binary>[\S]+) invoked oom-killer:.*/ {
    oom_kill[$binary]++
  }
  /kernel: \[[\d.]+\] (?P<binary>[^\[]+)\[\d+\]: segfault at.*/ {
    segfault[$binary]++
  }
}
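Since mtail patterns are ordinary RE2-style regexes, they can be exercised outside mtail. A quick Python check of the oom_kill pattern above, extracting the binary label from a sample kernel line (the sample line is made up for illustration):

```python
import re

# The oom_kill pattern from the mtail program above, as a Python regex.
OOM_RE = re.compile(r"kernel: \[[\d.]+\] (?P<binary>\S+) invoked oom-killer:.*")

def oom_binary(line):
    """Return the binary that invoked the OOM killer, or None if no match."""
    m = OOM_RE.search(line)
    return m.group("binary") if m else None
```

This kind of offline check is handy before deploying a new pattern, since (as noted below) mtail crashes on programs with parse errors.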
Change 316543 had a related patch set uploaded (by Filippo Giunchedi):
Introduce mtail module
Change 316544 had a related patch set uploaded (by Filippo Giunchedi):
centralserver: add mtail for kernel messages
Change 316544 merged by Filippo Giunchedi:
centralserver: add mtail for kernel messages
Change 319138 had a related patch set uploaded (by Filippo Giunchedi):
mtail: introduce systemd unit
This is now done and metrics from our syslog central servers are being pushed to Graphite under the mtail hierarchy. I'm stalling this since we're relying on a patch I've submitted upstream (https://github.com/google/mtail/pull/45) that hasn't been accepted or rejected yet.
The upstream patch has been merged, though there's another outstanding issue (https://github.com/google/mtail/issues/56) which makes mtail crash on parse errors. We've also seen mtail stop sending metrics after a while, tracked at https://github.com/google/mtail/issues/66.
Change 447597 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] mtail: gather metrics on systemd respawns
Change 447597 merged by Filippo Giunchedi:
[operations/puppet@production] mtail: gather metrics on systemd respawns
Change 447598 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] syslog: add systemd.mtail
Change 447598 merged by Filippo Giunchedi:
[operations/puppet@production] syslog: add systemd.mtail
Going to resolve this generic task, as we have mtail deployed in production with multiple users.