Page MenuHomePhabricator

mtail "syscall spam" / high cpu usage on logstash1023
Closed, ResolvedPublic

Description

Noticed mtail 3.0.0~rc35-3+wmf1 using more cpu than expected (i.e. none, logstash wasn't running) on logstash1023:

mtail     9714 57.0  0.4 1147988 37836 ?       Ssl  13:50   2:02 /usr/bin/mtail -progs /etc/mtail -logtostderr -logs /var/log/logstash/logstash-plain.log -port 3903 -disable_fsnotify

A strace reveals a log of newfstatat calls for files that in theory mtail shouldn't look at at all (afaiui)

[pid  9726] newfstatat(AT_FDCWD, "/var/log/logstash/logstash_jvm_gc.16045.log", {st_mode=S_IFREG|0644, st_size=13420, ...}, 0) = 0
[pid  9726] newfstatat(AT_FDCWD, "/var/log/logstash/logstash_jvm_gc.2277.log", {st_mode=S_IFREG|0644, st_size=18636, ...}, 0) = 0
[pid  9726] newfstatat(AT_FDCWD, "/var/log/logstash/logstash_jvm_gc.23919.log", {st_mode=S_IFREG|0644, st_size=13420, ...}, 0) = 0
[pid  9726] newfstatat(AT_FDCWD, "/var/log/logstash/logstash_jvm_gc.24170.log", {st_mode=S_IFREG|0644, st_size=13420, ...}, 0) = 0
[pid  9726] newfstatat(AT_FDCWD, "/var/log/logstash/logstash_jvm_gc.15953.log.6", {st_mode=S_IFREG|0644, st_size=20023, ...}, 0) = 0
[pid  9726] newfstatat(AT_FDCWD, "/var/log/logstash/logstash_jvm_gc.32691.log", {st_mode=S_IFREG|0644, st_size=19987, ...}, 0) = 0
[pid  9726] newfstatat(AT_FDCWD, "/var/log/logstash/logstash_jvm_gc.18967.log", {st_mode=S_IFREG|0644, st_size=10691, ...}, 0) = 0
[pid  9726] newfstatat(AT_FDCWD, "/var/log/logstash/logstash_jvm_gc.15127.log", {st_mode=S_IFREG|0644, st_size=9561, ...}, 0) = 0
[pid  9726] newfstatat(AT_FDCWD, "/var/log/logstash/logstash_jvm_gc.13193.log", {st_mode=S_IFREG|0644, st_size=19577, ...}, 0) = 0
[pid  9726] newfstatat(AT_FDCWD, "/var/log/logstash/logstash_jvm_gc.24227.log", {st_mode=S_IFREG|0644, st_size=9561, ...}, 0) = 0
[pid  9726] newfstatat(AT_FDCWD, "/var/log/logstash/logstash_jvm_gc.31937.log.1", {st_mode=S_IFREG|0644, st_size=10477, ...}, 0) = 0
[pid  9726] newfstatat(AT_FDCWD, "/var/log/logstash/logstash_jvm_gc.5066.log", {st_mode=S_IFREG|0644, st_size=13425, ...}, 0) = 0
[pid  9726] newfstatat(AT_FDCWD, "/var/log/logstash/logstash_jvm_gc.11117.log", {st_mode=S_IFREG|0644, st_size=18593, ...}, 0) = 0

Event Timeline

I cleared /var/log/logstash of old gc logs because I'm running performance tests there, although the same behavior can be observed on logstash1024

Submitted a bug upstream.

I am also working on a patch while we wait to hear from upstream.

By raising the log verbosity with -v=2 the stat calls are shown in the log. I gave it a try with mtail 3.0.0~rc35-3+wmf1 shipped in docker-registry.wikimedia.org/releng/operations-puppet. Once in the container:

mkdir -p /var/log/logstash
touch /var/log/logstash/{main,1,2,3,4,5}.log
mtail -logtostderr -logs /var/log/logstash/main.log -disable_fsnotify -progs /etc/mtail -v=2
I0620 13:31:00.633711     103 main.go:102] mtail version 3.0.0~rc35 git revision 3.0.0~rc35-3+wmf1 go version go1.11.6 go arch amd64 go os linux
I0620 13:31:00.633926     103 log_watcher.go:57] fsnotify disabled and no poll interval specified; defaulting to 250ms poll
# It polls every 250ms

I0620 13:31:00.636499     103 mtail.go:99] Tail pattern "/var/log/logstash/main.log"
I0620 13:31:00.636546     103 tail.go:134] AddPattern: /var/log/logstash/main.log
# Watch main.log as passed with -logs

I0620 13:31:00.636592     103 log_watcher.go:269] Adding a watch on resolved path "/var/log/logstash"
I0620 13:31:00.636630     103 log_watcher.go:249] No abspath in watched list, added new one for /var/log/logstash
# Watch the parent directory as well bah

I0620 13:31:00.885297     103 log_watcher.go:143] Update fi
I0620 13:31:00.885328     103 log_watcher.go:128] Stat "/var/log/logstash"
I0620 13:31:00.885836     103 log_watcher.go:170] sending create for /var/log/logstash/1.log
I0620 13:31:00.885898     103 tail.go:234] No file handle found for "/var/log/logstash/1.log", but is being watched
I0620 13:31:00.885936     103 tail.go:321] "/var/log/logstash/1.log" did not match pattern "/var/log/logstash/main.log"
I0620 13:31:00.885963     103 tail.go:341] did not start tailing "/var/log/logstash/1.log"
I0620 13:31:00.885991     103 tail.go:244] Internal error finding file handle for "/var/log/logstash/1.log" after create

So it looks like it watch the holding directory in case -logs is a pattern and has to match a file in there (eg one could use: -logs /var/log/logstash/*.log, so it gotta look at the list of files in the directory).

Change 607144 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/debs/mtail@cross_dist_build] Filter the files watched for modifications to only those configured by -logs and -progs.

https://gerrit.wikimedia.org/r/607144

colewhite triaged this task as Medium priority.Jun 23 2020, 10:20 PM

Change 607144 merged by Cwhite:
[operations/debs/mtail@cross_dist_build] Filter the files watched for modifications to only those configured by -logs and -progs.

https://gerrit.wikimedia.org/r/607144

Mentioned in SAL (#wikimedia-operations) [2020-06-24T21:45:28Z] <shdubsh> install mtail 3.0.0~rc35+wmf2 on logstash1007 - T255776

Change 607630 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] mtail: ensure package present and set logstash1007 mtail::from_component

https://gerrit.wikimedia.org/r/607630

Change 607630 merged by Cwhite:
[operations/puppet@production] mtail: set logstash1007 to install mtail from_component

https://gerrit.wikimedia.org/r/607630

After installing +wmf2 on logstash1007 CPU usage appears to max around 1.3% as opposed to around 4% on +wmf1.

Mentioned in SAL (#wikimedia-operations) [2020-06-29T18:55:33Z] <shdubsh> test mtail rc35+wmf2 on cp5001 - T255776

Change 608450 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: install mtail from component in codfw and eqsin

https://gerrit.wikimedia.org/r/c/operations/puppet/ /608450

Change 608450 merged by Cwhite:
[operations/puppet@production] hiera: install mtail from component in codfw and eqsin

https://gerrit.wikimedia.org/r/c/operations/puppet/ /608450

Change 608721 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] mtail: remove component and upgrade mtail to 3.0.0-rc35-3~wmf2 across the fleet

https://gerrit.wikimedia.org/r/c/operations/puppet/ /608721

Change 608721 merged by Cwhite:
[operations/puppet@production] mtail: remove component and upgrade mtail to 3.0.0-rc35-3~wmf2 across the fleet

https://gerrit.wikimedia.org/r/608721

+wmf2 has been deployed

Hi @dancy!

I'm not inclined to upstream the patch. The patch is a terrible, terrible hack that happens to fit our use case(s). It is very likely they would not want it as-is (it adds a dependency) and it might break the file rotation handling feature in subtle ways.

We're using Mtail in a way that bypasses the potential file rotation issues.