Page MenuHomePhabricator

fastnetmon spamming /var/log on netflow hosts leading to disk saturation
Closed, ResolvedPublic

Description

After following up on netflow2001's disk usage alert I discovered that logs are spammed by the following (over and over, even on netflow1001):

Dec 13 08:26:31 netflow2001 fastnetmon: 1576225591 INFO  : Received ipfix options flowset id, which is not supported
Dec 13 08:26:31 netflow2001 fastnetmon: 1576225591 INFO  : We don't have a template for flowset_id: 513 but it's not an error if this message disappears in 5-10 seconds. We need some time to learn it!

From a quick search I found https://github.com/pavel-odintsov/fastnetmon/issues/618, not sure if related or not (talks about sampling that we do use, so possibly).

One issue is that fastnetmon's service unit doesn't have any SyslogIdentifier field, so all logs go to /var/log/user.log/messages/syslog/etc.. A spam amplifier basically :)

Event Timeline

This issue is fixed in Fastnetmon 1.1.4: https://github.com/pavel-odintsov/fastnetmon/releases/tag/v1.1.4

Suppressed excessive logging about missing IPFIX options

Unfortunately only 1.1.3 is in Debian, so we either need to package 1.1.4 (which might not be too difficult: https://tracker.debian.org/pkg/fastnetmon) or hide the specific log line.

About the SyslogIdentifier I can send a PR to add it to: https://github.com/pavel-odintsov/fastnetmon/blob/master/src/scripts/build_any_package.pl#L412 and "temporarily" have Puppet manage the systemd file.
As I'm not that familiar with Syslog, let me know if it's worth doing it.

@ayounsi how prioritary would you say this ticket is worth? Spam is annoying but shouldn't have high- however the disk saturation could be dangerous (I don't have all context to be able to decide).

ayounsi triaged this task as Medium priority.Dec 28 2019, 9:15 AM

Does logroate deletes logs older than a certain age? If not that might be a good option, in addition to my comment above for a more permanent fix.

I'm not able to reproduce though, so might be able to wait until after the break.

ayounsi@netflow1001:~$ ls -alh /var/log/fastnet*
-rw-rw-rw- 1 root root  14M Dec 28 09:09 /var/log/fastnetmon.log
-rw-rw-rw- 1 root root  36M Dec 28 00:00 /var/log/fastnetmon.log.1
-rw-rw-rw- 1 root root 697K Dec 19 00:00 /var/log/fastnetmon.log.10.gz
-rw-rw-rw- 1 root root 694K Dec 18 00:00 /var/log/fastnetmon.log.11.gz
-rw-rw-rw- 1 root root 702K Dec 17 00:00 /var/log/fastnetmon.log.12.gz
-rw-rw-rw- 1 root root 704K Dec 16 00:00 /var/log/fastnetmon.log.13.gz
-rw-rw-rw- 1 root root 700K Dec 15 00:00 /var/log/fastnetmon.log.14.gz
-rw-rw-rw- 1 root root 703K Dec 14 00:00 /var/log/fastnetmon.log.15.gz
-rw-rw-rw- 1 root root 697K Dec 27 00:00 /var/log/fastnetmon.log.2.gz
-rw-rw-rw- 1 root root 698K Dec 26 00:00 /var/log/fastnetmon.log.3.gz
-rw-rw-rw- 1 root root 699K Dec 25 00:00 /var/log/fastnetmon.log.4.gz
-rw-rw-rw- 1 root root 700K Dec 24 00:00 /var/log/fastnetmon.log.5.gz
-rw-rw-rw- 1 root root 700K Dec 23 00:00 /var/log/fastnetmon.log.6.gz
-rw-rw-rw- 1 root root 699K Dec 22 00:00 /var/log/fastnetmon.log.7.gz
-rw-rw-rw- 1 root root 700K Dec 21 00:00 /var/log/fastnetmon.log.8.gz
-rw-rw-rw- 1 root root 697K Dec 20 00:00 /var/log/fastnetmon.log.9.gz
ayounsi@netflow1001:~$ ls -alh /var/log/user*
-rw-r----- 1 root adm 272M Dec 28 09:11 /var/log/user.log
-rw-r----- 1 root adm 299M Dec 22 00:00 /var/log/user.log.1
-rw-r----- 1 root adm  22M Dec 15 00:00 /var/log/user.log.2.gz
-rw-r----- 1 root adm  28M Dec  8 00:00 /var/log/user.log.3.gz
-rw-r----- 1 root adm 3.6M Dec  1 00:00 /var/log/user.log.4.gz

Note that Fastnetmon 1.1.4 is in Debian 11. Back-porting it might be easy.

Mentioned in SAL (#wikimedia-operations) [2020-04-02T16:19:28Z] <XioNoX> upgrade netflow4001's fastnetmon to 1.1.4 - T240658

Thanks to Moritz I backported (locally only for now) FNM 1.1.4 and installed it on netflow4001.
Unpacking fastnetmon (1.1.4-1~deb10u1) over (1.1.3+dfsg-8.1) ...
but

$ /usr/sbin/fastnetmon --version
Version: 1.1.3 master git-

However the log Received ipfix options flowset id, which is not supported is not showing up anymore.
But:
fastnetmon: [INFO] We don't have a template for flowset_id: 513 but it's not an error if this message disappears in 5-10 seconds. We need some time to learn it! is still there.

So I think it did get upgraded as expected.

Since this task was created, another soul is having the same issue, see https://github.com/pavel-odintsov/fastnetmon/issues/792
for 1.1.4 as well.
So the plan is to finish the upgrade and monitor the task.

Is there something else we can do to filter out that message?

Mentioned in SAL (#wikimedia-operations) [2020-04-03T06:55:05Z] <XioNoX> add fastnetmon 1.1.4 to buster-wikimedia - T240658

ayounsi changed the task status from Open to Stalled.Apr 3 2020, 7:05 AM
ayounsi lowered the priority of this task from Medium to Low.

All netflow hosts are now running FNM 1.1.4.

Now waiting for upstream.

Change 609426 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/homer/public@master] Netflow: send as little options templates as possible

https://gerrit.wikimedia.org/r/609426

Mentioned in SAL (#wikimedia-sre) [2020-07-08T12:17:37Z] <xionox-tmp> rollout less frequent option-refresh-rate - T240658

Change 609426 merged by jenkins-bot:
[operations/homer/public@master] Netflow: send as little options templates as possible

https://gerrit.wikimedia.org/r/609426

ayounsi claimed this task.

Confirmed that the logs are now quieter, from ~42/min to ~4/10min right now in esams.

Which I think is good enough as there is not much more that can be done. Please re-open if you think otherwise.