Page MenuHomePhabricator

log spam from mtail 3.0.0~rc19 on wezen
Open, NormalPublic

Description

It looks like mtail is spamming wezen logs in /var/log to the point of filling its / filesystem:

Jun 12 10:44:42 wezen mtail[20297]: I0612 10:44:42.587135   20297 file.go:168] file type does not support deadline

This is likely due to the latest upgrade of mtail to workaround another issue @jbond found. For this issue @ArielGlenn spotted the relevant issue/commit is https://github.com/google/mtail/issues/206 https://github.com/google/mtail/commit/b73b1836fa568a6d5b40029dd9c88b02078df1db

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJun 12 2019, 10:46 AM

Fixed versions for mtail: v3.0.0-rc30 v3.0.0-rc29 v3.0.0-rc28 v3.0.0-rc27 v3.0.0-rc26 v3.0.0-rc25 v3.0.0-rc24 v3.0.0-rc23 v3.0.0-rc22 v3.0.0-rc21 so upgrading to unstable version of mtail would do it (3.0.0~rc24.1-1)

fgiunchedi updated the task description. (Show Details)Jun 12 2019, 10:48 AM

Mentioned in SAL (#wikimedia-operations) [2019-06-12T10:52:51Z] <godog> force-upgrade mtail to 3.0.0~rc24.1-1 on wezen - T225604

Ok I've temporarily fixed the problem by installing a locally modified version of mtail from unstable. Good enough for now IMHO, the bigger plan here is to upgrade central syslog servers to Stretch or Buster (T200706: rack/setup/install centrallog1001.eqiad.wmnet and T224564: Reimage wezen to Stretch (and rename to centrallog2001)). Leaving open because we'll need the unstable version of mtail anyways even on the reimaged servers.

Mentioned in SAL (#wikimedia-operations) [2019-06-17T12:37:02Z] <jbond42> upgrade mtail on lithium - T225604

I have added mtail_3.0.0~rc24.1-1+wmf1_amd64.deb to the jessie-wikimedia backports repository and upgraded lithium, as well.

Mentioned in SAL (#wikimedia-operations) [2019-07-08T12:52:13Z] <godog> copy mtail to buster-wikimedia - T225604

fgiunchedi moved this task from Backlog to Radar on the observability board.Jul 8 2019, 1:08 PM
Dzahn added a subscriber: Dzahn.EditedJul 10 2019, 10:35 PM

on cp4031 the varnish-mtail service fails to start with:

17:32 < mutante> cp4031 varnishmtail-backend[12858]: flag provided but not defined: -logfds

This seems to be due to the latest update. Unsure why it's only on 1 ulsfo host and not on all though.

We might lose metrics if this gets restarted on all cp4 hosts.

Mentioned in SAL (#wikimedia-operations) [2019-07-10T22:46:39Z] <shdubsh> downgrading cp4031 to mtail_3.0.0~rc5-1~bpo9+1wmf1 to fix varnishmtail T225604

@jbond there appears to be a missing feature around the -logfds flag in mtail rc24. This causes varnishmtail and varnishmtail-backend to fail to start because the -logfds flag is missing. Downgrading back to 3.0.0~rc5-1~bpo9+1wmf1 restores the flag and varnishmtail and varnishmtail-backend worked again.

I downgraded cp4031 manually because it was not exporting metrics, but other nodes are apt upgraded but varnishmtail/varnishmtail-backend haven't been restarted and are still running rc5. If these services were to restart, the problem will surface again.

This could be high-impact because varnishmtail is where we get a lot of varnish cache metrics.

Mentioned in SAL (#wikimedia-operations) [2019-07-11T08:14:11Z] <ema> cp-ulsfo: downgrade mtail to 3.0.0~rc5-1~bpo9+1 to fix varnishmtail-backend T225604

ema triaged this task as Normal priority.Jul 11 2019, 8:42 AM

Mentioned in SAL (#wikimedia-operations) [2019-07-11T08:51:15Z] <godog> upload mtail 3.0.0~rc5-1~bpo9+1wmf1 to stretch-wikimedia - T225604

For the time being I've uploaded rc5 (same version as stretch-backports) to stretch-wikimedia too, since we know it works with -logfd until we have a working solution for varnish.

buster-wikimedia doesn't have mtail internally, thus it'll get 3.0.0~rc19-2 from debian, while jessie-wikimedia has 3.0.0~rc24.1-1+wmf1 although it should affect only hosts that don't use -logfd anyways.

jbond added a comment.Jul 11 2019, 2:58 PM

It seems this flag was removed from upstream just over a year ago. i have pinged fgiunchedi original issue requesting this feature to ask if there is more context around why it got removed.

Change 523739 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] varnishmtail: use -logs /dev/stdin instead of -logfds 0

https://gerrit.wikimedia.org/r/523739

Change 523739 merged by Jbond:
[operations/puppet@production] varnishmtail: use -logs /dev/stdin instead of -logfds 0

https://gerrit.wikimedia.org/r/523739

I have now pushed a patch so so that mtail uses -logs /dev/stdin instead of -logfds 0. would you like me to upgrade everything to mtail_3.0.0~rc24.1-1+wmf1_amd64.deb again?

I have now pushed a patch so so that mtail uses -logs /dev/stdin instead of -logfds 0. would you like me to upgrade everything to mtail_3.0.0~rc24.1-1+wmf1_amd64.deb again?

Yes please, thank you! Should be good to go now.

jbond added a comment.Jul 18 2019, 1:32 PM

Unfortunately this had to be downgraded again. We saw the following error with the new version

Jul 18 13:16:02 cp3030 varnishmtail[31151]: panic: runtime error: index out of range
Jul 18 13:16:02 cp3030 varnishmtail[31151]: goroutine 39 [running]:
Jul 18 13:16:02 cp3030 varnishmtail[31151]: bytes.(*Buffer).WriteByte(0xc0002ef6c0, 0xc000550178, 0x0, 0x0)
Jul 18 13:16:02 cp3030 varnishmtail[31151]:         /usr/lib/go-1.11/src/bytes/buffer.go:271 +0xa8
Jul 18 13:16:02 cp3030 varnishmtail[31151]: bytes.(*Buffer).WriteRune(0xc0002ef6c0, 0x78, 0xeda, 0x78, 0x1)
Jul 18 13:16:02 cp3030 varnishmtail[31151]:         /usr/lib/go-1.11/src/bytes/buffer.go:281 +0x14c
Jul 18 13:16:02 cp3030 varnishmtail[31151]: github.com/google/mtail/internal/tailer.(*File).Read(0xc0000cdce0, 0xc0000aedd8, 0x1)
Jul 18 13:16:02 cp3030 varnishmtail[31151]:         github.com/google/mtail/internal/tailer/file.go:200 +0xf9
Jul 18 13:16:02 cp3030 varnishmtail[31151]: github.com/google/mtail/internal/tailer.(*File).Follow(0xc0000cdce0, 0xc0002aee98, 0xc0002aee98)
Jul 18 13:16:02 cp3030 varnishmtail[31151]:         github.com/google/mtail/internal/tailer/file.go:144 +0x2e3
Jul 18 13:16:02 cp3030 varnishmtail[31151]: github.com/google/mtail/internal/tailer.doFollow(0xc0000cdce0)
Jul 18 13:16:02 cp3030 varnishmtail[31151]:         github.com/google/mtail/internal/tailer/tail.go:204 +0x2f
Jul 18 13:16:02 cp3030 varnishmtail[31151]: github.com/google/mtail/internal/tailer.(*Tailer).handleLogEvent(0xc0002ef650, 0x7fffdbe43f21, 0xa)
Jul 18 13:16:02 cp3030 varnishmtail[31151]:         github.com/google/mtail/internal/tailer/tail.go:199 +0x10a
Jul 18 13:16:02 cp3030 varnishmtail[31151]: github.com/google/mtail/internal/tailer.(*Tailer).run(0xc0002ef650, 0xc0001caba0)
Jul 18 13:16:02 cp3030 varnishmtail[31151]:         github.com/google/mtail/internal/tailer/tail.go:285 +0x177
Jul 18 13:16:02 cp3030 varnishmtail[31151]: created by github.com/google/mtail/internal/tailer.New
Jul 18 13:16:02 cp3030 varnishmtail[31151]:         github.com/google/mtail/internal/tailer/tail.go:80 +0x193
jbond added a comment.EditedJul 18 2019, 2:21 PM

This error is triggered slower on some machines vs others

  • cp2026 (cache::upload): takes a bit longer to die ~ 5 minutes
  • cp1085 (cache::text): dies instantly
  • cp1075 (cache::text): dies instantly
  • cp1076 (cache::upload): dies instantly