
Allow sampling of Logstash events
Closed, Resolved, Public

Description

$wmgMonologChannels has a sample property, but using it automatically disables sending the events to Logstash. Sampling would be a step up from the much less controllable throttling mechanism of T395899 (Some subset of MediaWiki Logstash events are capped to 100/s), so we should fix that.

Event Timeline

Change #1153363 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[operations/mediawiki-config@master] logging: Allow sampling of Logstash logs

https://gerrit.wikimedia.org/r/1153363

Sampling would be a step up from the much less controllable throttling mechanism of T395899, so we should fix that.

Can you say more explicitly what the intended gain is?

I think there's a hypothetical scenario, different from ours, where application-side sampling could improve visibility. For example, let's imagine that the Logstash-side throttling were indiscriminate at the level of an entire channel, or server, or wiki, or something else crude. That would mean that when we hit that limit due to one very noisy message, we lose a bunch of valuable insight in other channels for a few minutes until the throttle rolls over. In that scenario, identifying the noisy channel and throttling it ourselves gives us back some budget to spend on (or, not lose) other messages.

The scenario we have, as described at T395899, seems to be that the Logstash-side throttling is already at the most granular or "fair" level: per unique message. So it would seem that if we implement our own proactive throttling on a less granular basis (i.e. an entire channel, as the patch proposes), we'd end up with the same or less, not more. I don't just mean less in overall volume, but also in terms of distinct kinds of messages and coverage.

The one thing I see that it would do is spread out the allowance for that one noisy message. So let's say we have a very noisy message: it already doesn't affect other messages in the same or other channels, but it does (naturally) affect itself. With proactive sampling, we'd trade today's burst of 5K messages in 1 min followed by a 4 min blind spot for a somewhat continuous <100/second trickle, if we stay under the limit.

Is that the main motivation? Or is there something else?
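The trade-off described in this comment (a burst followed by a blind spot, versus a continuous trickle) can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function names, the 500 events/s input rate, the 5000-events-per-5-minutes budget and the 1:30 sampling rate are all hypothetical and are not the actual Logstash or MediaWiki settings.

```python
import random


def throttle(timestamps, budget, window):
    """Keep the first `budget` events of each `window`-second interval."""
    kept, used = [], {}
    for t in timestamps:
        slot = int(t // window)
        if used.get(slot, 0) < budget:
            used[slot] = used.get(slot, 0) + 1
            kept.append(t)
    return kept


def sample(timestamps, rate, rng):
    """Keep each event independently with probability 1/rate."""
    return [t for t in timestamps if rng.randrange(rate) == 0]


def longest_gap(kept, start, end):
    """Longest stretch (in seconds) during which no event was retained."""
    points = [start, *kept, end]
    return max(b - a for a, b in zip(points, points[1:]))


if __name__ == "__main__":
    # Hypothetical load: one noisy message logged 500 times/s for 5 minutes.
    events = [i / 500 for i in range(150_000)]
    throttled = throttle(events, budget=5000, window=300)
    sampled = sample(events, rate=30, rng=random.Random(0))
    print("throttled:", len(throttled), "kept, longest gap",
          round(longest_gap(throttled, 0, 300)), "s")
    print("sampled:  ", len(sampled), "kept, longest gap",
          round(longest_gap(sampled, 0, 300), 2), "s")
```

Both approaches retain a similar number of events in this setup, but the throttled stream spends its whole budget in the first seconds of the window, while the sampled stream stays spread across the full five minutes.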

Can you say more explicitly what the intended gain is?

  1. more confidence that we are looking at a random sample in the statistical sense, rather than the one-minute-on one-minute-off pattern of the throttling masking something
  2. WikimediaDebug log=1 logs do not get sampled, but they do get throttled

Also, with sampling you can see changes in total volume. Without it, it just maxes out at 100/sec, and to see if a code change affected the log volume you need to filter on pseudo-random subsets (like an arbitrary wiki or user agent), which is a hassle and somewhat unreliable.
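To illustrate the point about total volume, here is a minimal sketch comparing what a hard cap and a random sample each reveal when the true volume doubles. The helper names, the per-minute volumes and the 1:10 rate are hypothetical; only the 100/sec cap (6000 per minute) comes from the discussion above.

```python
import random


def throttled_count(true_count, cap):
    """What a hard cap lets through: never more than `cap`."""
    return min(true_count, cap)


def sampled_count(true_count, rate, rng):
    """Events kept when each one is retained with probability 1/rate."""
    return sum(1 for _ in range(true_count) if rng.randrange(rate) == 0)


def estimate_total(kept, rate):
    """Scale the sampled count back up to an estimate of the true volume."""
    return kept * rate


if __name__ == "__main__":
    rng = random.Random(0)
    cap_per_minute = 100 * 60  # the 100/sec cap, expressed per minute
    # Hypothetical per-minute volumes before and after a code change.
    for label, true_count in [("before", 24_000), ("after", 48_000)]:
        capped = throttled_count(true_count, cap_per_minute)
        kept = sampled_count(true_count, rate=10, rng=rng)
        print(label, "capped view:", capped,
              "sampled estimate:", estimate_total(kept, rate=10))
```

Under the cap, both runs show the same 6000 events, so the doubling is invisible. The sampled estimate is noisy but tracks the true volume. The same scaling applies to any 1:N rate: each retained event stands for roughly N real ones.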

Change #1153363 merged by jenkins-bot:

[operations/mediawiki-config@master] logging: Allow sampling of Logstash logs

https://gerrit.wikimedia.org/r/1153363

Mentioned in SAL (#wikimedia-operations) [2025-06-09T13:06:59Z] <taavi@deploy1003> Started scap sync-world: Backport for [[gerrit:1153363|logging: Allow sampling of Logstash logs (T395967)]], [[gerrit:1153364|logging: Sample some high-volume log streams (T394402)]]

Mentioned in SAL (#wikimedia-operations) [2025-06-09T13:21:13Z] <taavi@deploy1003> taavi, tgr: Backport for [[gerrit:1153363|logging: Allow sampling of Logstash logs (T395967)]], [[gerrit:1153364|logging: Sample some high-volume log streams (T394402)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-06-09T13:31:29Z] <taavi@deploy1003> Finished scap sync-world: Backport for [[gerrit:1153363|logging: Allow sampling of Logstash logs (T395967)]], [[gerrit:1153364|logging: Sample some high-volume log streams (T394402)]] (duration: 24m 30s)

matmarex subscribed.

Works as expected. This is the effect on authevents logs:

image.png (1,535×184 px, 18 KB)

The session writes dashboard is now sampled 1:1000 and seems to behave reasonably.