$wmgMonologChannels has a sample property, but using it automatically disables sending the events to Logstash. Sampling would be a step up from the much less controllable throttling mechanism of T395899 (Some subset of MediaWiki Logstash events are capped to 100/s), so we should fix that.
Description
Details
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| logging: Allow sampling of Logstash logs | operations/mediawiki-config | master | +4 -9 |
| Status | Assigned | Task |
|---|---|---|
| Resolved | DAlangi_WMF | T394402 Reduce noisy auth logs |
| Resolved | Tgr | T395967 Allow sampling of Logstash events |
Event Timeline
Change #1153363 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):
[operations/mediawiki-config@master] logging: Allow sampling of Logstash logs
Sampling would be a step up from the much less controllable throttling mechanism of T395899, so we should fix that.
Can you say more explicitly what the intended gain is?
I think there's a hypothetical scenario, different from ours, where application-side sampling could improve visibility. For example, let's imagine that the Logstash-side throttling were indiscriminate at the level of an entire channel, or server, or wiki, or something else crude. That would mean that when we hit that limit due to one very noisy message, we lose a bunch of valuable insight in other channels for a few minutes until the throttle rolls over. In that scenario, identifying the noisy channel and throttling it ourselves gives us back some budget to spend on (or not lose) other messages.
The scenario we have, as described at T395899, seems to be that the Logstash-side throttling is already at the most granular or "fair" level: per unique message. So it would seem that if we implement our own proactive throttling on a less granular basis (i.e. an entire channel, as the patch proposes), we'd end up with the same or less, not more. I don't just mean less in overall volume, but also in terms of distinct kinds of messages and coverage.
The one thing I see that it would do is spread out the allowance for that one noisy message. Say we have a very noisy message: it already doesn't affect other messages in the same or other channels, but it does (naturally) affect itself. With proactive sampling, we'd trade today's burst of 5K messages in 1 min followed by a 4 min blind spot for a somewhat continuous <100/second trickle, if we stay under the limit.
Is that the main motivation? Or is there something else?
- more confidence that we are looking at a random sample in the statistical sense, rather than the one-minute-on one-minute-off pattern of the throttling masking something
- WikimediaDebug log=1 logs do not get sampled, but they do get throttled
Also, with sampling you can see changes in total volume. Without it, the volume just maxes out at 100/sec, and to see whether a code change affected the log volume you need to filter on pseudo-random subsets (like an arbitrary wiki or user agent), which is a hassle and somewhat unreliable.
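The point about total volume can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the throttle is modelled as a flat per-second cap of 100 (the real mechanism in T395899 works on longer windows), sampling is modelled as keeping each event independently with probability 1/rate, and the function names, the event rates and the 1-in-20 rate are all made up for illustration rather than taken from the patch.

```python
import random

CAP_PER_SECOND = 100  # cap quoted in T395899; modelled here as a flat per-second limit


def throttled_count(events_per_second, seconds, cap=CAP_PER_SECOND):
    """Events that survive when every second is capped at `cap` events."""
    return min(events_per_second, cap) * seconds


def sampled_count(events_per_second, seconds, rate, rng):
    """Events that survive when each event is kept with probability 1/rate."""
    total = events_per_second * seconds
    return sum(1 for _ in range(total) if rng.random() < 1 / rate)


def estimate_volume(kept, rate):
    """Scale the sampled count back up to an estimate of the true volume."""
    return kept * rate


if __name__ == "__main__":
    rng = random.Random(0)
    rate = 20  # hypothetical 1-in-20 sampling, keeps both cases under the cap
    for label, per_second in (("before", 500), ("after", 1000)):
        capped = throttled_count(per_second, 60)
        kept = sampled_count(per_second, 60, rate, rng)
        print(label, "throttled:", capped, "sampled estimate:", estimate_volume(kept, rate))
```

In this model the throttled count is identical for both cases (6000 events per minute), so a doubling of the underlying volume is invisible, while the sampled estimate roughly doubles along with it.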
Change #1153363 merged by jenkins-bot:
[operations/mediawiki-config@master] logging: Allow sampling of Logstash logs
Mentioned in SAL (#wikimedia-operations) [2025-06-09T13:06:59Z] <taavi@deploy1003> Started scap sync-world: Backport for [[gerrit:1153363|logging: Allow sampling of Logstash logs (T395967)]], [[gerrit:1153364|logging: Sample some high-volume log streams (T394402)]]
Mentioned in SAL (#wikimedia-operations) [2025-06-09T13:21:13Z] <taavi@deploy1003> taavi, tgr: Backport for [[gerrit:1153363|logging: Allow sampling of Logstash logs (T395967)]], [[gerrit:1153364|logging: Sample some high-volume log streams (T394402)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
Mentioned in SAL (#wikimedia-operations) [2025-06-09T13:31:29Z] <taavi@deploy1003> Finished scap sync-world: Backport for [[gerrit:1153363|logging: Allow sampling of Logstash logs (T395967)]], [[gerrit:1153364|logging: Sample some high-volume log streams (T394402)]] (duration: 24m 30s)
