
Some subset of MediaWiki Logstash events are capped to 100/s
Closed, ResolvedPublic

Description

Per T393038#10810826, some subset of MediaWiki Logstash logs seem to be capped at 100/sec, with further log events discarded. (Not whole channels, and certainly not the total log volume, which is around 400/sec; the cap seems to be based on the normalized_message field?)

The dropped events are not in the dead letter queue, which is much lower volume. I think they make it through udp2log, so this is specific to the Logstash-only part of the logging stack. The cap seems to be applied with a few-minute window, e.g. here is a high-volume log event (normalized_message:"Central autologin attempt"):

[Screenshot: Screenshot Capture - 2025-06-03 - 14-52-33.png]

It shows exactly 90K events in 15 minutes (i.e. 100/sec), but distributed in an uneven sawtooth pattern: exactly 15K events in a roughly 1-minute window, then zero events for slightly less than either one or two minutes (the two gap lengths seem to alternate). So maybe a combination of a 15K/60s cap and a 30K/300s cap?
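For intuition, here is a toy model of how a fixed-window cap on a steady stream would produce exactly this burst-then-silence shape (the rate, cap and window values are assumptions based on the guesses above, not confirmed config, and the alternating gap lengths are not reproduced):

<?php
// Toy model (not the real pipeline): a steady stream passed through
// a hypothetical fixed-window cap. All values are assumptions.
$rate   = 800;    // assumed steady input rate, well above the cap
$cap    = 30000;  // hypothesized max events per window
$window = 300;    // hypothesized window length in seconds

$passedInWindow = 0;
for ( $t = 0; $t < 900; $t++ ) { // simulate 15 minutes in 1s steps
    if ( $t % $window === 0 ) {
        $passedInWindow = 0; // window resets, the burst resumes
    }
    $passed = min( $rate, $cap - $passedInWindow );
    $passedInWindow += $passed;
    if ( $t % 30 === 0 ) {
        // ~37s of 800/sec at the start of each window, then silence:
        // a sawtooth averaging exactly 100/sec (30000 / 300).
        printf( "t=%3ds passing %d/sec\n", $t, $passed );
    }
}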

While this is probably a good thing (the log stream where I noticed this turned out to be much higher volume than I thought, and without such filtering it would have tripled the total MediaWiki log volume), it doesn't seem to be documented anywhere and can cause a lot of confusion. (Docs I checked: wikitech:Logstash, especially the Configuration, Gotchas and Troubleshooting sections; wikitech:Logs#mwlog1002:/srv/mw-log/; wikitech:Rsyslog; wikitech:Kafka#logging (eqiad and codfw).) Nor did I find any guidance on what an acceptable log volume would be for a new event, or for MediaWiki logs in total.

Event Timeline

I skimmed codesearch results for 100, 6000 and 30000 but did not see anything relevant. The rsyslog config has RateLimit.Burst=100 for MediaWiki, but that doesn't really match the observed behavior.

A check for normalized_message:"Central autologin attempt" in udp2log files:

tgr@mwlog1002:/srv/mw-log$ ack '^2025-06-03 00:20:' authevents.log > ~/authevents-2025-06-03-00-20.txt
tgr@mwlog1002:/srv/mw-log$ ack 'Central autologin attempt' ~/authevents-2025-06-03-00-20.txt | wc -l
44659

so that's roughly 750/sec. (And a massive waste of disk space: authevents really shouldn't be recorded at all. The same is true for some other log channels, like poolcounter or xff.)
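Quick arithmetic on that sample (the extrapolation assumes this minute is typical):

<?php
// Extrapolating from the one-minute udp2log sample above.
$perMinute = 44659;                                      // matching lines in one minute
printf( "%.0f/sec\n", $perMinute / 60 );                 // ~744/sec
printf( "~%.0fM lines/day\n", $perMinute * 1440 / 1e6 ); // ~64M lines/day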

A negative side effect of this is that the mwdebug Logstash dashboard becomes unreliable. It would be nice to exempt the debug hosts from the cap, since they don't generate much log volume.

@Tgr wrote in the task description:

[…] exactly 90K events in 15 minutes (i.e. 100/sec), but distributed in an uneven sawtooth pattern: exactly 15K events in a roughly 1-minute window, then zero events for slightly less than either one or two minutes (the two gap lengths seem to alternate).
So maybe a combination of a 15K/60s cap and a 30K/300s cap? […]

$ ssh mwdebug1001.eqiad.wmnet 

krinkle@mwdebug1001:~$ mwscript eval.php testwiki
> $logger = MediaWiki\Logger\LoggerFactory::getInstance('AdHocDebug');
> $logger->info('Hello from Krinkle on mwdebug eval');
> exit;

krinkle@mwdebug1001:~$ mwscript eval.php testwiki
> $logger = MediaWiki\Logger\LoggerFactory::getInstance('AdHocDebug');
> for ($i = 0; $i < 200; $i++) { if ($i % 42 === 0) { echo "$i\n"; } }
0
42
84
126
168
> for ($i = 0; $i < 2000; $i++) { $logger->info('Hello from noisy Krinkle'); if ($i % 42 === 0) { $logger->info('Hello from not-noisy Krinkle every 42nd iteration'); } }
> exit;

[Screenshot: Screenshot 2025-06-03 at 13.31.38.png]

All 2000 noisy and all 48 non-noisy messages arrived within a 30s window, which confirms Gergo's theory that there is indeed no strict 100/s limit.
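For reference, the probe pattern used in these eval.php sessions generalizes to a small helper; this is just a sketch against the standard LoggerFactory API, with arbitrary channel and message strings:

<?php
use MediaWiki\Logger\LoggerFactory;

/**
 * Emit $total copies of one high-volume message, plus a low-volume
 * control message every $every iterations, to check whether throttling
 * applies per normalized message rather than per channel.
 */
function probeThrottle( string $channel, int $total, int $every ) {
    $logger = LoggerFactory::getInstance( $channel );
    for ( $i = 0; $i < $total; $i++ ) {
        $logger->info( 'Noisy probe message' );
        if ( $i % $every === 0 ) {
            $logger->info( "Control probe message (1 in $every)" );
        }
    }
}

// Usage inside eval.php, matching the sessions here:
// probeThrottle( 'AdHocDebug', 2000, 42 );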

krinkle@mwdebug1001:~$ mwscript eval.php testwiki
> $logger = MediaWiki\Logger\LoggerFactory::getInstance('AdHocDebug');
> for ($i = 0; $i < 20_000; $i++) { $logger->info('My noisy message'); if ($i % 420 === 0) { $logger->info('My not-noisy 1-in-420 message'); } }
> exit;

[Screenshot: Screenshot 2025-06-03 at 13.31.43.png]

All 20,000 noisy and 48 non-noisy messages arrived within a 30s window. This suggests there is not a 15K burst limit.

krinkle@mwdebug1001:~$ mwscript eval.php testwiki
> $logger = MediaWiki\Logger\LoggerFactory::getInstance('AdHocDebug');
> for ($i = 0; $i < 80_000; $i++) { $logger->info('My other noisy message'); if ($i % 420 === 0) { $logger->info('Some other not-noisy 1-in-420 message'); } }

[Screenshot: Screenshot 2025-06-03 at 13.31.49.png]

Exactly 30,000 of the 80,000 noisy messages made it. The other 50K were silently dropped. Meanwhile, all 191 non-noisy messages, distributed throughout the run, made it through.

This suggests there is indeed a 30K per-message burst cap. I don't know the interval, but 5 minutes seems likely, given that after a 5-minute gap the CentralAuth messages re-surface in Logstash in the sawtooth pattern Gergo saw.

@Tgr wrote in the task description:

seems to be based on the normalized_message field?

A Codesearch in Puppet for normalized_message brings me to:

https://gerrit.wikimedia.org/g/operations/puppet/+/9478f801d8850d23ff056ffc02128061ee4d52d2/modules/profile/files/logstash/filters/75-filter_throttle.conf

# add "throttled" tag at > 5000 dupes within 5 min per-instance (~30/sec cluster-wide, see note below)
# note: this permits between 16-50 sustained errors/sec cluster-wide (log events are not perfectly balanced between all 3 logstash collector instances)
throttle {
  after_count  => 5000
  period       => 300
  max_age      => 
  key          => "%{type}%{channel}%{level}%{normalized_message}"
  add_tag      => "throttled"
  id           => 
}
if "throttled" in [tags] {
  drop {}
}
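If I read those settings right, the semantics are: count events per key (type + channel + level + normalized_message), and once a key exceeds after_count within period, tag further events for that key so the next block drops them. Sketched in PHP (my reading, not the plugin's actual implementation):

<?php
// Sketch of the per-instance throttle semantics as I read them
// (not the logstash-filter-throttle implementation itself).
$afterCount = 5000; // events allowed per key per window
$period     = 300;  // window length in seconds
$counters   = [];   // key => [ windowStart, count ]

function shouldDrop( array &$counters, string $key, int $now,
    int $afterCount, int $period ): bool {
    if ( !isset( $counters[$key] ) || $now - $counters[$key][0] >= $period ) {
        $counters[$key] = [ $now, 0 ]; // start a fresh window for this key
    }
    $counters[$key][1]++;
    // Tag as "throttled" (and thus drop) past after_count.
    return $counters[$key][1] > $afterCount;
}

// Keying on %{type}%{channel}%{level}%{normalized_message} means one
// flood only throttles itself, not other messages on the same channel:
$key = 'mediawiki' . 'CentralAuth' . 'INFO' . 'Central autologin attempt';
var_dump( shouldDrop( $counters, $key, time(), $afterCount, $period ) ); // bool(false) on the first event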

I just found this in search, so this file might be unused, outdated, or not applicable to MediaWiki messages.

It looks promising. It mentions a 5-minute window (300s), as Gergo saw, but it says messages are dropped only after 5K duplicates within that window. That isn't exactly what we observe, but it could be it. It says the throttling is local to each Logstash processor, and that there are 3 Logstash instances in production.

That would amount to ~15K messages per 5min, if the messages are exactly evenly distributed.

For our observation to be caused by this config file, there'd have to be 6 Logstash instances (30K/5K = 6).

Are there? The code comment says we have 3, but that could be outdated.

https://grafana.wikimedia.org/d/000000561/logstash

This shows internal Logstash metrics. I don't know what most of them mean, but the first panel is an easy one: it reports how many messages are being processed, broken down by instance.

[Screenshot: Screenshot 2025-06-03 at 13.54.44.png]

And I count... 6 distinct instances, which would make the numbers line up (see below):

  1. logstash1023
  2. logstash1024
  3. logstash1025
  4. logstash1030
  5. logstash1031
  6. logstash1032
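Putting the numbers together (assuming events are spread evenly across the collectors):

<?php
// Reconciling the config with the observations, assuming events are
// spread evenly across the collector instances counted above.
$instances  = 6;    // from the Grafana dashboard
$afterCount = 5000; // per-instance cap in 75-filter_throttle.conf
$period     = 300;  // seconds

$clusterCap = $instances * $afterCount; // 30,000 per 5-min window
printf( "%d per %ds = %d/sec average\n",
    $clusterCap, $period, $clusterCap / $period ); // 100/sec
// Matches both the 30K burst cap measured above and the 100/sec
// cap from the task description.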
Tgr claimed this task.

@Krinkle found the config file where this is set up. I added some docs and filed T395967: Allow sampling of Logstash events, about handling high-volume event streams in a better way; I think this can be closed.