Page MenuHomePhabricator

Reduce noisy auth logs
Closed, ResolvedPublic

Description

As an example: For the last 15 minutes, there were 705,279 mediawiki logs in total. Out of which, were 264,194 session channel, 103,375 in authentication channel and 98,460 were authevents logs. All three being the top channels. Of course we still should keep many of them, e.g. successful or unsuccessful logins, but 66% of all mediawiki logs being related to this seems a bit excessive. I wonder if there are a couple of low hanging fruits we can cut and call it a day?

Here is the top messages:

grafik.png (384×683 px, 43 KB)

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

authevents is a dedicated log channel for keeping track of authentication volume. It's primarily for Prometheus, so we can turn off the Logstash backend if someone feels strongly about it, but it's nice to have IMO. The Persisting... stuff was added for {T309943} & co, I don't think we need it anymore. The Session store:... stuff is for {T390514}, we are still actively investigating that. Might want to keep in the longer term too but we'll see. The SUL3 logs aren't needed (the idea was to track bounce rates but due to the massive amount of scraping on the login page, that ended up being completely useless).

Change #1146716 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[mediawiki/extensions/CentralAuth@master] Reduce authentication related logs after SUL3 rollout

https://gerrit.wikimedia.org/r/1146716

The Persisting... stuff was added for {T309943} & co, I don't think we need it anymore

I think this will be a partial revert of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1143959 and I imagine that removing the else block will be enough here?

else {
	$message = "$verb session due to {$this->persistenceChangeReason}() call";
}

Are you done with your investigation Tgr or can this just go away now?

Change #1146719 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[mediawiki/core@master] session: Reduce session/auth related logs post investigation

https://gerrit.wikimedia.org/r/1146719

Thank you! From my POV, as long as it's somewhat reasonable amount of logs, I have no preference what should be cut (as I don't use those logs, except goodpass/badpass) I leave the decision to you!

Change #1146716 merged by jenkins-bot:

[mediawiki/extensions/CentralAuth@master] Reduce authentication related logs after SUL3 rollout

https://gerrit.wikimedia.org/r/1146716

The Persisting... stuff was added for {T309943} & co, I don't think we need it anymore

I think this will be a partial revert of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1143959 and I imagine that removing the else block will be enough here?

No, that's the logging happenin in the logPersistenceChange() method.

In terms of total log volume, these are insignificant. We should fix them (especially the $_SESSION one which goes to the error log which is normally very low-volume) but it's a different kind of problem.

Change #1147005 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[mediawiki/extensions/CentralAuth@master] Drop prometheus metrics in CentralAuthRedirectingPrimaryAuth provider

https://gerrit.wikimedia.org/r/1147005

No, that's the logging happenin in the logPersistenceChange() method.

I was looking for a way to remove just a subset of the logging - at least the one that is being logged the most (not everything). But as discussed on the patch, I've just made this debug logs instead. That should suffice for now until we can finally remove the entire method soon.

Change #1147005 merged by jenkins-bot:

[mediawiki/extensions/CentralAuth@master] Drop prometheus metrics in CentralAuthRedirectingPrimaryAuth provider

https://gerrit.wikimedia.org/r/1147005

Change #1146719 merged by jenkins-bot:

[mediawiki/core@master] session: Reduce session/auth related logs post investigation

https://gerrit.wikimedia.org/r/1146719

logPersistenceChange() logs cookie writes. Those are not particularly relevant to the current investigation which is about session store writes. They were added for issues where the cookies got cached and sent to the wrong user, but we didn't have such issues for years so probably fine to drop that logging.

Change #1147827 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[mediawiki/core@master] session: Drop the `SessionBackend::logPersistenceChange()` method

https://gerrit.wikimedia.org/r/1147827

Change #1147877 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/CentralAuth@master] Revert "CentralAuthTokenManager: Log failures for write operations"

https://gerrit.wikimedia.org/r/1147877

Change #1147827 merged by jenkins-bot:

[mediawiki/core@master] session: Drop the `SessionBackend::logPersistenceChange()` method

https://gerrit.wikimedia.org/r/1147827

Change #1147877 merged by jenkins-bot:

[mediawiki/extensions/CentralAuth@master] Revert "CentralAuthTokenManager: Log failures for write operations"

https://gerrit.wikimedia.org/r/1147877

Logstash query to check our progress once wmf.3 deployments are done: https://logstash.wikimedia.org/goto/abd93ced8733fa9289b42f12a834a6be

Looks good to me. @Ladsgroup, are you fine with how https://logstash.wikimedia.org/goto/c10db2bb43a11a73ece54e44918e895d looks now? If so then I think we can call this done?

The three main sources of log volume were:

  • logPersitenceChange() (removed)
  • logSessionWrite() (still needed for a while)
  • authevents (not primarily a Logstash thing but we should review how much other log channels cover the information need)

(and the SUL3 login attempt start/end thing, also removed)

Logstash query to check our progress once wmf.3 deployments are done: https://logstash.wikimedia.org/goto/abd93ced8733fa9289b42f12a834a6be

Looks good to me. @Ladsgroup, are you fine with how https://logstash.wikimedia.org/goto/c10db2bb43a11a73ece54e44918e895d looks now? If so then I think we can call this done?

Thanks! wmf.3 is not deployed yet (from what I'm seeing in Special:Version) so once that's out there the impact would be even more visible. I'm grateful. Whatever you decide is fine with me.

Per T395899: Some subset of MediaWiki Logstash events are capped to 100/s it seems that the authevents and session channels together produce something like 2000 events per second. Ouch. That's five times the total Logstash volume. They don't all make it into Logstash, fortunately, but they do make it to mwlog and eat up disk space. We should sample them for Logstash and disable udp2log altogether.

Change #1153364 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[operations/mediawiki-config@master] logging: Sample some high-volume log streams

https://gerrit.wikimedia.org/r/1153364

Change #1153721 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/core@master] session: Send high-volume session write logs to separate channel

https://gerrit.wikimedia.org/r/1153721

Change #1153364 merged by jenkins-bot:

[operations/mediawiki-config@master] logging: Sample some high-volume log streams

https://gerrit.wikimedia.org/r/1153364

Mentioned in SAL (#wikimedia-operations) [2025-06-09T13:06:59Z] <taavi@deploy1003> Started scap sync-world: Backport for [[gerrit:1153363|logging: Allow sampling of Logstash logs (T395967)]], [[gerrit:1153364|logging: Sample some high-volume log streams (T394402)]]

Mentioned in SAL (#wikimedia-operations) [2025-06-09T13:21:13Z] <taavi@deploy1003> taavi, tgr: Backport for [[gerrit:1153363|logging: Allow sampling of Logstash logs (T395967)]], [[gerrit:1153364|logging: Sample some high-volume log streams (T394402)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-06-09T13:31:29Z] <taavi@deploy1003> Finished scap sync-world: Backport for [[gerrit:1153363|logging: Allow sampling of Logstash logs (T395967)]], [[gerrit:1153364|logging: Sample some high-volume log streams (T394402)]] (duration: 24m 30s)

Change #1153721 merged by jenkins-bot:

[mediawiki/core@master] session: Send high-volume session write logs to separate channel

https://gerrit.wikimedia.org/r/1153721

Last 15 minutes now: about 300K, none of it authentication-related.

Screenshot Capture - 2025-06-16 - 12-40-29.png (607×732 px, 77 KB)
Screenshot Capture - 2025-06-16 - 12-40-05.png (875×1 px, 272 KB)

(Also we have T395967: Allow sampling of Logstash events now, that could be applied to other logs which are high-volume and not that important.)