icinga notification if elevated writing to badpass.log
Closed, Declined; Public

Description

It would be useful to know when there is more writing than normal to this log. The same applies to general logs/errors.

Related incident: https://wikitech.wikimedia.org/wiki/Incident_documentation/20161112-OurMine

Event Timeline

@bd808 pointed to the Kibana watcher plugin: https://github.com/elasticfence/kaae

If we wanted to try this plugin out, I think we would want to set up a new Kibana instance somewhere. The current logstash.wikimedia.org Kibana is actually 3 backend servers running behind an LB. This raises 2 problems: you don't know which of the 3 you are getting round-robined to, and if they share state via the Elasticsearch cluster (which looks like how things are stored) then you would potentially get 3 alerts for each watch that fired.

We could (also) export the number of lines written to badpass to graphite and set up an icinga alert. The metric would be public, though, and so would the alert; I don't think that would be particularly troubling.

https://graphite.wikimedia.org/render?from=-2hours&until=now&width=400&height=250&target=logstash.rate.mediawiki.badpass.INFO.count&_uniq=0.9259289370548927&title=logstash.rate.mediawiki.badpass.INFO.count

@bd808 looks good! I guess the regular icinga/graphite check could be used in this case then
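For illustration, the kind of threshold logic such a graphite-backed check applies can be sketched as below. This is a minimal sketch, not the actual icinga check: the function name `check_series`, the `warn`/`crit` values and the `min_fraction` parameter are all hypothetical. It assumes the JSON shape returned by Graphite's render endpoint with `format=json` (a list of series, each with `datapoints` as `[value, timestamp]` pairs) and maps the result onto the usual plugin exit codes.

```python
import json

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_series(render_json, warn, crit, min_fraction=0.5):
    """Evaluate a Graphite render?format=json payload against thresholds.

    warn, crit and min_fraction are hypothetical tuning knobs: the check
    fires when at least min_fraction of the non-null datapoints in the
    first series are at or above the given threshold.
    """
    series = json.loads(render_json)
    if not series:
        return UNKNOWN, "no series returned"

    # Each datapoint is [value, timestamp]; value is null when nothing was recorded.
    values = [v for v, _ts in series[0]["datapoints"] if v is not None]
    if not values:
        return UNKNOWN, "no non-null datapoints"

    frac_crit = sum(v >= crit for v in values) / len(values)
    frac_warn = sum(v >= warn for v in values) / len(values)
    if frac_crit >= min_fraction:
        return CRITICAL, f"{frac_crit:.0%} of datapoints >= {crit}"
    if frac_warn >= min_fraction:
        return WARNING, f"{frac_warn:.0%} of datapoints >= {warn}"
    return OK, f"{len(values)} datapoints within thresholds"


if __name__ == "__main__":
    # Made-up sample data in the render API's JSON shape.
    sample = json.dumps([{
        "target": "logstash.rate.mediawiki.badpass.INFO.count",
        "datapoints": [[1, 1480000000], [2, 1480000060], [None, 1480000120],
                       [50, 1480000180], [60, 1480000240]],
    }])
    code, message = check_series(sample, warn=10, crit=40)
    print(code, message)
```

Requiring a fraction of datapoints over the threshold, rather than a single spike, is one way to avoid paging on a brief burst; sensible values would have to come from looking at the real graph.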

fgiunchedi triaged this task as Medium priority. Nov 30 2016, 2:10 AM

I had a few minutes so I looked at this because it would be super swell to have it rigged up. It's a bit complicated at the moment.

Note that T123243 (Ability to alert when we get a sudden increase in bad passwords for privileged accounts) and T193769 (Thousands of failed login attempts (wrong password)) are closely related.

This is still present...

logstash.rate.mediawiki.badpass.INFO.count

https://graphite.wikimedia.org/render?from=-2hours&until=now&width=800&height=500&target=logstash.rate.mediawiki.badpass.INFO.count&_uniq=0.9259289370548927&title=logstash.rate.mediawiki.badpass.INFO.count&from=-72h

but in looking at badpass.log I noticed content that seems to indicate successful authentication rather than failed. That set me wondering whether the graph is a sane representation of what we would want to monitor. In talking with @Reedy a bit, I tracked it back to {T150554} (which was eventually declined, but changes were associated with it)

https://noc.wikimedia.org/conf/highlight.php?file=CommonSettings.php

// T150554 log successful attempts too
$wgHooks['AuthManagerLoginAuthenticateAudit'][] = function ( $response, $user, $username ) {
	if ( $response->status === \MediaWiki\Auth\AuthenticationResponse::PASS ) {
		global $wgRequest;
		$headers = function_exists( 'apache_request_headers' ) ? apache_request_headers() : [];

		$privGroups = wfGetPrivilegedGroups( $username, $user );
		$logger = LoggerFactory::getInstance( 'badpass' );
		$logger->info( 'Login succeeded for {priv} {name} from {ip} - {xff} - {ua} - {geocookie}', [
			'successful' => true,
			'groups' => implode( ', ', $privGroups ),
			'priv' => count( $privGroups ) ? 'elevated' : 'normal',
			'name' => $user->getName(),
			'ip' => $wgRequest->getIP(),
			'xff' => @$headers['X-Forwarded-For'],
			'ua' => @$headers['User-Agent'],
			'geocookie' => $wgRequest->getCookie( 'GeoIP', '' ),
		] );
	}
};

https://phabricator.wikimedia.org/source/mediawiki-config/browse/master/wmf-config/CommonSettings.php

https://phabricator.wikimedia.org/rOMWCa3eb73714cd332ee2139dade020f3aa88bab8c76

Authored by Tgr on Nov 12 2016, 1:26 PM.
Log successful login attempts for a while
Includes https://gerrit.wikimedia.org/r/#/c/321114/ too

https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/321114 abandoned in favor of
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/321926/

It seems like this was due to the need to quantify logins historically.

(Another note, as indicated on the task: CentralAuth does record failed logins, but the logging is somewhat sparse: https://logstash.wikimedia.org/app/kibana#/dashboard/default?_g=h@e0234f6&_a=h@fcff38f)

So we'll need to unwind this a bit to make sure badpass.log only records events we want associated with failure to authenticate, and potentially shift the successful logins to another file or discontinue them. Since it's been happening for the last few years, I suspect we should just keep it, as it's a fairly useful thing that otherwise gets added ad hoc as needed anyway.
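To make the counting problem concrete, here is a rough sketch of splitting badpass events by outcome before counting them. It relies on the `successful` context field set by the hook quoted above; that failed-login events never set `successful` to true is an assumption on my part, and `count_by_outcome` and the sample events are hypothetical.

```python
from collections import Counter


def count_by_outcome(events):
    """Count structured log events as 'success' or 'failure'.

    Assumption: only successful logins carry successful=True (as in the
    hook quoted above); anything else is treated as a failed attempt.
    """
    counts = Counter()
    for event in events:
        outcome = "success" if event.get("successful") is True else "failure"
        counts[outcome] += 1
    return counts


if __name__ == "__main__":
    # Made-up events; real badpass entries carry more context fields.
    events = [
        {"successful": True, "priv": "normal"},
        {"priv": "elevated"},
        {"priv": "normal"},
    ]
    counts = count_by_outcome(events)
    # A single per-channel rate lumps all three events together; only the
    # 'failure' bucket is what an alert on bad passwords should watch.
    print(counts["failure"], "failed of", sum(counts.values()), "total")
```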

Change 464077 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[operations/mediawiki-config@master] Move auth logging to different channels for easier counting

https://gerrit.wikimedia.org/r/464077

Thanks @Tgr!

Change 464077 merged by jenkins-bot:
[operations/mediawiki-config@master] Move auth logging to different channels for easier counting

https://gerrit.wikimedia.org/r/464077

Mentioned in SAL (#wikimedia-operations) [2018-11-01T00:05:47Z] <tgr@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:464077|Move auth logging to different channels for easier counting (T150300, T123243)]] (duration: 00m 53s)

Mentioned in SAL (#wikimedia-operations) [2018-11-01T00:07:13Z] <tgr@deploy1001> Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:464077|Move auth logging to different channels for easier counting (T150300, T123243)]] (duration: 00m 53s)

akosiaris subscribed.

No comments or updates in 4+ years. I am gonna suggest we resolve this, but SRE isn't the one driving this forward, so I'll just remove SRE instead.

sbassett moved this task from Back Orders to Frozen on the Security-Team board.
sbassett subscribed.

The Security-Team are the ostensible drivers of this work, but we have no resources or plans to work on it, so I'll mark it declined for now.