Page MenuHomePhabricator

Throttling isn't counted at all for filters using it
Closed, ResolvedPublicPRODUCTION ERROR

Description

Today on itwiki I noticed a serious problem: as any other wikis, we have several filters set to throttle and then disallow the edit (or block the user) after a fixed amount of edits. By chance, I noticed that one of these filters isn't working at all: see filter settings and its AbuseLog, compared to the amount of edits of the last user hitting it (requires privileges). The rate limiter isn't taken into account at all, thus always disallowing the first edit. The last change to that filter is from May, and the new version of MediaWiki (1.33.0-wmf.6), which solves throttle-related problems, still hasn't got on itwiki. Since that filter has nothing special and the problem doesn't directly depends on a MediaWiki update, this means that potentially every filter using throttle in WMF prod is misbehaving and ignoring throttle.
I can surely say that the problem started yesterday between 16:43-16:45 UTC (basing on user contributions and filter hits); so, in order to move on, we basically need two things:

  • Understand what happened yesterday at that time
  • A logstash grep for basically every message passed to logger->debug() inside AbuseFilter::isThrottled(), for itwiki and for the time period 16:30 UTC - 9:30 UTC (if there are too many entries we'll see how to refine the search)

Needless to say, I already disabled the offending filter on itwiki, but that wouldn't help much in the case of a "global" bug.

Event Timeline

Daimona triaged this task as Unbreak Now! priority.Nov 29 2018, 9:40 AM

Just making it clear, this is not train blocker since on itwiki we're still at wmf.4; until proven otherwise, of course.

Totally random idea: are cache entries deleted after they expire?

I feel this is a cache problem, but I really cannot explain it (especially because it appeared suddenly, and not during train rollout).

Little update: this still isn't working. I tried to re-enable the filter, and the first edit was caught. I also couldn't find anything happened yesterday which could have caused this.

Another update: this still isn't working. However, I created a simple throttling filter on testwiki (https://test.wikipedia.org/wiki/Special:AbuseFilter/187) and it worked as expected. Sounds like the throttle cache key isn't getting deleted, or something like that. Given the lack of anything relevant in the SAL, may I ask if there has been some secret deployment or some cache-related maintenance operation on November 28th? Without other infos I have to assume this is related to the train, even though it happened before the train reached itwiki.

EDIT: Tested directly on itwiki with my own account, the result is the same from testwiki: everything works as expected. Now it definitely sounds like cache keys haven't been deleted. Is there a way to check for it?

I don't see anything in the SAL around that time period either.

Your suspicion that it had to do with the throttle cache key not expiring seems reasonable. The logs for that channel don't go back to the 28th, so I can't fulfill that request. I see the throttle keys for the itwiki filter you linked will begin with "itwiki:abusefilter:throttle:320:", for which I see the following log messages:

2018-11-30 13:29:33 [XAE7PQpAADsAAJH21zgAAABC] mw1264 itwiki 1.33.0-wmf.6 AbuseFilter DEBUG: Got value 18 for throttle key itwiki:abusefilter:throttle:320:user:REDACTED  
2018-11-30 13:29:33 [XAE7PQpAADsAAJH21zgAAABC] mw1264 itwiki 1.33.0-wmf.6 AbuseFilter DEBUG: Incremented throttle key itwiki:abusefilter:throttle:320:user:REDACTED  
2018-11-30 13:29:33 [XAE7PQpAADsAAJH21zgAAABC] mw1264 itwiki 1.33.0-wmf.6 AbuseFilter DEBUG: Throttle itwiki:abusefilter:throttle:320:user:REDACTED hit value 19 -- maximum is 5.  
2018-11-30 13:29:43 [XAE7RwpAICAAAAoduiEAAAAQ] mw1330 itwiki 1.33.0-wmf.6 AbuseFilter DEBUG: Got value 19 for throttle key itwiki:abusefilter:throttle:320:user:REDACTED  
2018-11-30 13:29:43 [XAE7RwpAICAAAAoduiEAAAAQ] mw1330 itwiki 1.33.0-wmf.6 AbuseFilter DEBUG: Incremented throttle key itwiki:abusefilter:throttle:320:user:REDACTED  
2018-11-30 13:29:43 [XAE7RwpAICAAAAoduiEAAAAQ] mw1330 itwiki 1.33.0-wmf.6 AbuseFilter DEBUG: Throttle itwiki:abusefilter:throttle:320:user:REDACTED hit value 20 -- maximum is 5.  
2018-12-02 09:28:20 [XAOltApAMFQAALhuVUkAAAAJ] mw1249 itwiki 1.33.0-wmf.6 AbuseFilter DEBUG: Got value 20 for throttle key itwiki:abusefilter:throttle:320:user:REDACTED  
2018-12-02 09:28:20 [XAOltApAMFQAALhuVUkAAAAJ] mw1249 itwiki 1.33.0-wmf.6 AbuseFilter DEBUG: Incremented throttle key itwiki:abusefilter:throttle:320:user:REDACTED  
2018-12-02 09:28:20 [XAOltApAMFQAALhuVUkAAAAJ] mw1249 itwiki 1.33.0-wmf.6 AbuseFilter DEBUG: Throttle itwiki:abusefilter:throttle:320:user:REDACTED hit value 21 -- maximum is 5.  
2018-12-02 09:28:28 [XAOlvApAMEwAAAYdMB4AAACG] mw1241 itwiki 1.33.0-wmf.6 AbuseFilter DEBUG: Got value 21 for throttle key itwiki:abusefilter:throttle:320:user:REDACTED  
2018-12-02 09:28:28 [XAOlvApAMEwAAAYdMB4AAACG] mw1241 itwiki 1.33.0-wmf.6 AbuseFilter DEBUG: Incremented throttle key itwiki:abusefilter:throttle:320:user:REDACTED  
2018-12-02 09:28:28 [XAOlvApAMEwAAAYdMB4AAACG] mw1241 itwiki 1.33.0-wmf.6 AbuseFilter DEBUG: Throttle itwiki:abusefilter:throttle:320:user:REDACTED hit value 22 -- maximum is 5.  
2018-12-02 09:28:48 [XAOl0ApAICEAAH6UqzcAAACA] mw1331 itwiki 1.33.0-wmf.6 AbuseFilter DEBUG: Got value 22 for throttle key itwiki:abusefilter:throttle:320:user:REDACTED  
2018-12-02 09:28:48 [XAOl0ApAICEAAH6UqzcAAACA] mw1331 itwiki 1.33.0-wmf.6 AbuseFilter DEBUG: Incremented throttle key itwiki:abusefilter:throttle:320:user:REDACTED  
2018-12-02 09:28:48 [XAOl0ApAICEAAH6UqzcAAACA] mw1331 itwiki 1.33.0-wmf.6 AbuseFilter DEBUG: Throttle itwiki:abusefilter:throttle:320:user:REDACTED hit value 23 -- maximum is 5.

When I poke at that key with mwrepl, it claims to have no expiry set:

hphpd> $key = 'itwiki:abusefilter:throttle:320:user:REDACTED';
hphpd> $w = Wikimedia\TestingAccessWrapper::newFromObject( ObjectCache::getMainStashInstance() );
hphpd> list( $server, $conn ) = $w->getConnection( $key );
hphpd> =$conn->get( $key )
"23"
hphpd> =$conn->ttl( $key )
-1

Since the count is currently at 23 (and it's a user trigger), I'd guess we should look at the 23rd hit back for that filter+user in abuse_filter_log to guess that the bad cache key was created on 2017-01-10. Were there any bugs active back then that would have caused the throttle key to have been created with no expiry timestamp?

If nothing else, someone could go in with mwrepl and execute ObjectCache::getMainStashInstance()->delete( $key ) for any affected keys.

Daimona lowered the priority of this task from Unbreak Now! to High.Dec 10 2018, 8:53 PM

@Anomie Many thanks for the analysis! This explains many things, for instance why the bug cannot be triggered for another filter/user; this also allows us to lower priority (although it remains important to fix it). Looking at the logs, it seems that in fact the key was created on January 10th 2017, which is a lot of time ago! On that day (and the one before) the user triggered the filter hundreds of time, but then the filter stopped disallowing the edit, which is weird. Tomorrow I'll go ahead and check what happened those days. The problem could actually come either from AbuseFilter, from core, or even from something else, so it won't be easy to find the root case.

I took a look, but nothing came out. The MW version at the time was 1.29.0-wmf.7 (since Jan 5th), and remained in place until Jan 19th (so no train that week). I also checked backports to wmf.7, but couldn't find anything interesting there (either in core or AbuseFilter). However, if the key was created (without expiry) on 2017-01-10, it would mean that the filter hasn't worked for almost two years, albeit having a key already set to the max value (and not being modified!); then, it resumed working with apparently no reason. This doesn't make sense! I must also note that the user in question has high edit rates, and even higher filter-hitting rates (not "hundreds" like I said yesterday, but still they hitted the filter roughly 150 times within two days), which makes scanning edits and abuselog pretty difficult.
I'm at my wits' end, and it's such a shame not to have the debug log for Nov 28th, which could have helped to understand what happened. My last raw guess is that it could partly have to do with T211101, although I don't see what may have caused ambiguity here, and it wouldn't completely explain the problem neither (not the infinite TTL at least).
My final thought is that lately I've been fixing some logic related to throttling, although I cannot see how it could have to do with this bug; maybe we could've had a situation where the throttle period wasn't correctly specified when editing the filter, and due to the lack of validation we ended up with a 0, which in turn was taken as "no expiry" (is that even possible?), but again this shouldn't be the case because throttle parameters have always been correct for that filter.
If no other information can be retrieved, then we could just go on and delete the key to get the filter back working, although I'm still a bit worried. BTW, I guess the debug log only has a lifetime of 10 days, but maybe other channels (like PHP warnings/notices/errors) have a longer one; if so, could you please check if there have been any warning due to AbuseFilter on Nov 28th around 16.30-17.00 UTC?

and due to the lack of validation we ended up with a 0, which in turn was taken as "no expiry" (is that even possible?),

0, empty string, null, and false all seem to be interpreted as "no expiry". So pretty much anything that could have resulted in $ratePeriod not being correctly set would have that effect.

If no other information can be retrieved, then we could just go on and delete the key to get the filter back working, although I'm still a bit worried. BTW, I guess the debug log only has a lifetime of 10 days, but maybe other channels (like PHP warnings/notices/errors) have a longer one; if so, could you please check if there have been any warning due to AbuseFilter on Nov 28th around 16.30-17.00 UTC?

Searching log entries tagged with wiki:itwiki and containing "AbuseFilter", I see only the "cache hit" and "cache miss" messages from AbuseFilter::filterAction().

Expanding it to all wikis, there are also

  • A bunch of DBPerformance warnings whining about "Expectation (masterConns <= 0)".
  • A login on enwiki for someone in the "abusefilter" group.
  • Two "PHP Notice: Undefined index: 1" from /srv/mediawiki/php-1.33.0-wmf.6/extensions/AbuseFilter/includes/AbuseFilter.php:932, one on dewiktionary at 16:56:04 and one on ruwiktionary at 16:56:01.

Those "Undefined index: 1" warnings on that line would have caused $ratePeriod to be null, but obviously they're on the wrong wiki to have affected your itwiki filter.

Daimona lowered the priority of this task from High to Medium.Dec 11 2018, 4:51 PM

0, empty string, null, and false all seem to be interpreted as "no expiry". So pretty much anything that could have resulted in $ratePeriod not being correctly set would have that effect.

That makes sense. However, as I was saying above, the filter seems to have valid parameters, and they also seem to be valid in every history entry for that filter. Apparently, nothing else in the logs could explain the problem here.
As for the undefined index, that indeed would explain the problem! That problem is T203535, which will be resolved after T209565.
Anyway, I think we are lacking info to understand this problem. Mainly, knowing when the key was created could help a lot, and makes a big difference:

  1. If it was created on Jan 10th 2017 (i.e. 18 hits before the first abuselog entry logged in debug), then we would have to understand why the filter didn't work for almost 2 years, albeit with a throttle value already to the maximum (and no changes in the meanwhile)
  2. If it was created on Nov 28th (i.e. 18 edits - or so - before that last entry), then we would have to understand why it got no expiry set, without the codebase being changed at all, and in a non-reproducible fashion

(1.) makes more sense, but is harder to understand, especially because it happened a long ago. I'm leaving the task open in case someone comes out with an idea, or with a new fresh example, or something else, but I don't really hope we'll get an answer. In the meanwhile, I think you can safely delete the key (if you haven't already).

In the meanwhile, I think you can safely delete the key (if you haven't already).

I haven't.

@Anomie have you perchance deleted the key? I reenabled the filter and it still doesn't work. If you did delete it, this would now be a fresh example!

Oh, false hope then :/ Thanks anyway.

I deleted the key.

Or at least 1 key. No guarantee its the only key with this bug.

@Bawolff Thanks! True, although I'm unsure how they could be identified. If there's a way to find them, any key matching the pattern [a-z]+:abusefilter:throttle:[0-9]*:.* without expiry suffers from this bug. If such search isn't feasible, then another idea could be to find keys without expiry created around 10 Jan 2017.

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:08 PM
CCicalese_WMF claimed this task.
CCicalese_WMF subscribed.

Reopen if this is still happening.