Page MenuHomePhabricator

Global filters cache key not being invalidated
Open, Needs TriagePublic

Description

AbuseFilter uses the global:abusefilter:rules:CENTRALWIKINAME:GROUPNAME cache key to store global filters. The key has a TTL of one week (see getWithSetCallback call), but it's touched every time a global filter is changed on the central wiki (see code).

While testing T253181 on the beta cluster, I added a new global rule on deploymentwiki, but it didn't work on enwiki, despite deploymentwiki being the central DB. So I used shell.php on deployment-deploy01.eqiad.wmflabs, and got the following:

>>> $gbr = AbuseFilter::getGlobalRulesKey('default');
=> "global:abusefilter:rules:deploymentwiki:default"
>>> $cache = \MediaWiki\MediaWikiServices::getInstance()->getMainWANObjectCache();
=> WANObjectCache {#391}
>>> $ttl = null;
=> null
>>> $cache->get( $gbr, $ttl )
[ Long list of filters, NOT including the filter I had just modified ]
>>> $ttl
=> 459577.06165004

I then used

>>> $cache->touchCheckKey( $gbr )
=> true
>>> $cache->get( $gbr, $ttl )
 [ Same list as above ]

and then, (unsure when exactly) the filter got added to the cached list.

Is the delay expected, or is something not working (either in AF or WanObjectCache)?

Event Timeline

Also, I just realized that if I change the "consequences" section, the new actions take effect immediately, but with the old rules! This is probably because the data is stored in two different tables, and also, the consequences aren't cached. Long story short, this edit caused several users to be blocked. Perhaps this would deserve its own task (?)

Probably me misunderstanding, but as enwiki is not a global abuse filter wiki, why would you expect a change at enwiki?

Probably me misunderstanding, but as enwiki is not a global abuse filter wiki, why would you expect a change at enwiki?

Ah, I'm sorry, I haven't been clear in the task description. I meant beta-enwiki (i.e. on the Beta-Cluster-Infrastructure), where global filters are enabled for all wikis, and the central wiki is deploymentwiki.

I am not sure but I believe this uses touchCheckKey incorrectly.

The use of check keys is mainly for being able to purge a large group of keys at once, by "touching" a common "check key". For example, if you have lots of keys in the style foo:bar:<something> then you could give getWithSet or get also key foo:bar through to the checkKeys option. And if that key was recently touched, it will act as if the real key given to getWithSet had a cache miss and act accordingly.

To trigger that, you only have to call touchCheckKey with foo:bar. It seems the checkKeys option is not being set in the get call. As such, the touch has no effect, I think?

Alternatively, if no key sharing or touch propagation is needed, then you may want to use purge() instead to simply purge the relevant key directly.

Thanks for the explanation. Frankly, I don't have enough knowledge of WANCache to tell exactly what needs to be done here. Using purge() seems fine, but I can't tell for sure. I'm CC'ing @aaron who introduced the checkKeys logic in AF with rEABF60d1eebb6b8a91b9f471b5de2097b7e248c3a7b9 (a few years ago...)

@Daimona I've confirmed with Aaron that the checkKeys option needs to be set correctly. Looking at the code though, this appears to be the case already in latest master:

	protected function getGlobalFilters() : array {
		$globalRulesKey = AbuseFilter::getGlobalRulesKey( $this->group );
		$fname = __METHOD__;
		return MediaWikiServices::getInstance()->getMainWANObjectCache()->getWithSetCallback(
			$globalRulesKey,
			WANObjectCache::TTL_WEEK,
			function () use ( $fname ) {
				
			},
			[
				'checkKeys' => [ $globalRulesKey ], # <!-- this line
				'lockTSE' => 300,
				'version' => 1
			]
		);
	}

When reducing the call to just $cache->get(key) you are asking for the internal value stored in Memcached, with the "inherited" checkKey which lets you purge it.

> $gbr = AbuseFilter::getGlobalRulesKey('default');
>  $cache = \MediaWiki\MediaWikiServices::getInstance()->getMainWANObjectCache();
> $cache->touchCheckKey( $gbr );
> $ttl = null; return is_array($cache->get( $gbr, $ttl, [$gbr] )); # <! $checkKeys param
bool(true)
> return $ttl;
-2.6161530017853

This returns a negative TTL as expected, which getWithSetCallback would have perceived as cache miss with regeneration callback being run to produce a new value.

Krinkle edited projects, added Performance-Team (Radar); removed Performance-Team.
Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board.

@Krinkle: So the AF code is correct? If so, my next question is: is it expected to observe a delay between the touchCheckKey() call and the values being updated? Because in my test on the beta cluster, I changed a global rule, made a few edits, but the rules weren't being updated. In fact, it kept using the old rules for roughly 10 minutes.

Yeah, so long as the actual getWithSet call uses [ 'checkKeys' => … ] then it should be fine.

Can you reproduce this issue in Beta Cluster from the command-line, like in the task description, but with proper checkKeys calls?

I was just able to reproduce the issue again. I used https://deployment.wikimedia.beta.wmflabs.org/wiki/Special:AbuseFilter/28 and https://en.wikipedia.beta.wmflabs.org/wiki/User:Daimona_Eaytoy/sandbox

  • I disabled the filter at 17:24 UTC (this should have called touchCheckKeys)
  • It kept catching edits on my sandbox until 17:40 (and then I stopped trying). Note that every edit calls getGlobalFilters(), which includes getWithSetCallback() (the snippet at T253213#6204057)

Then with shell.php for beta-enwiki:

>>> $gbr = AbuseFilter::getGlobalRulesKey('default');
=> "global:abusefilter:rules:deploymentwiki:default"
>>> $cache = \MediaWiki\MediaWikiServices::getInstance()->getMainWANObjectCache();
=> WANObjectCache {#429}
>>> $ttl = null;
=> null
>>> $cache->get( $gbr, $ttl )
[ List of filters INCLUDING filter 28 ]
>>> $ttl
=> 579859.85870409
>>> $cache->touchCheckKey( $gbr )
=> true

And the next edit (at 17:42) was not caught! Retrying $cache->get() returned the updated list, NOT including filter 28.

Tomorrow I can try adding a debug print on the touchCheckKey call in the code to see if it's actually triggered. FWIW, I confirm that the line is executed locally, but I didn't replicate the beta cluster situation exactly.

@Daimona That last snippet is still calling $cache->get() incorrectly, which is not what the real code does. It needs to use get( …, [ 'checkKeys' => … ]);.

Given you are able to observe it on the site itself, it suggests this might not make a difference, but it will at least help confirm whether it's reproducible on CLI or not.

@Daimona That last snippet is still calling $cache->get() incorrectly, which is not what the real code does. It needs to use get( …, [ 'checkKeys' => … ]);.

Ah, I missed that part.

Given you are able to observe it on the site itself, it suggests this might not make a difference, but it will at least help confirm whether it's reproducible on CLI or not.

I agree. I don't know how to add a debug call in the code on the beta cluster (vim scared me with a readonly alert), so I haven't tried that part. However:

>>> $gbr = AbuseFilter::getGlobalRulesKey('default');
=> "global:abusefilter:rules:deploymentwiki:default"
>>> $cache = \MediaWiki\MediaWikiServices::getInstance()->getMainWANObjectCache();
=> WANObjectCache {#292}
>>> $ttl = null;
=> null
>>> foreach ($cache->get( $gbr, $ttl ) as $e) {var_dump($e->af_id);}
string(2) "11"
string(2) "13"
string(2) "17"
string(2) "20"
string(2) "24"
string(2) "25"
string(2) "26"
string(2) "33"
string(2) "34"
string(2) "35"
string(2) "36"
string(2) "37"
string(2) "38"
string(2) "39"
>>> $ttl
=> 596848.44011998
>>> foreach ($cache->get( $gbr, $ttl, [$gbr] ) as $e) {var_dump($e->af_id);}
string(2) "11"
string(2) "13"
string(2) "17"
string(2) "20"
string(2) "24"
string(2) "25"
string(2) "26"
string(2) "33"
string(2) "34"
string(2) "35"
string(2) "36"
string(2) "37"
string(2) "38"
string(2) "39"
>>> $ttl
=> 596739.14860201

So no filter 28 in the list. I then kept trying edits on enwiki, but the filter still didn't catch those until 10:32. I then touched the key manually:

>>> $cache->touchCheckKey($gbr);
=> true
>>> foreach ($cache->get( $gbr, $ttl, [$gbr] ) as $e) {var_dump($e->af_id);} #### Immediately after the call above
string(2) "11"
string(2) "13"
string(2) "17"
string(2) "20"
string(2) "24"
string(2) "25"
string(2) "26"
string(2) "33"
string(2) "34"
string(2) "35"
string(2) "36"
string(2) "37"
string(2) "38"
string(2) "39"
>>> foreach ($cache->get( $gbr, $ttl, [$gbr] ) as $e) {var_dump($e->af_id);} #### 30-60 seconds after the command above
string(2) "11"
string(2) "13"
string(2) "17"
string(2) "20"
string(2) "24"
string(2) "25"
string(2) "26"
string(2) "28"
string(2) "33"
string(2) "34"
string(2) "35"
string(2) "36"
string(2) "37"
string(2) "38"
string(2) "39"

There've been a tiny lag between the touch and the value being regenerated, but that's fine. It caught an edit between the two $cache->get calls. So it seems to be consistently reproducible from CLI, too (unless I did something wrong).

[…] I don't know how to add a debug call in the code on the beta cluster (vim scared me with a readonly alert), so I haven't tried that part.

The /srv/mediawiki in Beta (both deploy01 and mediawiki07) are owned by mwdeploy. Use sudo -iu mwdeploy or some such to act as that user. The sudo for this is granted to any bet admin. If you're not a beta admin currently, ask someone (done!).

Mentioned in SAL (#wikimedia-releng) [2020-06-26T12:20:31Z] <Daimona> Adding debug logging on beta-deploymentwiki for T253213

So I tried sudoing, and was able to edit the file and add debug logging. However, it didn't show up on logstash, because channel:AbuseFilter is missing in the config. I also tried SSHing to fluorine02, but the logs aren't there either. So I went back to just revert my changes to the AF files, and found out that the changes were already gone. I'm unsure what happened :-/

[…] I went back to just revert my changes to the AF files, and found out that the changes were already gone. I'm unsure what happened :-/

There is a Jenkins job for Beta Cluster that, every 15 min, deploys the latest changes from the main branch on Gerrit.