Page MenuHomePhabricator

Investigate whether the AbuseFilter can specify which revisions to execute a rule on
Closed, ResolvedPublic5 Estimated Story Points

Description

Motivation
The AbuseFilter is an extension that checks if files should be uploaded to Commons or if they don't fulfil the project's criteria. One such rule is for example the "0-Byte-Rule", that disallows files where the file page is empty. While this makes perfect sense for the most recent revision, since the FileImporter moves many revisions it would be great if rules in the AbuseFilter could be written in such a way, that it is clear if this rule should apply to all revisions of a file or only the most recent one.

Acceptance Criteria

  • Investigate whether it will be possible to provide a way that rules of the AbuseFilter extension can include that they should only be carried out for the most recent revision
  • Discuss idea with AbuseFilter extension maintainers

Notes

  • It might be specifically from FileImporter for FileImporter, but even better (if same effort applies):
  • It might be totally independent from, but also applicable for the FileImporter, talking about “this should happen on old revisions, no matter where it was imported from”

Event Timeline

Feel free to ask anything about AbuseFilter, I'm glad to help. I have to say that I don't really know what FileImporter does (aside from what the name suggests), so I may be missing details. I think this goal could be achieved with AF hooks, however it strongly depends on how FileImporter performs the import. I think at the very least it should be possible to add a variable to tell whether the current upload is part of a FileImporter batch.

Lea_WMDE set the point value for this task to 5.

@Daimona Hi, thanks for the offer! The docs for FileImporter start here, but long story short we're importing File pages including their entire revision history, moving to commonswiki for the Wikimedia use case. We call AbuseFilter on each revision by manually invoking the EditFilterMergedContent hook.

The effect we're looking for is that a historicRevision variable is set to true and available to AF rules.

I'm imagining that we might have to cache specific the specific revision IDs in the importer, and our AF hook would check the revision ID (available from RC row?) against that cache.

Just to write down my first round of investigation and digging into the code.

If I got this right, there are several ways to add parameters and variables to the code being filtered by e.g. using one of the following hooks:

  • AbuseFilter-filterAction
  • AbuseFilter-generateTitleVars
  • AbuseFilter-generateUserVars

The hooks allow us to add custom parameters and get the content of the current parameters. Since we only want to add the oldrevision-import ( or however we want to call it ) variable only in certain cases and not to all filter runs, we must find a way how to manipulate the filters only when they were triggered by a FileImporter revision. And as I see it at the moment, that's where the problem is situated.

Our 'entry point' for the filters is EditFilterMergedContent[1] so only things we pass in there could later be used to decide if a filter was triggered by the FileImporter of from somewhere else. - But EditFilterMergeContent does not give us any parameter where we could pass that information, that could then end up in the AF variables.

[1] https://gerrit.wikimedia.org/g/mediawiki/extensions/FileImporter/+/d241fe3faa1616df8e857adf7c59939e1fa982a8/tests/phpunit/Services/FileTextRevisionValidatorTest.php#49

I added some debugging to dump the available vars being sent to AbuseFilter-filterAction, evaluating all the lazy vars using $vars->dumpAllVars(). The results look promising,
Text revision:


File revision:

Text revision added_lines actually begins with the text "<!--This file was moved here using FileImporter"... page_title is also available, for quick matching against a cache in order to optimize for non-FileImporter revisions. I don't see anything else that would be easy to use to distinguish between "historic" and "current" revisions, but since we have the post-FI content it would be easy to modify our comment for the current revision, to say "file was moved here using... this was the current revision at the time of import."

For file revisions, we have file_sha1 which should be plenty to match against a cache.

I was asked to quickly chime in on this, AFAIS the best way for you would be to hook into AbuseFilter-filterAction and then figure out your context based on the other variables present (yeah this is not exactly nice :/).

Sorry for being late! @WMDE-Fisch's analysis is totally correct. However, I'd suggest keeping the added_lines check as a last resort. That's something which would be very easy to spoof (although there's no obvious benefit in doing that).
Some ideas used for other variables are checking the user agent (to tell whether the mobile app is being used, code), or using a special Context (to tell whether we're on the mobile site, code).
Although I guess none of these fits this use case.

As for the hook to use: it depends on whether this information must be available for past edits. Or actually, as this is the practical criterion, on whether it's possible to tell if an already-saved revision was created by FileImporter, using only information available from a RecentChanges row. If it's possible (and I guess in this case it is), then you should use AbuseFilter-generateTitleVars, otherwise AbuseFilter-filterAction.

However, I'd suggest keeping the added_lines check as a last resort. That's something which would be very easy to spoof (although there's no obvious benefit in doing that).

Thanks, good point--that's reason enough to drop this line of investigation.

Special context sounds like an interesting workaround...

Change 517615 had a related patch set uploaded (by WMDE-Fisch; owner: WMDE-Fisch):
[mediawiki/extensions/FileImporter@master] [PoC] Add rule to be used in AbuseFilter filters

https://gerrit.wikimedia.org/r/517615

https://gerrit.wikimedia.org/r/517615

FYI: The above is a PoC how to do the requested using the cache to identify the revisions of interest.

@WMDE-Fisch Great! I think your POC is already functional; however, looking at the code and per @awight above, I'm also curious about exploring the possibility of a special context. Or actually, I think you could use a little trick with the FauxRequest: adding a bogus field in FileTextRevisionValidator::__construct (so to have e.g. new FauxRequest( [ 'fileImporter' => true ] )), and then inside the hook handler do something like $context->getRequest()->getRawVal( 'fileImporter' ) to determine what to set the fileimporter variable to.
Plus, a couple of things to fix in the patch:

  1. All extension-defined variables should be always available. So, when returning early, you should first set the fileimporter variable to false.
  2. (Just a remark) None of the proposed methods (cache, special context, and bogus request field) will allow determining whether a past edit was made via FileImporter, but that's totally fine (it just means that you have to stick to the filterAction hook).
  3. You also have to use the AbuseFilter-builder hook in order to declare that you're introducing a new variable. See for instance here. The array key is the variable name, and the value is its i18n key (see next point).
  4. You'll also need a message in i18n for the filter editor dropdown. The shape is "abusefilter-edit-builder-vars-KEY", where KEY is the one you chose at the previous point. See for instance MobileFrontend.
  5. I think "fileimporter" is fine as variable name; but if you want for instance to make it more verbose, the convention is to use lowercase+underscores.

Heyho @WMDE-Fisch and @awight,
could you help me understanding the result of the investigation? Do I understand it right, that

  • it is possible to change the AbuseFilter so that it is not blocking FileIMporter events in all cases
  • people with rights of changing rules in the AbuseFilter would be abler to use a special keyword when defining/modifying an AbuseFilter rule, e.g. something like "ExecuteThisOnOldRevisionsWhenImportingWithFileImporter = false"
  • if people are not using the keyword, we stick to the current behavior.

If that is correct, as part of the investigation (or prep for sprint planning), could you tell me the estimated story points to do so?
Thanks!

Heyho @WMDE-Fisch and @awight,
could you help me understanding the result of the investigation? Do I understand it right, that

I can try to answer the first part.

  • it is possible to change the AbuseFilter so that it is not blocking FileIMporter events in all cases

This can only be done within FileImporter's codebase, although I'd recommend against it, and IIUC it's not the idea behind this task.

  • people with rights of changing rules in the AbuseFilter would be abler to use a special keyword when defining/modifying an AbuseFilter rule, e.g. something like "ExecuteThisOnOldRevisionsWhenImportingWithFileImporter = false"

Yes, this is almost correct. More precisely, such keyword will only tell whether the current edit comes from FileImporter or not, and won't provide any additional information about the revision; however, I guess it would be possible to add another keyword to distinguish old revisions if needs be.

  • if people are not using the keyword, we stick to the current behavior.

Yes. Which means, the edit will be treated as non-FileImporter. Unless people change existing filters where necessary.

Our metrics show very few abusefilter errors, so we've deprioritized this work.

Our metrics show very few abusefilter errors, so we've deprioritized this work.

If user already knows that they'll run into an error when importing certain files and hence avoids these files until the tool has been improved, then this might contribute to lower number of errors, too. :)

I just deliberately ran into an error (twice) by trying to import a file with OTRS tag. These errors seem to have fallen into "1_2" category on metrics board, rather than into "abusefilter_warning_otrs" or "abusefilter_disallowed". Is this expected?

Our metrics show very few abusefilter errors, so we've deprioritized this work.

If user already knows that they'll run into an error when importing certain files and hence avoids these files until the tool has been improved, then this might contribute to lower number of errors, too. :)

I just deliberately ran into an error (twice) by trying to import a file with OTRS tag. These errors seem to have fallen into "1_2" category on metrics board, rather than into "abusefilter_warning_otrs" or "abusefilter_disallowed". Is this expected?

Thanks for running that test! I noticed the 1_2 error, and was... quite surprised. That shouldn't have happened, it must be a bug in how we map exceptions to error strings. We can leave the work to fix as part of this task, and will watch for more of these garbage errors.

WMDE-Fisch removed WMDE-Fisch as the assignee of this task.

I would say the investigation is done.

Change 517615 abandoned by WMDE-Fisch:

[mediawiki/extensions/FileImporter@master] [PoC] Add conditional to be used in AbuseFilter filters

Reason:

Cleaning up. Not sure if we ever want to do that/it should be done differently.

https://gerrit.wikimedia.org/r/517615