Page MenuHomePhabricator

Improve/Dispose AbuseFilter checks on old text revisions
Open, Needs TriagePublic

Description

Currently the FileImporter applies AbuseFilter checks for the contents of every text revision imported. - This was done to make sure, that abusive content - even if it is not part of the current revision - will not be imported to the target wiki.

From T206486 we got an example where an AbuseFilter -rule on Commons was triggered, because one of the revisions contained empty text and the rule makes sure, that empty text cannot be submitted. When "just" an old revision is affected by having no text this does not really seem to be an issue that should stop the import.

The example shows that in some cases it might not make sense to run the AbuseFilter checks for every "old" text revision. Some ideas how to deal with that:

  • dispose the checks for old revisions.
  • partly dispose the checks (e.g. only check the summary lines of old revisions but not the text)
  • find a way to tell the AbuseFilter that we're in the middle an import process and that we're dealing with "old" text revisions

Probably the last point is the best way to go here. Then the maintainers of the filter rules can decide for each rule what to do in our case. We're just not entirely sure yet if this is doable and how it should be done best.

Event Timeline

Can this tool please be configured to not import under 10 bytes file description page revisions, or at least to pad the ones under 10 bytes with spaces or the 10 byte string "Padded4AF4"? Alternatively, can it be configured to back off importation of revisions which fail any AF and import the rest (unless the latest revision fails)?

Remarks on my encounter with another abuse filter: filter 69 is set to first warn and then allow/tag the edit. In FileImporter however it doesn't allow the change. If I click "Import" repeatedly then I'm still stuck with the warning. Unlike the current situation, I'd expect that if OTRS template is not changed during the import then the abuse filter is not triggered at all.

I was able to import this file by removing the OTRS template during import (had to re-add it after import). To reproduce this you can use some other file with OTRS template, like this one.

I also observe that during import this filter tags my edit for adding OTRS template, though instead I removed it. It would make more sense if older versions that actually include the OTRS template were tagged instead. Furthermore it would make even more sense if the checks would note that the user who initially added the OTRS template actually is an OTRS member (in global group), the way it works for new revisions outside FileImporter. So the situation seems quite messy due to checks for old revisions.

Perhaps "abusive content" in older revisions isn't much of a problem and, in case this is an easy solution, checks for older revisions can be simply dropped after all?

I explored a few more ideas:

Skip old revisions

We think this is not a good idea. Old revisions that are not hidden/suppressed can contain spam, personal, or legally relevant information. The rules what should be suppressed are not the same on all wikis, not even on all Wikimedia wikis. Something that's ok-ish on the source wiki might be a problem on Commons.

Wait, are old revisions also checked when using other import methods, you might ask. No, they aren't. Why is this a problem for FileImporter then? Because FileImporter can be used by basically all users, but Special:Import can't.

Skip empty revisions

The idea is to not run AbuseFilter on revisions that don't contain anything anyway. (Except the most recent revision, which is always checked.) If there is no content, it can't contain problematic stuff, right? The only problem here is that this will skip other checks as well. For example, the summary line is not checked any more. The same legal problems as above apply.

Pad empty revisions

The idea is to always pass something to AbuseFilter that is not empty. This non-empty replacement won't be stored, it's just temporarily used to trick AbuseFilter. We can either use a meaningless string that is unlikely to trigger any filter (e.g. "This revision is actually blank, but FileImporter added this sentence to not trigger certain AbuseFilter filters."). Or we can use the content from the previous non-empty revision, or from the last revision (remember, the last revision should always be checked). Why check these revisions twice, you ask? Well, that's indeed redundant, but not the point. We don't know what all the filters do. Some might check the summary or the user, some might check combinations of fields, e.g. the previous vs. the current revision. With this idea here we still pass everything to all filters, except the empty content (which really can't trigger anything but a filter that checks if the content is empty).

What about the 10 bytes limit?

We can modify the idea above and append the content from the last revision to every small revision. This way the short text is still checked, but is long enough. Let's set the limit to 101, just to be sure (because of filters like this).

The only new issue this can introduce is when a filter checks for duplicate content or something similar that is only triggered when we combine 2 texts, but not triggered by 1 text alone. To reduce this risk we can use the lowest possible limit of 10.

Other users of the "AbuseFilter hook"

I reviewed https://codesearch.wmcloud.org/search/?q=EditFilterMergedContent and looked into what extensions actually do with the EditFilterMergedContent hook:

Callers
  • Core runs the hook only when something new is saved, never on old stuff (e.g. when importing something). The only exception is when the content model of a page is changed. This creates a new revision (which means this isn't really an exception) with the old content, but the new content model, and runs the hook again.
  • Wikibase runs the hook (because it is not using core's EditPage), but only on new revisions, as it is meant to be used.
  • GrowthExperiments does the same.
Syntax checks
Spam filters
  • AbuseFilter filters can, by design, check basically everything, including the summary, the user, and even the previous revision. Which means, for example, when we skip a check just because the content is empty, we skip checking the summary line as well.
  • MediaWiki-extensions-AkismetKlik is a spam filter that checks everything as well, similar to AbuseFilter. However, it never checks the summary.
  • ConfirmEdit (CAPTCHA extension) brings up a captcha in many different situations, e.g. when weblinks are added. As far as I can see it never checks the summary.
  • SpamBlacklist checks only weblinks, and ignores the rest of the wikitext, but includes the summary (in case it's a weblink).

Old revisions that are not hidden/suppressed can contain spam, personal, or legally relevant information. The rules what should be suppressed are not the same on all wikis, not even on all Wikimedia wikis. Something that's ok-ish on the source wiki might be a problem on Commons.

Almost all of the edit filters on Commons deal with petty vandalism, syntax errors, and user group enforcement. None of that matters unless it's in the current revision. If, for example, a vandal changed the file description to "it sucks" and was quickly reverted, there's no problem with that remaining in the history. However, FileImporter will prevent the file from being imported, even though that filter is warn-only. The revision deletion deletion of that sort of petty vandalism would likely be outside of policy as well.

The best way to fix this would be to expose FileImporter information to the filters themselves. T253876: Offer variables that surface the import steps to the AbuseFilter would be one way to do this, but would require every filter that does not need to check old revisions (which is most of them) to introduce a new condition. This increases complexity and maintenance overhead.
However, most of those filters are already filtering on the action variable. If FileImporter imports used an action other than 'edit', the majority of filters that do not apply would work as expected. Only the filters that do need to check imported revisions would need to be changed. (not sure how feasible implementation would be though)

I think all filters that apply to uploads should continue to apply, however. The content in file revisions is more visible, and the filters that target uploads are typically dealing with copyright violations.

Thanks for the helpful insight. However, we are at least partially talking about different things here.

  • Warnings do not block imports any more: T253872. This should already be deployed to the live site. We are only talking about blocking filters here.
  • I have to disagree with the idea that we can ignore old revisions, as already mentioned above. Examples:
    • Old revisions can not only contain profanity, but information that is actually legally relevant. Stuff somebody can be sued for. It might be that there is currently no filter for something like this. But we can't know if it will always be like this.
    • An attacker wants to upload a non-free file that would be blocked. It would be possible to do this by uploading the file to another wiki, adding a harmless second revision, and importing both to Commons. The non-free file can still be accessed. Somebody can still be sued. It's really hard to find something like this when it's hidden in an imported file history, and never showed up on any monitoring page.
    • An attacker can upload a large amount of spam to some unmonitored wiki, add a harmless second revision to every file, and import them all to Commons.

If warn-only filters work correctly, this is much less of an issue. However, I still think that this should be implemented with as few changes required to existing filters. The vast majority of disallow filters on Commons only make sense if they are targeting an unreverted revision. Requiring that they check another AF variable would add unnecessary complexity and create confusion.

Perhaps it would make sense to (also) show AbuseFilter a combined diff from a blank page to the current text? That would catch any unreverted issues.

It is correct that it would be possible for an attacker to upload bad stuff to xx.wiki and then import it to Commons. But FileImporter only works for AutoConfirmed users? So an attacker would have to make many good edits on xx.wiki and Commons before it will work?

So what if we either 1) only allow Commons admins to import those files or 2) upload them but add a special template + make an abusefilter that prevent users to remove the template unless they are admins?

If we chose 1) then users can just ask an admin to import it and if we chose 2) then admins have to remove the template once they verified that the file is okay.

Autoconfirmed on Commons and most other wikis have no minimum edit requirement -- just a 4-day-old account. Only a few wikis (arwiki, eswiki, ewswikivoyage, itwiki, nowiki, wikidatawiki, zhwiki, zh_classicalwiki, zhwikinews) have a required edit count greater than 10.