Remove ContentTranslation code that emulates AbuseFilter, because it's hard to maintain
Open, High, Public

Description

ContentTranslation has a dedicated class, AbuseFilterCheck, used to preemptively check filters while the translation is being written. TTBOMK, it first checks the page as a whole, and then the text of each translation unit.

This is done by putting together various hacks and using internal AF logic [1] in a way that is not easy for me to maintain (because these internals must be kept exposed somehow). On top of the maintenance burden, this has also been the cause of issues in the past, e.g. T155897.

Additionally, the error messages shown by ContentTranslation worsen the user experience, because it just uses a generic error message for all actions. It doesn't make any distinction between warn, disallow, block, etc., and for 'disallow' it doesn't show the specific message used by the filter (it does so for warn with another hack, AFAICS).

All in all, this code is problematic and IMHO it should be removed to avoid maintenance burden. As a replacement, I'm proposing the following:

  • CX should run a hook before saving the translation; the hook would work similarly to EditFilterMergedContent (see the sketch after this list)
  • AbuseFilter would add a handler for that hook, and run everything as it usually does
  • CX would receive a list of messages that can be shown above the interface
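
To make the idea a bit more concrete, here is a minimal sketch of how it could fit together. Every name below (the hook name "ContentTranslationBeforePublish" and the variables) is made up for illustration; nothing like this exists yet, and the real signature would be decided during implementation.

```php
// Sketch only: hook name and parameters are hypothetical, modelled on the
// existing EditFilterMergedContent hook.

// 1) In ContentTranslation, right before the translation is saved:
$status = Status::newGood();
$hookContainer->run( 'ContentTranslationBeforePublish', [
	$context,           // IContextSource for the target page
	$translatedContent, // Content object holding the full translation
	$status,            // Status collecting errors/warnings from handlers
	$user,              // User publishing the translation
] );

if ( !$status->isOK() ) {
	// Show the collected messages above the CX interface, one per filter hit,
	// instead of a single generic error message.
}

// 2) AbuseFilter would register a handler for this hook and run the filters
// exactly as it does for normal edits, so CX would no longer need to call
// any AbuseFilter internals directly.
```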

Thoughts?

[1] - Which, being crappy legacy code, is obviously not marked as internal but as public static

Context: this code was added to address T114621

Related: T134678, T134681

Event Timeline

Daimona triaged this task as High priority. Oct 24 2020, 9:05 AM

I can try implementing this, but I'd like to hear some opinions from the maintainers of CX. This is quite important because, as I mentioned in the task description, it forces us to keep BC code around.

Having such a hook would be great regardless of the current use case, IMO, but the lack of T266907: AbuseFilter should provide error/warning messages in an API-friendly plaintext format makes it somewhat complicated for features with more polished user interfaces to use AbuseFilter warnings as they are.
Of course, even just upstreaming the code to the AbuseFilter codebase and keeping the generic ContentTranslation warning as it is would be an improvement.

The proposed approach looks OK to me, and I appreciate your offer to help implement it. Ideally, an API, as @Tgr mentioned, would help us more: it would let us trigger these validations on the client side and decouple AbuseFilter completely from CX. As the existing code shows, we need to do title validation and section content validation.

The reason for having CX's own interface to AbuseFilter is to have the flexibility to interact with AbuseFilter and show results according to CX's UX design.

Our team is short on capacity and we have planned deliverables this month. Also, we were trying to avoid code changes to CX, because it is heavily used these days and the team is not able to shift its focus to those changes. So, please expect delays in code reviews, even though we will try our best to help.

@santhosh, @Nikerabbit I've looked into this again, but I don't understand. What the CX code does is to anticipate AbuseFilter warnings *before* the translation is published. That is, while you are translating, you're progressively given warnings if any filter matches what you're writing. This is achieved by separately checking the sections and the title.

However, AFAICS, if you disable that code and complete your translation (without getting any pre-emptive warning), and then publish the translation, you'll receive an error message from AbuseFilter as usual. This message appears on top of the editor and is similar to the one used for normal edits:

Automatic edit filters have identified problematic content in your translation. Filter hit: Test cx

At this point, my question is: why do we need to anticipate warnings while the translation is ongoing? Not only does it require some hacks to invoke AbuseFilter, but it's also not guaranteed to work. In fact, it might give misleading warnings: filters can be very complex, and they often look at the whole picture (i.e. the whole content of the page). Checking just a single paragraph is likely to produce unwanted warnings. As a case in point, suppose a filter prevents adding the word "FOO" on pages *without* [[Category:Foobar]], and you're translating a page that is in the "Foobar" category: when a single paragraph is sent to AbuseFilter, it doesn't know what the page's categories are, and the filter is triggered anyway.
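
To make the example concrete, here is the same condition expressed in plain PHP (the real filter would of course be written in AbuseFilter's rule language; the function below is just a stand-in for illustration):

```php
// Illustrative only: a stand-in for a filter that disallows adding the word
// "FOO" on pages that are NOT in Category:Foobar.
function filterMatches( string $addedText, string $fullWikitext ): bool {
	return strpos( $addedText, 'FOO' ) !== false
		&& strpos( $fullWikitext, '[[Category:Foobar]]' ) === false;
}

// Publish-time check with the whole page: the category is present, no match.
$fullPage = "Some paragraph with FOO.\n[[Category:Foobar]]";
var_dump( filterMatches( 'Some paragraph with FOO.', $fullPage ) ); // bool(false)

// Per-section check while translating: the category lives outside the section,
// so the filter is (wrongly) triggered.
var_dump( filterMatches( 'Some paragraph with FOO.', 'Some paragraph with FOO.' ) ); // bool(true)
```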

Thus, my current proposal is: can we just remove the code that anticipates warnings? We can improve the error message that users get when hitting "publish", but that doesn't require emulating AbuseFilter.

@Daimona That is a product decision and @Pginer-WMF may give more details.

Personally, I think that if you put a lot of effort into translating an article, and when you finally try to publish you get an error that you may have no idea how to fix, or even what causes it, that is a very bad user experience.

I believe one of the reasons why we are checking individual paragraphs is that AbuseFilter doesn't tell where in the content the issue is.

Personally, I think that if you put a lot of effort into translating an article, and when you finally try to publish you get an error that you may have no idea how to fix, or even what causes it, that is a very bad user experience.

That is correct, but it goes for every kind of edit. People who write filters should explain exactly where the problem is by using the customizable error message. AFAIK, this is usually the case.

I believe one of the reasons why we are checking individual paragraphs is that AbuseFilter doesn't tell where in the content the issue is.

While this is correct, there's an important thing to note: AbuseFilter is much more powerful (and has a higher expressive power) than, say, SpamBlacklist or TitleBlacklist. A "piece of content where the issue is" doesn't always exist. There are tons of factors that can cause a filter to match, and it's very easy to create complex combinations of expressions. This is why no progress was made in T72152: generally speaking, it's impossible (and wrong) to attribute a filter match to a single piece of text.

That said, the status quo isn't much better. Right, it does highlight the unit where the error happens, but as I said, it's conceptually wrong to run a filter on a single section. That's just not how it works: AbuseFilter needs to look at the whole content of the page to return a meaningful result. The current hack can result in:

  • Showing a warning even if the filter wouldn't match while publishing (because it was executed without the full context)
  • Not showing a warning even if the filter will match while publishing (for the same reason)

I can think of many real-world filters (or, more specifically, filter conditions) that can cause the cases above to happen.

{{ping}} @Pginer-WMF ?


Two other things that came to mind:

  • If a filter isn't looking at the page content but just at the user, then CX would show a warning for each section
  • T269253, which might be tolerable on its own, but IMHO it's not if we add up the other reasons I mentioned above

I strongly suggest removing this code straight away, and waiting for T174554 / T216001 instead, if the intent is to really point the user to a specific section.

Thanks for the input @Daimona. I'll try to provide some context.

We worked on this feature after observing users who were not able to publish because they hit an abuse filter and were not able to identify the cause.

An example: some wikis don't allow YouTube links. When translating an article from a wiki that allows them into another one that does not, you may get an error that prevents you from publishing if the source article has such a link and it got propagated to the translation. Since the translator focuses mainly on adapting the language of the article, it is very possible that they are not aware of where the problematic link is. Then, in the best case, you have to chase each link, including the references, to find the small piece of content that is preventing publication. On a long translation, even knowing which paragraph has the issue can make a difference.

In general I think it is a good practice to surface errors as early as possible (as long as the user was given a chance to edit and we avoid false positives) and do that in a contextual way. We worked around the limitations of the current system to try to approximate that.

I agree that this has many limitations, as you pointed out. However, despite the conceptual limitations and the cases you described, most of the filters we found causing problems were actually about the content. Removing the current system takes us back to the previous state of the tool, which we know has problems when users try to publish. So we need to think carefully about which kinds of problems are less painful for our users if there is no alternative.

Having said all that, the ideal solution would be for abuse filters to provide more contextual feedback (T174554) and to allow defining some rules as safe to be checked in real time. This would allow all kinds of editors (not only Content Translation) to provide early and contextual guidance to users.

An example: some wikis don't allow YouTube links. When translating an article from a wiki that allows them into another one that does not, you may get an error that prevents you from publishing if the source article has such a link and it got propagated to the translation.

Yeah, this is quite a good example, although I believe people would use SpamBlacklist for a similar purpose. Unlike AbuseFilter, SB is much more straightforward: you can run its regexps against a single section and keep ~100% accuracy (unless there's a regexp meant to span multiple paragraphs, which I can't really imagine in a spam blacklist).

In general I think it is a good practice to surface errors as early as possible (as long as the user was given a chance to edit and we avoid false positives) and do that in a contextual way.

This is true for "soft" errors, like the one you mentioned about links. OTOH, I'd disagree with this if the "error" we're talking about is for adding some purely vandalistic pattern (a swear word or something). In that case, not only do we not want to warn the user as soon as possible, but we also don't want to tell them what exactly tripped the filter. There's no way to determine whether this is the case for a given filter, but excluding "block" from the list of serious actions would probably help.

Having said all that, the ideal solution would be for abuse filters to provide more contextual feedback (T174554) and to allow defining some rules as safe to be checked in real time. This would allow all kinds of editors (not only Content Translation) to provide early and contextual guidance to users.

That's a good point. T174554 itself is IMHO too wide, and it again suffers from a simplistic view of AbuseFilter as just a list of regexps matched against the page content. Instead, I'd rather go with T216001, which is more limited in scope and easier to achieve; plus, it would delegate to filter editors the choice of what might or might not be shown in a warning message.

I'm unsure whether T216001 will be done as part of the current overhaul, but I guess we can keep the CX code until that is resolved. I just wanted to point out that this code is quite hard to maintain because it's using internals that were not meant to be exposed.

Having said all that, the ideal solution would be for abuse filters to provide more contextual feedback (T174554) and to allow defining some rules as safe to be checked in real time. This would allow all kinds of editors (not only Content Translation) to provide early and contextual guidance to users.

What about adding a new flag (metadata) for each abuse filter to specify whether the intended purpose of that filter can only be checked if the whole page content is provided to it? Then, a filter that only checks whether the page contains YouTube links could be marked as safe for "checking a section of the text as if it were the whole text", while a filter that blocks e.g. usage of an expression on pages containing category X would not be safe, as it requires looking at both the page text and the category (which is usually at the end of the page, not repeated inside each section).
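
As a rough sketch of what that could look like (purely hypothetical; no such flag or method exists in AbuseFilter today):

```php
// Hypothetical: a per-filter "safe for partial checks" flag that a client like
// CX could rely on when it only has a single translation unit to check.
foreach ( $activeFilters as $filter ) {
	// Skip filters that need the whole page (e.g. ones looking at categories)
	// when only a section of the text is available.
	if ( $checkingSingleSection && !$filter->isSafeForPartialCheck() ) {
		continue;
	}
	// ...evaluate the filter against the partial content as usual...
}
```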

Is there a bug for that already?

Having said all that, the ideal solution would be for abuse filters to provide more contextual feedback (T174554) and to allow defining some rules as safe to be checked in real time. This would allow all kinds of editors (not only Content Translation) to provide early and contextual guidance to users.

What about adding a new flag (metadata) for each abuse filter to specify whether the intended purpose of that filter can only be checked if the whole page content is provided to it? Then, a filter that only checks whether the page contains YouTube links could be marked as safe for "checking a section of the text as if it were the whole text", while a filter that blocks e.g. usage of an expression on pages containing category X would not be safe, as it requires looking at both the page text and the category (which is usually at the end of the page, not repeated inside each section).

Is there a bug for that already?

I could not find a specific ticket about marking abuse filters as valid for real-time checks.
However, there is this broader exploration about surfacing issues earlier: T265163: Create a system to encode best practices into editing experiences

There I reported (T265163#7023128) the particular use case of abuse filters in Content Translation.