Page MenuHomePhabricator

AbuseFilter: added_lines and removed_lines contain a line not actually changed in the diff
Open, Needs TriagePublicBUG REPORT

Description

We came across a strange, probable bug in the abusefilter on jawp.
As the title says, we found in an abuse log that added_lines and removed_lines had something that hadn't been changed in the diff.
It's like:

// Old revision  // New revision
→ {{Template1}}
 {{Template2}}  {{Template2}}
         → {{Template1}}

But added_lines and removed_lines think {{Template2}} has been changed. (If you can see the abuse log, notice that the third line in the diff isn't changed but is displayed in the variable preview.)

[
    0 => {{Template2}}
]

We suppose this is a bug. Could you have a look? Thanks.
(Note: There's a similar report on T156920)

Event Timeline

I took a quick look, it does seem to be a bug. Maybe we're not using the right diff formatter or something, I'm not sure. edit_diff is not consistent with what the actual diff is shown to be. Both of them are right, in a sense. It's just that the engine used to compute edit_diff thinks that {{Pathnav|仮面ライダーシリーズ|frame=1}} was moved up, rather than thinking that {{出典の明記|date=2022-07-08}} was moved down.

@Daimona Thanks for the quick response. I'm pretty sure this can make false positives (and in fact it did), so we'd appreciate it if anyone could fix it.

@Daimona Thanks for the quick response. I'm pretty sure this can make false positives (and in fact it did), so we'd appreciate it if anyone could fix it.

Sorry, lately I don't have much time for AF maintenance. However, while this is a genuine bug, it should never be the sole reason of a false positive. added_lines doesn't contain only the words that were added by the edit; it contains every line of text that was changed in the edit; see the manual and pages 32-34 of this presentation. If you want to check whether something was added by the edit, you need to check that it is contained in added_lines AND that it is not contained in removed_lines. Failing to do so means that the filter could have false positives regardless of how the diff algorithm chooses to split the text.

@Daimona It's fine, this would be time-consuming.
Our filters include both added_lines and removed_lines if they need to detect content addition/removal, so the false positives we came across are likely to be attributed to this bug.
Anyway, as I said above, we'd appreciate it if anyone could take care of this matter.

Our filters include both added_lines and removed_lines if they need to detect content addition/removal, so the false positives we came across are likely to be attributed to this bug.

I'm not 100% convinced. If a filter is programmed correctly, the diff algorithm it uses should not affect its correctness. If changing the diff algorithm results in a false positive, it means that the filter was already incorrect in the first place, and the only reason why it didn't have a false positive thus far was chance.

I took a quick look at the code, and it seems that my guess above is correct. edit_diff is computed with:

$diffs = new Diff( $text1, $text2 );
$format = new UnifiedDiffFormatter();
$result = $format->format( $diffs );

whereas the diff shown for AbuseLog entries is:

$diffEngine = new DifferenceEngine( $this->getContext() );
$formattedDiff = $diffEngine->addHeader(
	$diffEngine->generateTextDiffBody( $old_wikitext, $new_wikitext ),
	'', ''
);

I'm not familiar with how diff-related classes work: we have Diff, DiffEngine, DifferenceEngine, *DiffFormatter, and *DiffRendered. I honestly don't know what the differences are between these classes (I can guess some, but I don't have the full picture).

ALso, I'm guessing we could change the diffing algorithm used to compute edit_diff, but in doing so we'd also have to ensure that the performance doesn't get worse. I would also like to make sure that whatever engine/renderer/formatter we use on Special:AbuseLog is the one used for "normal" diff pages, but I'm not sure if there's a way to do that. We could also choose an appropriate formatter for the content model of that page (right now we use the formatter for plain text).

All in all, there are several changes involved here and I'm not familiar enough with the diff code to make them. I'd be happy to help fix that specific filter locally, but I will need some guidance.

@Daimona Thank you so much for your detailed inspection, and

$diffEngine->generateTextDiffBody( $old_wikitext, $new_wikitext )

I'm guessing this line is the cause of the diff-computing engine thinking that Template2 above was moved up, instead of thinking that Template1 was moved down.

Anyway, I'm going to look over the filter that had false positives when I have time, but I currently don't have much time for anything :( Let me get back to you when I've done that and if we think we need your help. Thank you again.