old_wikitext and new_wikitext are computed differently for MassMessageListContent
Open, Needs Triage · Public · Bug Report

Description

The old_wikitext variable is in the expanded (pretty-printed) JSON form, while the new_wikitext variable is stripped of whitespace. As a result, the entire page content is included in both removed_lines and added_lines, which triggers abuse filters with false positives.

Event Timeline

Restricted Application added a subscriber: Aklapper.
DannyS712 changed the subtype of this task from "Task" to "Bug Report".

Example: http://meta.wikimedia.org/wiki/Special:Diff/20513761 and https://meta.wikimedia.org/wiki/Special:AbuseLog/1076203
The diff shows the removal of a single line, User talk:DannyS712.
However, when the old and new wikitext for the edit were computed for filters, the results were:

old_wikitext
{
    "description": "Filler description",
    "targets": [
        {
            "title": "User talk:DannyS712"
        },
        {
            "title": "User talk:DannyS712/sandbox"
        },
        {
            "title": "User talk:DannyS712",
            "site": "en.wikipedia.org"
        }
    ]
}
new_wikitext
{"description":"Filler description","targets":[{"title":"User talk:DannyS712/sandbox"},{"title":"User talk:DannyS712","site":"en.wikipedia.org"}]}

Each line of the old wikitext was included in removed_lines, and the entire new wikitext was treated as a single line in added_lines.
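
To illustrate why a line-based diff behaves this way, here is a minimal Python sketch. It is not AbuseFilter's actual diff code: the helper `split_diff` is hypothetical and uses Python's `difflib` as a stand-in for whatever diff engine the extension uses. It only shows that comparing a pretty-printed serialization against a compact one leaves no line in common.

```python
import difflib
import json

# Content from the report: the edit removed the first target only.
old_obj = {
    "description": "Filler description",
    "targets": [
        {"title": "User talk:DannyS712"},
        {"title": "User talk:DannyS712/sandbox"},
        {"title": "User talk:DannyS712", "site": "en.wikipedia.org"},
    ],
}
new_obj = {"description": "Filler description", "targets": old_obj["targets"][1:]}


def split_diff(old_text, new_text):
    """Hypothetical helper: return (removed, added) lines of a line-based diff."""
    old_lines, new_lines = old_text.splitlines(), new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    removed, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(old_lines[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new_lines[j1:j2])
    return removed, added


# Mismatched forms, as in the report: pretty-printed old, compact new.
old_wikitext = json.dumps(old_obj, indent=4)
new_wikitext = json.dumps(new_obj, separators=(",", ":"))
removed_lines, added_lines = split_diff(old_wikitext, new_wikitext)

# Same form on both sides, for comparison.
removed_same, added_same = split_diff(
    json.dumps(old_obj, indent=4), json.dumps(new_obj, indent=4)
)
```

With mismatched forms, every old line lands in `removed_lines` and the whole compact string lands in `added_lines`. With the same form on both sides, only the lines of the removed entry differ.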

Is one of those variables not going through PST?

Yes, new_wikitext; not during the edit, at least. Performing a PST (pre-save transform) would also help with other bugs, e.g. T198651, see r443854. That is blocked on T264104, which in turn is waiting for opinions from the team.

Thinking about this again, I don't think this is going to be fixed. new_wikitext is meant to represent raw text, i.e. no PST. This is what allows you to create filters like added_lines contains '~~~~' etc. For lack of a better solution, this should probably be closed as invalid.

In that case, how can we detect the addition or removal of entries from mass message lists without treating every edit as a complete replacement of the content?

Uhm, how is whitespace relevant in determining what was added/removed? You can use the same code that you would have used with pretty-printed JSON (since parsing JSON from AF is not an option).
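
As an illustration of why whitespace does not matter for deciding which entries changed, here is a small Python sketch that compares the two serializations by their parsed targets. This runs outside AbuseFilter (which, as noted, cannot parse JSON), and the helper `target_keys` is hypothetical, not part of MassMessage or AbuseFilter.

```python
import json

# The two values from the report, in their differing serializations.
old_wikitext = """{
    "description": "Filler description",
    "targets": [
        {"title": "User talk:DannyS712"},
        {"title": "User talk:DannyS712/sandbox"},
        {"title": "User talk:DannyS712", "site": "en.wikipedia.org"}
    ]
}"""
new_wikitext = (
    '{"description":"Filler description","targets":['
    '{"title":"User talk:DannyS712/sandbox"},'
    '{"title":"User talk:DannyS712","site":"en.wikipedia.org"}]}'
)


def target_keys(wikitext):
    """Hypothetical helper: set of (title, site) pairs; site is None for local targets."""
    data = json.loads(wikitext)
    return {(t["title"], t.get("site")) for t in data.get("targets", [])}


removed_targets = target_keys(old_wikitext) - target_keys(new_wikitext)
added_targets = target_keys(new_wikitext) - target_keys(old_wikitext)
```

Here the comparison reports one removed target (the local User talk:DannyS712 entry) and no added targets, regardless of how either side is formatted.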

The problem is with added_lines/removed_lines: abuse filters think every line was removed and that a single line containing all of the content was added.

added_lines is not meant to represent "everything that was added", although this is a common misconception (and the dual is true for removed_lines). There are plenty of situations where some text (e.g. a paragraph) can appear in added_lines without really being added. What added_lines represents is just "the RHS of a diff", i.e. a shorter version of new_wikitext.

What were you trying to use to detect the addition assuming that new_wikitext would have been pretty-printed?

In this case, there were false positives on an antispam filter (https://meta.wikimedia.org/w/index.php?title=Special:AbuseLog&wpSearchUser=1.136.110.229) when someone was trying to remove an entry from a mass message list. The remaining text included the problematic phrase, and the abuse filter treated all of the text as newly added.

That's not the correct way to write a filter for this purpose, regardless of PST, whitespace, MassMessage, and content model. It would have the same problem as

added_lines contains "mybadword"

when executed on

old_wikitext
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore mybadword aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
new_wikitext
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore mybadword aliqua foobar. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

The way to go is to check both added_lines AND removed_lines.
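
A minimal Python sketch of that combined check, using a shortened version of the paragraph above. The function names are hypothetical and the strings stand in for the filter variables. In AbuseFilter syntax the same idea would be roughly added_lines contains "mybadword" & !(removed_lines contains "mybadword"); treat that as an untested sketch rather than a ready-made filter.

```python
def naive_check(added_lines: str, badword: str) -> bool:
    """The problematic rule: fires whenever the word is on the RHS of the diff."""
    return badword in added_lines


def combined_check(added_lines: str, removed_lines: str, badword: str) -> bool:
    """Fires only if the word is on the RHS of the diff and not on the LHS."""
    return badword in added_lines and badword not in removed_lines


# One-line paragraph that already contains the word; the edit only appends "foobar".
old_paragraph = "Lorem ipsum dolor sit amet, dolore mybadword aliqua."
new_paragraph = "Lorem ipsum dolor sit amet, dolore mybadword aliqua foobar."

# The changed line appears whole on both sides of the diff.
removed_lines = old_paragraph
added_lines = new_paragraph

false_positive = naive_check(added_lines, "mybadword")
still_flagged = combined_check(added_lines, removed_lines, "mybadword")

# An edit that really introduces the word is still caught.
genuine = combined_check("Lorem ipsum mybadword.", "Lorem ipsum.", "mybadword")
```

This simple form has a known gap: it stays silent when the word is added a second time to a line that already contained it, so a stricter filter would need to compare occurrence counts.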

P.S. What you're looking for is something like T220764