
Update mwedittypes to handle HTML diffs
Open, Low, Public

Description

Overview

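Early discussions around which edit actions might constitute moderation point to two core sets of actions that are difficult to detect in article wikitext and likely require article HTML diffs:

  • Adding or removing messageboxes (banner-style maintenance templates)
  • Adding or removing in-line tags (maintenance templates rendered as superscript text)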

Detecting these actions via wikitext requires a complete list of the various moderation-related templates. Past work has tackled this challenge to generate datasets of reliability issues on English Wikipedia (Wiki-Reliability), but our context differs considerably: we need an approach that scales to all language editions of Wikipedia and requires minimal maintenance or manual intervention to stay up to date, and we need a complete set rather than just a representative one (otherwise we will miss an unknown amount of moderation).

As such, a much better strategy would be to rely on the HTML to detect these messageboxes and in-line tags. Specifically, messageboxes share a common CSS class across language editions. There is likely some deviation from this, as not all wikis had detectable messageboxes in a 500-article sample in this analysis, but maintaining a short list of CSS classes is much easier than maintaining a list of template names because most wikis build their maintenance templates on a single core module that dictates the specific CSS class. That means each language should need at most one extra CSS class to check for. In-line tags are a bit trickier, but some combination of the following should catch them:

  • Checking for superscript text that was transcluded
  • Requiring the superscript text to have a link to a page within the Wikipedia namespace
  • Flagging changes that add a new tracking category, especially if it's linked to the superscript text
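
As a rough illustration of the heuristics above, here is a minimal sketch using only the Python standard library (it is not the mwedittypes or mwparserfromhtml implementation). It assumes Parsoid-style HTML, where transcluded content carries `typeof="mw:Transclusion"`, wiki links are relative (`./Wikipedia:...`), and categories appear as `<link rel="mw:PageProp/Category">`. The class name `ambox`, the `./Wikipedia:` prefix (English Wikipedia only), and all function and class names are assumptions chosen for the example.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: "ambox" stands in for the shared messagebox class; a real
# deployment would curate a short per-wiki list of classes.
MESSAGEBOX_CLASSES = {"ambox"}
# Assumption: Parsoid-style relative links on English Wikipedia.
PROJECT_LINK_PREFIX = "./Wikipedia:"
CATEGORY_REL = "mw:PageProp/Category"


class ModerationSignalParser(HTMLParser):
    """Collects messageboxes, in-line tags, and categories from article HTML."""

    def __init__(self):
        super().__init__()
        self.signals = Counter()  # keys are (kind, identifier) tuples
        self._sup_stack = []      # one bool per open <sup>: is it transcluded?

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        if classes & MESSAGEBOX_CLASSES:
            self.signals[("messagebox", " ".join(sorted(classes)))] += 1
        if tag == "sup":
            # Simplification: only a <sup> that is itself the transclusion root counts.
            self._sup_stack.append("mw:Transclusion" in (attrs.get("typeof") or ""))
        elif tag == "a" and any(self._sup_stack):
            href = attrs.get("href") or ""
            if href.startswith(PROJECT_LINK_PREFIX):
                self.signals[("inline_tag", href)] += 1
        elif tag == "link" and attrs.get("rel") == CATEGORY_REL:
            self.signals[("category", attrs.get("href") or "")] += 1

    def handle_endtag(self, tag):
        if tag == "sup" and self._sup_stack:
            self._sup_stack.pop()


def extract_signals(html):
    """Return a Counter of moderation-related signals found in one revision."""
    parser = ModerationSignalParser()
    parser.feed(html)
    parser.close()
    return parser.signals


def diff_signals(old_html, new_html):
    """Compare two revisions and report which signals were added or removed."""
    old, new = extract_signals(old_html), extract_signals(new_html)
    return {"added": new - old, "removed": old - new}
```

The sketch records every category change rather than only tracking categories, and it does not try to link a category to a specific superscript tag; both would need wiki-specific knowledge that is out of scope here.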

Work needed

The mwedittypes library as of version 3.0.0 has support for basic HTML diffs. Maybe that's enough but here's an overview of remaining library-level work that could be considered as part of achieving full functionality:

  • Testing: the test suite for HTML diffs is quite limited, which means that there are likely edge-cases that we're missing. We've largely reduced the number of actual exceptions through testing the library at scale, but that doesn't guarantee correctness (GitLab issue).
  • Templates: if we want to extract the details of what has changed in a wikitext template parameter (which might explain the changes seen in the HTML), this information is also available in the HTML, but it is only pre-parsed for top-level templates and is otherwise represented as a string. So if we need, e.g., the details of a citation template nested in an infobox template, we might still need to fall back on mwparserfromhell to parse the wikitext (notebook example for first-level template details).
  • Add more edge-cases across languages for CSS-based detection: for elements like messageboxes, whose detection relies on CSS classes and HTML element types, certain languages are outliers (e.g., they use a div instead of a table element, or their own local CSS class instead of the more generic English-style one). There are strategic choices to be made here between extending the library to handle these cases (at the cost of more complicated code) and encoding the current default and expecting adherence (simpler, but it won't work for every wiki out of the box).
  • Decide how to handle non-whitespace-delimited languages when summarizing text changes: right now the library splits on whitespace for whitespace-delimited languages and just counts characters for others (e.g., Chinese, Japanese, Thai). We should revisit this both to have a more consistent schema and perhaps to incorporate mwtokenizer as a dependency (GitLab issue).
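
To make the schema question in the last item concrete, the sketch below mimics the behavior described there: whitespace splitting for whitespace-delimited languages and character counting otherwise. It is not the library's actual code. The language set, the function name, the returned fields, and the choice to ignore whitespace when counting characters are all assumptions for illustration.

```python
# Assumption: an illustrative, incomplete set of language codes whose scripts
# do not separate words with whitespace.
NON_WHITESPACE_LANGUAGES = {"zh", "ja", "th"}


def count_text_units(text, lang):
    """Summarize a piece of changed text as a unit type plus a count."""
    if lang in NON_WHITESPACE_LANGUAGES:
        # Count characters, ignoring any whitespace (an assumption of this sketch).
        return {"unit": "character", "count": sum(1 for ch in text if not ch.isspace())}
    # Whitespace-delimited languages: count whitespace-separated tokens.
    return {"unit": "word", "count": len(text.split())}
```

For example, `count_text_units("The quick brown fox", "en")` reports 4 words while `count_text_units("東京は日本の首都", "ja")` reports 8 characters, so the two counts are not comparable across languages. That mismatch is what a shared tokenizer would address.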

Additional notes

  • There are issue trackers for both mwedittypes and mwparserfromhtml on GitLab.
  • This is separate from actual reverting of edits, which can be tracked via edit tags + edit hashes and does not require inspecting a diff.
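
For contrast with diff-based detection, here is a minimal sketch of hash-based revert detection: a revision counts as a revert when its content hash matches that of an earlier revision. The input format (chronological `(rev_id, sha1)` pairs), the look-back `radius`, and the function name are assumptions for illustration. Edit tags would be checked separately and are not modeled here.

```python
def find_identity_reverts(revisions, radius=15):
    """Find revisions whose content hash restores an earlier revision.

    revisions: list of (rev_id, sha1) tuples in chronological order.
    radius: how many earlier revisions to search (hypothetical default).
    """
    reverts = []
    for i, (rev_id, sha1) in enumerate(revisions):
        start = max(0, i - radius)
        # Start two revisions back: matching the immediate parent means
        # nothing changed, which is not a revert.
        for j in range(i - 2, start - 1, -1):
            if revisions[j][1] == sha1:
                reverts.append({
                    "reverting": rev_id,
                    "restored": revisions[j][0],
                    "reverted": [r for r, _ in revisions[j + 1:i]],
                })
                break
    return reverts
```

Given `[(1, "a"), (2, "b"), (3, "a"), (4, "c")]`, revision 3 is reported as restoring revision 1 and reverting revision 2, without inspecting any diff.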

Event Timeline

diego triaged this task as High priority. (Nov 6 2024, 4:55 PM)
diego added a subscriber: XiaoXiao-WMF.

@XiaoXiao-WMF this task is high priority for SDS 1.2.3, please let me know how to proceed.

diego set Due Date to Nov 21 2024, 11:00 PM. (Nov 6 2024, 4:57 PM)
diego changed Due Date from Nov 21 2024, 11:00 PM to Nov 24 2024, 11:00 PM.

(Moving to In progress b/c we're closing the quarterly lane today. Please resolve when done with link to output and other relevant info. Thanks.)

Updates:

  • For the Q2 work on SDS 1.2.3, given the short timeline and the experimental status of the moderator actions classifier, we opted to validate the approach using heuristics (using mwparserfromhtml directly) before committing to integrate mwparserfromhtml with the mwedittypes library itself.
  • The HTML dataset requested in T380874 is slated for Q4 at the earliest, but more likely next fiscal year.
  • The task to productionize the mwedittypes library in pipelines (T351225) was moved to the freezer, so I am unassigning myself from this task until it is prioritized.

Still a worthwhile long-term task, but moving it to the freezer until we have a more urgent need for it.

Aklapper lowered the priority of this task from High to Low. (Nov 26 2025, 11:40 PM)
Aklapper removed Due Date which was set to Nov 24 2024, 11:00 PM.
Aklapper removed a subscriber: XiaoXiao-WMF.

Reset Due Date, which was long ago.