Overview
Early discussions around what edit actions might constitute moderation point to two core sets of actions that are difficult to detect in article wikitext and likely require article HTML diffs:
- Adjusting templates that add messageboxes to indicate general issues with an article -- e.g., en:Template:Advert, which also adds the hidden tracking category en:Category:Articles with a promotional tone.
- Adjusting templates that add in-line clean-up tags -- e.g., citation needed, but also many more. These generally also add hidden tracking categories, such as en:Category:All articles with unsourced statements for citation-needed.
Detecting these actions via wikitext requires a complete list of the various moderation-related templates. Past work has tackled this challenge to generate datasets of reliability issues on English Wikipedia (Wiki-Reliability), but our context is quite different: we need an approach that scales to all language editions of Wikipedia and requires minimal maintenance or manual intervention to stay up-to-date, and we need a complete set of templates rather than just a representative one (otherwise we'll be missing an unknown amount of moderation).
As such, a much better strategy is to rely on the HTML to detect these messageboxes and in-line tags. Specifically, messageboxes share a common CSS class across language editions. There is likely some deviation from this, as not all wikis had detectable messageboxes in a 500-article sample in this analysis, but maintaining a short list of CSS classes is much easier than maintaining a list of template names: most wikis build their maintenance templates on a single core module that dictates the specific CSS class, so each language should have at most one extra CSS class that we'd have to check for. In-line tags are a bit trickier, but some combination of the following should catch them:
- Checking for superscript text that was transcluded
- Requiring the superscript text to have a link to a page within the Wikipedia namespace
- Flagging changes that add a new tracking category, especially if it's linked to the superscript text
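As a rough illustration of the HTML-based checks described above, here is a minimal sketch using only the Python standard library. It is not how mwedittypes or mwparserfromhtml implement detection. The `ambox` class name, the `typeof="mw:Transclusion"` marker on the superscript, the `./Wikipedia:` link prefix, and the `rel="mw:PageProp/Category"` links are assumptions about what the article HTML looks like on English Wikipedia; other wikis would need their own values. The names `ModerationSignals`, `extract_signals`, and `diff_signals` are hypothetical.

```python
from html.parser import HTMLParser

# Assumed values for illustration only; real wikis may use other classes/prefixes.
MESSAGEBOX_CLASSES = {"ambox"}
PROJECT_NS_PREFIXES = ("./Wikipedia:",)


class ModerationSignals(HTMLParser):
    """Counts messageboxes, in-line clean-up tags, and categories in article HTML."""

    def __init__(self):
        super().__init__()
        self.messageboxes = 0
        self.inline_tags = 0
        self.categories = set()
        self._sup_depth = 0
        self._sup_transcluded = False
        self._sup_has_project_link = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        if tag in ("table", "div") and classes & MESSAGEBOX_CLASSES:
            self.messageboxes += 1
        elif tag == "sup":
            if self._sup_depth == 0:
                # Superscript text that was transcluded (i.e., produced by a template).
                self._sup_transcluded = "mw:Transclusion" in (attrs.get("typeof") or "")
                self._sup_has_project_link = False
            self._sup_depth += 1
        elif tag == "a" and self._sup_depth:
            # Link from the superscript to a page in the Wikipedia namespace.
            if (attrs.get("href") or "").startswith(PROJECT_NS_PREFIXES):
                self._sup_has_project_link = True
        elif tag == "link" and attrs.get("rel") == "mw:PageProp/Category":
            self.categories.add(attrs.get("href") or "")

    def handle_endtag(self, tag):
        if tag == "sup" and self._sup_depth:
            self._sup_depth -= 1
            if self._sup_depth == 0 and self._sup_transcluded and self._sup_has_project_link:
                self.inline_tags += 1


def extract_signals(html):
    parser = ModerationSignals()
    parser.feed(html)
    parser.close()
    return {
        "messageboxes": parser.messageboxes,
        "inline_tags": parser.inline_tags,
        "categories": parser.categories,
    }


def diff_signals(old_html, new_html):
    """Compares two revisions' HTML and reports what the edit added."""
    old, new = extract_signals(old_html), extract_signals(new_html)
    return {
        "messageboxes_added": new["messageboxes"] - old["messageboxes"],
        "inline_tags_added": new["inline_tags"] - old["inline_tags"],
        "categories_added": new["categories"] - old["categories"],
    }
```

Calling `diff_signals(parent_html, current_html)` on two revisions of an article would then flag edits that add a messagebox, an in-line tag, or a new category. Deciding which added categories are hidden tracking categories, and tying them to a specific superscript, would need extra information that this sketch does not attempt to model.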
Work needed
As of version 3.0.0, the mwedittypes library has support for basic HTML diffs. That may be enough, but here's an overview of the remaining library-level work that could be considered as part of achieving full functionality:
- Testing: the test suite for HTML diffs is quite limited, which means there are likely edge cases that we're missing. Testing the library at scale has greatly reduced the number of actual exceptions, but that doesn't guarantee correctness (GitLab issue).
- Templates: details of what has changed in a wikitext template's parameters (which might explain the changes being seen in the HTML) are available in the HTML as well, but they are only pre-parsed for top-level templates and otherwise represented as a string. So if we need to know, e.g., the details of a citation template that's nested in an infobox template, we might still need to fall back on mwparserfromhell to parse the wikitext (notebook example for first-level template details).
- Add more edge cases across languages for CSS-based detection: for elements like messageboxes, whose detection relies on CSS classes and HTML element types, certain languages are outliers (e.g., they use a div instead of a table element, or their own local CSS class instead of the more generic English-style one). There is a strategic choice to be made here between extending the library to handle these cases (broader coverage but more complicated code) and encoding the current default and expecting adherence (simpler, but won't work for every wiki out-of-the-box).
- Decide how to handle non-whitespace-delimited languages when summarizing text changes: right now the library splits on whitespace for whitespace-delimited languages and just counts characters for the others (e.g., Chinese, Japanese, Thai). We should revisit this, both to have a more consistent schema and perhaps to incorporate mwtokenizer as a dependency (GitLab issue).
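To make the schema question in the last item concrete, here is a minimal sketch of the two counting strategies described there, returned in one consistent shape. It is not the mwedittypes implementation and does not use mwtokenizer. The function name `count_text_units`, the `NON_WHITESPACE_LANGS` set, and the output format are all hypothetical.

```python
# Hypothetical, incomplete set of language codes treated as not whitespace-delimited.
NON_WHITESPACE_LANGS = {"zh", "ja", "th"}


def count_text_units(text, lang):
    """Returns a count of text units along with which unit was counted."""
    if lang in NON_WHITESPACE_LANGS:
        # No reliable whitespace boundaries: count non-space characters instead.
        count = sum(1 for ch in text if not ch.isspace())
        return {"unit": "character", "count": count}
    # Whitespace-delimited language: split on whitespace and count the pieces.
    return {"unit": "word", "count": len(text.split())}
```

Reporting the unit alongside the count keeps the output comparable across languages, but the counts themselves still mean different things for different wikis. A proper tokenizer would be needed to report words everywhere.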
Additional notes
- There are issue trackers for both mwedittypes and mwparserfromhtml on GitLab.
- This is separate from actual reverting of edits, which can be tracked via edit tags + edit hashes and does not require inspecting a diff.
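For contrast with the diff-based detection, here is a minimal sketch of how reverts could be identified from revision metadata alone. It assumes each revision is a dict with hypothetical `revid`, `sha1`, and `tags` keys in chronological order, and that the MediaWiki change tags `mw-rollback`, `mw-undo`, and `mw-manual-revert` mark reverting edits. The `find_reverts` function and its `window` parameter are illustrative, not part of any library mentioned here.

```python
# Change tags assumed to mark reverting edits.
REVERT_TAGS = {"mw-rollback", "mw-undo", "mw-manual-revert"}


def find_reverts(revisions, window=15):
    """Flags revisions that carry a revert tag or restore an earlier content hash."""
    reverts = []
    for i, rev in enumerate(revisions):
        tagged = bool(REVERT_TAGS & set(rev.get("tags", [])))
        restored = None
        # Identity revert: same hash as an earlier revision, but not as the direct parent
        # (an unchanged hash from the parent is a null edit, not a revert).
        if i >= 2 and revisions[i - 1]["sha1"] != rev["sha1"]:
            for prev in reversed(revisions[max(0, i - window):i - 1]):
                if prev["sha1"] == rev["sha1"]:
                    restored = prev["revid"]
                    break
        if tagged or restored is not None:
            reverts.append({"revid": rev["revid"], "tagged": tagged, "restores": restored})
    return reverts
```

Neither signal requires fetching or parsing article content, which is why this kind of tracking can be kept separate from the HTML-diff work above.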