
[INVESTIGATION] What additional semantic primitives could be introduced to better understand VE edits at scale?
Open, Needs Triage, Public

Description

This task involves investigating what new semantic primitives could be introduced to enable staff and volunteers to understand, at scale and with more granularity, the nature of the changes made within a given VE edit.

Where "semantic primitives" in this context could refer to both types of content as well as actions...

Content types
  • Sentences: the number of net new sentences that are added within a given edit. See T347644.
  • Images: the number of net new images that are added within a given edit
  • External links: the number of net new external links that are added within a given edit
  • Etc.

Actions

  • Pasted text from an external source
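
To make this concrete, here is one hypothetical shape such a per-edit summary could take. The class and field names are illustrative assumptions, not an existing MediaWiki or VisualEditor API:

```python
# Hypothetical per-edit summary combining content types and actions.
from dataclasses import dataclass, field


@dataclass
class EditPrimitives:
    new_sentences: int = 0        # net new sentences added
    new_images: int = 0           # net new images added
    new_external_links: int = 0   # net new external links added
    actions: set = field(default_factory=set)  # e.g. {"paste"}


# An edit that adds two sentences, at least one of them pasted:
summary = EditPrimitives(new_sentences=2, actions={"paste"})
```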

This task builds on an existing body of work that seeks to provide the kind of "granular understanding" described above.

Use cases

We think the kind of "understanding at scale" that the semantic primitives this task asks us to identify could enable would make the following stories possible...

  1. As an experienced volunteer who is motivated to maintain and improve the quality of content on Wikipedia, I'd value a way to filter change logs (e.g. Special:RecentChanges) for edits that involve someone introducing a number of new sentences, so that I can more easily find and focus my attention on reviewing edits that may have an outsized impact on content quality.
  2. As a developer who is motivated to maintain and improve the quality of content on Wikipedia, I'd value a way of programmatically detecting what type(s) of changes an edit has introduced/is attempting to introduce so that I can develop a feature/gadget/script that offers feedback relevant to the specific change(s) someone is seeking to make. [i]
  3. As a member of the Editing Team who is encountering a request to introduce a new potential Edit Check (e.g. mw:Edit check/Ideas), I'd value knowing what "semantic primitives" are available within VE, so that I can more easily and accurately assess the technical feasibility of said idea.

Open questions

  • 1. What is the theoretical set of "semantic primitives" that could be introduced to describe edits made with the VisualEditor and 2010 wikitext editor?
  • 2. Of the primitives "1." will reveal, what – if any – new technical capabilities would need to be introduced in order to develop/start offering them?
  • 3. How much data do we choose to expose about each primitive, bearing in mind that it's difficult to change or remove this data once other code has started depending on it?
    • E.g. might we expose only the number of new sentences added? Or might we expose the entire contents of the added sentences and leave any additional computation to the code that is "consuming" that data? (Both options are sketched after this list.)
  • 4. To what extent do we want to conform to limitations that make it feasible to provide this data on the server-side? Or, do we want a richer version on the client-side only?
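
For question 3, the two ends of that spectrum might look something like the following (hypothetical payloads, invented here purely for illustration):

```python
# Minimal exposure: counts only -- cheap to store and query server-side.
minimal = {"sentences": {"inserted": 2}}

# Rich exposure: full contents -- consuming code does its own analysis.
rich = {
    "sentences": {
        "inserted": [
            "The river freezes over in most winters.",
            "It last froze completely in 1963.",
        ]
    }
}
```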

Done

  • Answers to all "Open questions" are documented
  • An API is specified that defines what data is passed about an edit

i. E.g. https://en.wikipedia.org/wiki/User:Suffusion_of_Yellow/wikilint.js

Event Timeline

The challenge of this sort of analysis being applied to something like Special:RecentChanges, or similar server-side tools, is that it's all the outcome of some fairly complicated processing of the data. If we want it to work server-side, we'd probably need to add a whole layer of data storage that could be queried, for which we'd need to define up-front a limited set of data that was worth extracting for use.
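
Concretely, that storage layer might amount to a precomputed row of primitives per revision, written at save time. This is a hypothetical sketch, analogous in spirit to how change tags are stored, not an existing schema:

```python
# Hypothetical precomputed storage: one row of primitives per revision,
# so Special:RecentChanges-style tools can filter without re-diffing.
PRIMITIVES_BY_REV = {
    123456789: {"new_sentences": 3, "new_images": 0,
                "new_external_links": 1, "actions": ["paste"]},
}


def revs_with_new_sentences(min_count):
    """The cheap, pre-indexed query described above."""
    return [rev for rev, row in PRIMITIVES_BY_REV.items()
            if row["new_sentences"] >= min_count]


print(revs_with_new_sentences(2))  # [123456789]
```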

Someone writing a script/gadget, though, can hook into VisualEditor or a visual diff and access all the data about that specific edit. The main work we'd want to do here is provide helpers so they don't need to do complicated analysis themselves for common cases.

(The former is basically why we have tags, incidentally -- we preselect things we think are important and pull them out so we can filter revisions based on that. Sometimes it's data that isn't really part of the revision -- the editor used, actions taken during the edit, etc -- but other times it's things like "a notable amount of content was added" or "a citation was added" which could be worked out at any time, it's just expensive to do so in a query.)


Excellent points @DLynch! Sharing some thoughts here that may help to address some of them:

  • Constraining the universe of tags: you can see the set of tags that I went with in a Python library I built for this. Could certainly make different decisions around e.g., which link namespaces get their own special tag but it's not a ridiculous number even if you decide to also differentiate between whether the particular node (reference, link, etc.) was inserted, removed, changed, or moved.
  • Tree vs. symmetric diffs: there are lots of choices when computing diffs and one major one is whether you try to account for the structure/ordering of a document (tree-based diff) or ignore that structure and just compute a basic symmetric difference of the two docs. For visual diffs, structure matters a lot (you want to see if someone is moving content around and also say exactly what changed where). For edit tags, where it's just a question of whether something changed or not, you can ignore structure and massively reduce latency (especially in the worst-case scenarios). This super simplifies things because while the tree diff cannot really be parallelized and requires some very expensive comparisons that grow rapidly in complexity as the documents get larger, the symmetric difference allows you to parse the previous and current revisions in parallel and then just do a basic set comparison that grows only linearly with document size. In my edit types Python library, the tree diff is often an order of magnitude slower and requires way more memory. There's a really nice explanation by Thalia from Wikimania 2017 of how the tree diffing works (which is what I implemented in my Python library too). (A stripped-down sketch of the symmetric approach follows this list.)
  • Other causes for high latency when processing diffs: if you're doing the simple diff, then the actual symmetric difference part is essentially always cheap. The only issue that really arises is how expensive it is to parse the wikitext into a set of nodes to compare against each other. I'm working in Python with a package called mwparserfromhell and the main issues are super complicated/nested tables and when someone adds a ton of opening HTML tags without closing them -- i.e. highly nested structures. But if you're working with the Parsoid outputs, then that's already kinda handled for you and the work is just in mapping them to semantic classes that would make good edit tags (something we've started in Python with this library and that VisualEditor presumably already does to a large extent).
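
To make the symmetric approach concrete, here is a stripped-down sketch in the spirit of what the edit types library does: parse each revision independently with mwparserfromhell (the parallelizable step) and compare node-type counts. The real library maps nodes to richer semantic classes; this version only counts raw parser node types:

```python
from collections import Counter

import mwparserfromhell


def node_counts(wikitext):
    """Parse one revision independently and count its node types."""
    wikicode = mwparserfromhell.parse(wikitext)
    return Counter(type(node).__name__
                   for node in wikicode.filter(recursive=True))


def symmetric_diff(prev_wikitext, curr_wikitext):
    """Set-style comparison of node counts; cost grows linearly with size."""
    prev, curr = node_counts(prev_wikitext), node_counts(curr_wikitext)
    return {name: curr[name] - prev[name]
            for name in prev.keys() | curr.keys()
            if curr[name] != prev[name]}


# Adding an external link surfaces as a positive ExternalLink delta:
print(symmetric_diff("Some text.",
                     "Some text. [https://example.org a link]"))
```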
ppelberg updated the task description.

Updating the task description with the idea @dchan raised offline today: the potential to treat actions, in addition to content types, as a type of "semantic primitive."

Context: the above was prompted by David sharing the potential for Edit Check to detect whether someone has pasted content into an article, affording volunteers the ability to prompt people to consider whether the content they're adding is at risk of being a WP:Copyvio.

It's worth considering that "pasted" is more complicated than everything else mentioned so far, both technically and conceptually: its meaning comes from a certain action having been performed with the content, rather than from anything that could be worked out by analyzing the text after the fact.

Technically, I imagine that we might model it as annotated text, just with a hidden annotation that exists only for our reference purposes, so we remember which characters came from being pasted. (This is how things like italics exist in our data model, incidentally -- they just also have a visible effect.)
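
As a toy illustration of that idea, sketched here as plain Python rather than VE's actual linear model (which stores annotation references into a store, not literal strings):

```python
# Toy linear model: each character carries a set of annotations, and
# pasting stamps a hidden "pasted" annotation on the inserted characters.
def paste(document, offset, text):
    inserted = [(ch, {"pasted"}) for ch in text]
    return document[:offset] + inserted + document[offset:]


doc = [(ch, set()) for ch in "Hello world"]
doc = paste(doc, 5, ", pasted bit")

# Later, a check can recover exactly which characters were pasted:
print("".join(ch for ch, anns in doc if "pasted" in anns))  # ", pasted bit"
```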

The interesting thing to consider about it is: how long does content count as "pasted"? Obviously, it counts immediately after it was pasted. But what if the editor starts working on it -- rearranging it, adding original content mixed in with it, deleting parts of it? (Putting it in a blockquote and surrounding it with quotation marks?) How much has to be done to it before the user would think we're being glitchy for still treating it as pasted content?
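
One entirely hypothetical policy, building on the toy model above: keep treating a span as pasted while some threshold fraction of the originally pasted characters survives. The function and the 0.5 threshold are illustrative assumptions only:

```python
def still_counts_as_pasted(doc, original_paste_len, threshold=0.5):
    """True while at least `threshold` of the pasted characters survive."""
    surviving = sum(1 for _, anns in doc if "pasted" in anns)
    return surviving >= threshold * original_paste_len


# With the 12-character paste from the previous sketch, still intact:
# still_counts_as_pasted(doc, original_paste_len=12) -> True
```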