
Edit Types: Feasibility Analysis
Closed, Resolved, Public

Description

Determine feasibility of detecting each edit type within an edit diff. There are two main components to this:

Event Timeline

Updates:

  • The tree differ component of the project is in action! You can test it out here (provide it a language + revision ID and it'll compare that revision with the previous). Example: https://edit-types.wmcloud.org/api/v1/diff?lang=en&revid=979988715
  • The implementation being used in the API is documented in this PAWS notebook along with some speed tests: https://public.paws.wmcloud.org/User:Isaac_(WMF)/Edit%20Diffs/Tree%20Diffing%20Implementation.ipynb
  • Essentially this code allows us to isolate high-level changes within the wikitext and what they were changing -- e.g., links, text, section headers, templates, etc. The next step is to take each high-level change and figure out what the corresponding edit actions were. For example, the tree differ can tell you that a template was changed but it won't tell you whether a parameter was just updated within that template or the template name itself was changed (in which case, it'd more realistically be described as the removal of one template and addition of another).
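
A minimal sketch of that next step for templates, assuming flat (non-nested) templates and plain string splitting rather than a real wikitext parser. The function names and the output tuples are hypothetical and not taken from the project code:

```python
def parse_template(wikitext):
    """Split a flat template like {{Name|a=1|b}} into (name, params)."""
    inner = wikitext.strip()[2:-2]  # drop the surrounding {{ and }}
    name, *raw_params = inner.split("|")
    params = {}
    positional = 0
    for raw in raw_params:
        if "=" in raw:
            key, value = raw.split("=", 1)
            params[key.strip()] = value.strip()
        else:
            # Unnamed parameters are keyed by their position: "1", "2", ...
            positional += 1
            params[str(positional)] = raw.strip()
    return name.strip(), params


def describe_template_change(old, new):
    """Turn one 'template changed' event into finer-grained edit actions."""
    old_name, old_params = parse_template(old)
    new_name, new_params = parse_template(new)
    if old_name != new_name:
        # A different template name reads better as remove + insert.
        return [("Template", "remove", old_name), ("Template", "insert", new_name)]
    actions = []
    for key in sorted(old_params.keys() | new_params.keys()):
        if key not in old_params:
            actions.append(("Parameter", "insert", key))
        elif key not in new_params:
            actions.append(("Parameter", "remove", key))
        elif old_params[key] != new_params[key]:
            actions.append(("Parameter", "change", key))
    return actions
```

Nested templates or links with pipes inside a parameter would break the naive split on "|", so a real version would need a proper wikitext parser.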

Updates:

  • Been working through an issue where adding a link mid-sentence (without changing the actual text) looks like changing a text node (shortening it), adding a link node, and creating a new text node. Looked through the VisualEditor diffing code and they do a fair bit of post-processing that will probably have to be added here -- e.g., merging sequentially-changed nodes. They also do some fuzzy matching of nodes to look for things that weren't changed so much as moved, so that's something we'll want to add in as well.
  • Jesse has been implementing the edit types detection part and has a number of those basic types worked out (templates, categories, images, etc.). We still have to decide just how specific we're going to get -- e.g., is there a difference between changing the destination of a link and changing the link text, or are both just "link changed"?
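
To illustrate the fuzzy-matching idea, here is a rough sketch that pairs removed and inserted nodes whose wikitext is similar enough to be treated as a move. It uses `difflib.SequenceMatcher` from the Python standard library; the function name, the greedy pairing, and the 0.8 threshold are assumptions for illustration and not how VisualEditor or the tree differ actually does it:

```python
from difflib import SequenceMatcher


def match_moves(removed, inserted, threshold=0.8):
    """Greedily pair removed and inserted node strings that look like moves."""
    moves = []
    unmatched_inserts = list(inserted)
    unmatched_removes = []
    for old in removed:
        best, best_score = None, 0.0
        for new in unmatched_inserts:
            score = SequenceMatcher(None, old, new).ratio()
            if score > best_score:
                best, best_score = new, score
        if best is not None and best_score >= threshold:
            # Similar enough: treat as the same node in a new position.
            moves.append((old, best))
            unmatched_inserts.remove(best)
        else:
            unmatched_removes.append(old)
    return moves, unmatched_removes, unmatched_inserts
```

Anything left in the unmatched lists would still be reported as a plain remove or insert.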

Updates:

  • Tree differ now handles moves -- e.g., template removed from one place and exact same content added somewhere else.
  • Tree differ now merges all text changes that happen within a section into one text change. This will simplify the node differ's work of detecting what changed, and it also deals with the link additions/removals because the text doesn't just include pure text nodes but also the titles of links, text in formatting tags, etc. A side effect is that changing a link could potentially be seen as both a Text change and a Link change, but that's probably not actually wrong.
  • Example: https://edit-types.wmcloud.org/api/v1/diff?lang=en&revid=979988715
  • Jesse is updating the node-diffing side of the infrastructure to adapt to a slightly new data format with the above changes: https://github.com/wikimedia/research-api-endpoint-template/pull/4
  • Once those are aligned again, I will update the nice UI to show current progress
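
A small sketch of why comparing section-level text sidesteps the mid-sentence link problem: if link titles and formatted text are kept while the markup is dropped, wrapping an existing word in a link leaves the text unchanged. The regular expressions and function names here are simplified assumptions (wikilinks and bold/italic quotes only), not the tree differ's implementation:

```python
import re

# [[target|label]] or [[target]]; visible text is the label if present, else the target.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")
# '' and ''' formatting markers (italics / bold).
QUOTES = re.compile(r"'{2,}")


def visible_text(section_wikitext):
    """Approximate a section's readable text: link titles kept, markup dropped."""
    text = WIKILINK.sub(
        lambda m: m.group(2) if m.group(2) is not None else m.group(1),
        section_wikitext,
    )
    return QUOTES.sub("", text)


def text_changed(old_section, new_section):
    """True only if the readable text differs, not just the markup around it."""
    return visible_text(old_section) != visible_text(new_section)
```

With this, turning "Norway" into "[[Norway]]" registers as a link insert without a text change, while changing a link's label would show up as both a text and a link change, matching the side effect noted above.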

Updates:

Quick summary of current state:

UI is currently functional and can be tested in any language here: https://wiki-topic.toolforge.org/diff-tagging?lang=en

The UI should work for any Wikipedia language edition, and leaving the Revision ID field blank will trigger a random diff to be evaluated, which is helpful for testing. Below I describe the current format of the output Edit Types and the three main stages to computing them. The visual depiction of the diff in the UI is just taken directly from the MediaWiki API.

Output / Taxonomy

These are the current edit types (presented in a hierarchy that isn't strictly necessary but reflects how they're detected):

  • Text (eventually the hope is to distinguish between grammar, spelling, and more substantive content changes)
  • Tag
    • Table
    • Reference
    • List item (in-progress)
    • Formatting (e.g., bold/italics)
  • Link:
    • Category
    • Image
    • Wikilink (everything else)
  • Template
  • Headings (sections)
  • External links
  • Comments

Each edit type has four potential actions: insert, remove, change, and move. The number of occurrences of each edit type + action pair is summed across the whole diff. Most types have clear boundaries, but text is aggregated by section: several text changes within the same section are counted just once, while text changes in different sections are each counted separately.
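
As a rough illustration of that summary, the sketch below tallies (edit type, action) pairs with a `collections.Counter` and collapses text changes to one per section. The input format (dicts with 'type', 'action', and 'section' keys) is a hypothetical stand-in for the real diff output:

```python
from collections import Counter


def count_edit_types(changes):
    """Sum (edit type, action) pairs across a diff.

    Text changes count at most once per section (the first one seen wins);
    every other edit type counts once per changed node.
    """
    counts = Counter()
    text_sections_seen = set()
    for change in changes:
        if change["type"] == "Text":
            if change["section"] in text_sections_seen:
                continue  # already counted a text change for this section
            text_sections_seen.add(change["section"])
        counts[(change["type"], change["action"])] += 1
    return counts
```

For example, two text changes in "History" plus one in "Geography" would yield a Text/change count of 2, whereas two inserted templates in the same section would still count as 2.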

Stage 1: Tree diffing

This is the high-level determination of what changed and where in an article -- e.g., a template was changed. It's the first stage in the diffing process and is particularly helpful for detecting moves and bringing more structure to the diff. You generally won't see the outputs as they are passed on to the node differ (explained below) to process.

Stage 2: Node diffing

This is the specific determination of what happened -- e.g., a parameter was added to that template. This also does some more fine-grained disambiguation of what was changed such as whether a link was a wikilink, image, or category and whether a tag is a reference, table, list, etc.
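
A sketch of the link disambiguation, assuming the type can be read off the namespace prefix of the link target. Only the English prefixes are checked here, which is a simplification: other language editions use localized namespace names, and the function name is hypothetical:

```python
def classify_link(wikilink):
    """Label a [[...]] link as Category, Image, or Wikilink from its prefix."""
    target = wikilink.strip()[2:-2].split("|", 1)[0].strip()
    if target.startswith(":"):
        # A leading colon links to the category/file page instead of
        # categorizing the article or embedding the image.
        return "Wikilink"
    prefix = target.split(":", 1)[0].strip().lower() if ":" in target else ""
    if prefix == "category":
        return "Category"
    if prefix in ("file", "image"):
        return "Image"
    return "Wikilink"
```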

Stage 3: Counting

This is the summary of what happened based on all the changes. While this sounds simple, it's actually one of the harder parts because it depends on a clear idea of how to interpret changes in wikitext. For example, how should one count a reference that was added within a template? Just a template edit? Or that plus a reference edit? Or just a reference edit if the template syntax wasn't otherwise altered? Currently, the edit type is just recorded for the highest-level node -- i.e., just the template regardless of what changed in the template. This is simplest and not incorrect, but I chatted with @MNeisler today about this and, based on that, I think we will switch to attempting to assign changes to the lowest-level node -- i.e., just the reference if that's all that changed in the template. If, for example, a new parameter was also added, then the template would be counted. This likely won't be perfect but is hopefully good enough. Code/results are part of the node_differ above so there is no separate endpoint.
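
One way the lowest-level assignment could look for the reference-in-template case: strip the `<ref>` blocks from both versions of the template and check whether anything else differs. This is only a sketch under stated assumptions (a simple regular expression for references, a hypothetical function name, and the template alone being counted once its own syntax changes), not the node_differ logic:

```python
import re

# Matches <ref>...</ref> and <ref name="x">...</ref>, but not self-closing <ref ... />.
REF = re.compile(r"<ref(?:\s[^>/]*)?>.*?</ref>", re.DOTALL)


def assign_edit_types(old_template, new_template):
    """Assign a changed template to the lowest-level node that explains the change."""
    if old_template == new_template:
        return []
    old_refs = REF.findall(old_template)
    new_refs = REF.findall(new_template)
    # The "shell" is the template with every reference removed.
    if REF.sub("", old_template) != REF.sub("", new_template):
        return [("Template", "change")]
    # Only references differ, so count the reference rather than the template.
    if len(new_refs) > len(old_refs):
        return [("Reference", "insert")]
    if len(new_refs) < len(old_refs):
        return [("Reference", "remove")]
    return [("Reference", "change")]
```

Adding `<ref>Census 2020</ref>` to an existing parameter would be counted as a Reference insert, while adding the same reference together with a new parameter would be counted as a Template change.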

Updates:

  • Moved code to a new repository so it could be more effectively isolated for testing and eventually package management (available via pip so easy to use in Spark etc.): https://github.com/geohci/edit-types
  • Added some basic tests for the tree diffing side of the code. These both help with development and defining expectations for the code (which can otherwise be somewhat ambiguous given that there is no right answer for diffing)
  • Some slowness on node diffing side due to outside factors but still moving forward and in good shape

Updates:

  • Now tracking bugs as they come up as Issues on GitHub: https://github.com/geohci/edit-types/issues
    • This seems to be working much better for this style of collaboration where I make some code tweaks but am often passing off requests to the contractor to handle. It's much easier to track and forces me to be clear about what the fix is
  • Added some more test cases, better handling of category/image changes (no longer trigger text changes), and documented some additional bugs to fix

Also closing out task for Q2:

  • Diff approach pretty full-fledged now. We still have some kinks to work out but the main diff-related blocker right now is handling nested nodes -- e.g., references within templates. We have charted a path for that, though: check for various nested nodes when things like templates are changed. If that's too intensive, we also have the ability to iteratively do the full diff on each object -- e.g., tree-diff the article, then iteratively tree-diff each component that was changed until it hits some base element like text or a link.
  • Text is the main edit type that we haven't touched much. We've been discussing approaches though, and our early iterations have raised some clear use-cases to cover -- e.g., whitespace-only changes, which happen frequently when someone adds/removes a category or template.
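
The whitespace case could be screened with a check like the one below, which treats a text change as whitespace-only when both versions contain the same tokens in the same order. The function name is hypothetical and this is just one possible definition:

```python
def is_whitespace_only_change(old_text, new_text):
    """True when two texts differ only in the whitespace between their tokens."""
    return old_text != new_text and old_text.split() == new_text.split()
```

A trailing newline removed alongside a category deletion would be flagged by this check, whereas a changed word or a space inserted inside a word would not.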

As always, current model for testing: https://wiki-topic.toolforge.org/diff-tagging?lang=en