
PendingChangesBot: Investigate methods to detect if pending change additions still exist in current article version
Open, Needs Triage (Public)

Description

When reviewing pending changes, we need to detect whether text added in a pending revision still exists in the current version of the article. If all additions made in the pending change have been removed or superseded by subsequent edits, the pending change can be automatically accepted (or skipped), as there's no longer any content to review.

Desired bot behavior

  • Detect when text added in a pending change no longer exists (or has been substantially modified) in the current article version
  • Also detect cases where additions are only partially intact (i.e., other users have inserted new text inside the added text, the addition was partially removed, or the text has been moved)
  • Automatically accept or skip such pending changes, as the content is no longer present to review
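As a starting point for discussion, the detection described above could be approximated with Python's standard `difflib`. This is only a minimal sketch under stated assumptions, not the bot's implementation: the helper names (`tokenize`, `added_tokens`, `survival_ratio`) are hypothetical, the three revisions are assumed to be available as plain wikitext strings, and tokens are compared as whole words with punctuation ignored.

```python
import difflib
import re


def tokenize(text):
    # Assumption: word-level tokens are enough; punctuation and whitespace are dropped.
    return re.findall(r"\w+", text)


def added_tokens(parent, pending):
    """Return the runs of tokens that the pending revision added relative to its parent."""
    a, b = tokenize(parent), tokenize(pending)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    runs = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" both introduce new tokens on the pending side.
        if tag in ("insert", "replace"):
            runs.append(b[j1:j2])
    return runs


def survival_ratio(parent, pending, current):
    """Fraction of added tokens still found, in order, in the current revision.

    Returns 0.0 when the pending revision added nothing.
    """
    runs = added_tokens(parent, pending)
    total = sum(len(run) for run in runs)
    if total == 0:
        return 0.0
    cur = tokenize(current)
    surviving = 0
    for run in runs:
        matcher = difflib.SequenceMatcher(None, run, cur, autojunk=False)
        surviving += sum(block.size for block in matcher.get_matching_blocks())
    return surviving / total


parent = "The cat sat on the mat."
pending = "The cat sat on the mat. It was a sunny Tuesday afternoon."
print(survival_ratio(parent, pending, parent))   # addition fully removed: 0.0
print(survival_ratio(parent, pending, pending))  # addition fully intact: 1.0
```

A ratio of 0.0 would correspond to the "no longer exists" case and values strictly between 0 and 1 to the "partially intact" case. One known weakness of this sketch is that common words in the addition can match unrelated text elsewhere in the current revision, so short additions may look more intact than they are.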

Task
Investigate how detection should be done: do we need per-revision, word-level annotation that records which revision each word originated from, or can an (open source) LLM do this for us if we simply give it the original diff and the wikitext of the latest revision?
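To make the word-level annotation option concrete, here is a rough sketch of how the origin of each word could be tracked across a revision history. It is an illustration only and does not reproduce WikiTrust's algorithm: the function name `annotate_origins` is hypothetical, revisions are assumed to be passed as `(rev_id, text)` pairs ordered oldest first, and moved or restored text is simply treated as newly added.

```python
import difflib
import re


def tokenize(text):
    # Assumption: word-level tokens; punctuation and whitespace are dropped.
    return re.findall(r"\w+", text)


def annotate_origins(revisions):
    """Label each token of the latest revision with the revision that introduced it.

    revisions: list of (rev_id, text) pairs, oldest first.
    Returns a list of (token, origin_rev_id) pairs for the last revision.
    """
    prev_tokens, prev_origins = [], []
    for rev_id, text in revisions:
        tokens = tokenize(text)
        # By default every token is attributed to the revision being processed.
        origins = [rev_id] * len(tokens)
        matcher = difflib.SequenceMatcher(None, prev_tokens, tokens, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged tokens keep the origin they already had.
                origins[j1:j2] = prev_origins[i1:i2]
        prev_tokens, prev_origins = tokens, origins
    return list(zip(prev_tokens, prev_origins))


history = [
    (1, "alpha beta"),
    (2, "alpha beta gamma delta"),
    (3, "alpha gamma delta epsilon"),
]
for token, origin in annotate_origins(history):
    print(token, origin)
```

With such labels, checking whether a pending revision's additions survive reduces to counting the tokens in the latest revision whose origin is that pending revision. The cost is that the labels have to be computed over the whole history between the pending revision and the current one.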

Also determine whether there are existing (Python) tools or libraries for this kind of work. (WikiTrust, for example, did this, so we know it is technically possible, but it was written in OCaml and has been defunct for over 10 years.)

Provide a proposal or results of the investigation in the comments, and we can write an actual task ticket based on the investigation.

Event Timeline

Hello @Zache, I did some research and came up with a proposal, but I can't paste it here directly because of its length, so I put it in a Google Doc. Here is the link: https://docs.google.com/document/d/1Y1fHYUDenIiSmQQtC7ZErrcmqlTgr9npDnLx6K5cE2o/edit?usp=sharing

Please go through it.

Thank you. From a practical point of view, LLMs are a poor fit because they are slow and unreliable, and the results were somewhat worse than I expected. In any case, would you like to implement the first phase of proposal B?

Yes, I can. Should I open an issue on GitHub before starting work, or should I start working and create a PR when I'm done?

Open a ticket on GitHub and make a pull request against it. I will keep this ticket open in case somebody else would like to take it on and explore alternative approaches.

I think your proposal is pretty solid for quick filtering. WikiTrust-style annotation, for example, has its own benefits, but it requires precomputing the data.