
[Epic] Add feature annotate/blame command, to indicate who last changed each line / word
Open, Normal, Public

Description

Original task
Many times I have had to step through a page's history, revision by revision, to find out who added an offending line, or a curious line that I need to contact them about. Because various people mentioned here (such as --TK) did not sign, it took a while to figure out exactly who TK was. It would be rather nice to be able to highlight or search for a line and be told each time that line was changed, which would let me easily find who added it.

This is a feature request, and as such I labeled it an enhancement, as there's no easy way to request features. Apologies if I did this wrong. I also searched for "Line" and found only a few bugs, none of them like this.

Merged duplicate task
Sometimes vandals insert a piece of nonsensical information in the middle of a big article, where it stays hidden for years. This is especially true on the Portuguese Wikipedia, which doesn't have enough people for an immediate response to vandalism. It would be great if there were a tool to "blame" a paragraph, like there is on GitHub. I mean a tool that, given a paragraph (or a set of paragraphs), points out the last edit (or, perhaps, all the edits) in which this paragraph showed up in the diff. Sometimes it wouldn't work because of merges and moves, but it probably would in 99% of cases.

--Usien6 (talk) 03:06, 11 November 2015 (UTC)

This card tracks a proposal from the 2015 Community Wishlist Survey: https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey

This proposal received 15 support votes, and was ranked #53 out of 107 proposals. https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey/Moderation_and_admin_tools#Paragraph_blaming_tool

Example Blame

External tools that did/do this

Further discussion

Other duplicates:

See also:


Proposed in Community-Wishlist-Survey-2016. Received 33 support votes, and ranked #46 out of 265 proposals. View full proposal with discussion and votes here.

Proposed in Community-Wishlist-Survey-2017. Received 110 support votes, and ranked #4 out of 214 proposals. View full proposal with discussion and votes here.

Details

Commits
Unknown Object (Diffusion Commit)
Reference
bz639

Related Objects

Event Timeline


gribeco wrote:

I really would like to know who was the *first* to introduce a given
sentence/paragraph, so I can hunt down copyright violators and kill them =)

ayg wrote:

That requires considerably more complexity. You have to decide what happens
when lines are split or merged or moved, to begin with.

I think that running an annotation on a page every time it's saved would make
saving /very/ slow on pages with large histories. My suggestion would be /only/
updating the annotation for the changed lines, rather than redoing the entire
annotation.
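
The incremental update suggested above can be sketched roughly as follows. This is only an illustration, not MediaWiki code: it uses Python's difflib as a stand-in for whatever diff engine would really be used, and the names update_blame and blame_history are made up for the example. Each save diffs the new text against the previous revision only, so unchanged lines keep their author and the full history is never re-read. As noted earlier in the thread, lines that are moved, split or merged are simply credited to the latest editor here.

```python
from difflib import SequenceMatcher


def update_blame(old_lines, old_blame, new_lines, editor):
    """Return per-line authorship for new_lines.

    old_blame[i] is the user credited with old_lines[i]. Lines the diff
    reports as unchanged keep their author; inserted or replaced lines
    are credited to `editor`.
    """
    matcher = SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    new_blame = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            new_blame.extend(old_blame[i1:i2])
        elif tag in ("insert", "replace"):
            new_blame.extend([editor] * (j2 - j1))
        # "delete": the old lines are gone, nothing to carry over
    return new_blame


def blame_history(revisions):
    """revisions: iterable of (user, text) pairs, oldest first."""
    lines, blame = [], []
    for user, text in revisions:
        new_lines = text.splitlines()
        blame = update_blame(lines, blame, new_lines, user)
        lines = new_lines
    return list(zip(blame, lines))


history = [
    ("Alice", "one\ntwo\nthree"),
    ("Bob", "one\ntwo changed\nthree"),
    ("Carol", "zero\none\ntwo changed\nthree"),
]
for author, line in blame_history(history):
    print(f"{author:6} {line}")
```

In a real save path only update_blame would run, with the previous revision's blame loaded from storage.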

Maybe a crazy idea, but anyway: I started using git (the version control tool used
for the Linux kernel) two weeks ago and am already amazed at its power and
flexibility. It's very fast and has good tools for searching through history.
Maybe the whole Wikipedia history could be imported into git? After that, new page
saves would be added as new commits; as this is very fast in git, it won't represent
a problem for the servers.

To make the git idea more practical, it would also be possible to have a git repository for each
wikipedia page; git is very space efficient, so this would not be a problem (I think it would
probably need less space than the DB) and the repositories could be stored on different servers.
Pages are effectively independent of each other, so a shared repository wouldn't have many
advantages.

robchur wrote:

That would require gutting MediaWiki's internals, breaking compatibility with
huge amounts of other implementations; requiring the use of another piece of
software, and *could* introduce serious performance problems, despite the "speed
of git", as it were. The current use of the database is optimised in various
places for speed and overall load balancing as it is.

A "blame" command would be nice to have, but it's going to need a sane
implementation, not a radical reorganising of literal terabytes of information.

sean_woolcock wrote:

I have had many times where I would continuously go through a history to find
out who added an offending line, or a curious line which I need to contact them
about.

Me too; it sucks!

But note that a full-on CVS/Subversion line-by-line "annotate"
command is more than this feature really needs to be. All you
really need is a box where you can type some text, and click
"Find first version of this article containing this text".

The code could just look at revisions of the article in
a binary-search fashion, so it would be fast. Here's a
quick implementation in Perl:

http://en.wikipedia.org/wiki/User:TotoBaggins#Wikiblame

ayg wrote:

Binary search is unacceptable for this. It can return incorrect results in the case of reversions.
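
A small illustration of this objection (not taken from the Perl script linked above): the sketch assumes a page history held as a list of revision texts, oldest first, and the function name and sample histories are invented for the example. Binary search presumes that once the phrase appears it never disappears again. After a removal and a revert, it reports the revert instead of the original addition.

```python
def first_containing_bisect(revisions, phrase):
    """Binary search for the earliest revision containing phrase.

    Only valid if "contains phrase" is False for a prefix of the
    history and True for everything after it.
    """
    lo, hi = 0, len(revisions) - 1
    if not revisions or phrase not in revisions[hi]:
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if phrase in revisions[mid]:
            hi = mid          # phrase present: answer is mid or earlier
        else:
            lo = mid + 1      # phrase absent: answer is later
    return lo


# Phrase added once and never removed: the search is correct.
clean = ["a", "a b", "a b X", "a b X c", "a b X c d"]
print(first_containing_bisect(clean, "X"))      # 2

# Phrase added in revision 1, removed in 2, restored by a revert in 3.
reverted = ["intro", "intro X", "intro", "intro X", "intro X more"]
print(first_containing_bisect(reverted, "X"))   # 3, but the true answer is 1
```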

robchur wrote:

*** Bug 9455 has been marked as a duplicate of this bug. ***

bugs wrote:

I'll repost my request 9455 here, as it's rather simpler to implement than the
original request, and possibly less expensive:

It would be useful to be able to search in the prior revisions of a page in two
modes:

  • Search backwards to find the first time when a specified piece of text appears
    (ie, when it was added)
  • Search backwards to find the last time that a specified piece of text appears
    (ie, when it was removed)

Ideally one day it would be great to be able to click on text and see who added it.
But in the meantime, it would be great to simply be able to search for a phrase
like "He was a supporter of Hitler." and to be able to leap to the revision when
that text first appeared.

(a slightly souped up version might show a condensed history consisting of
groups of revisions where the phrase appears at least once followed by groups of
revisions where it doesn't appear at all)

I notice that it would not be susceptible to whole paragraphs being moved around
as Brion commented. Since we would only be detecting whether the given phrase
exists or not, two successive diffs where the phrase existed (but in different
locations) would be treated the same. It ought to be less expensive as there is
no diffing involved: just a simple text search: Does the phrase exist in
revision T-1? No. Does the phrase exist in revision T-2? No. Does the phrase
exist in T-3? Yes. Stop.
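
The backward scan and the "condensed history" described in this comment might look something like the sketch below. It assumes the revision texts are already available as a list, oldest first, and both function names are hypothetical. There is no diffing, only a substring test per revision, so a phrase that merely moves within the page counts as still present. Every revision's full text still has to be read, which is where the real cost would lie.

```python
from itertools import groupby


def last_revision_with(revisions, phrase):
    """Walk back from the newest revision and stop at the first hit.

    Returns the index of the most recent revision containing phrase,
    or None if it never appears.
    """
    for index in range(len(revisions) - 1, -1, -1):
        if phrase in revisions[index]:
            return index
    return None


def presence_runs(revisions, phrase):
    """Condensed history: consecutive revisions grouped by whether the
    phrase is present, as (present, first_index, last_index) tuples."""
    runs = []
    keyed = groupby(enumerate(revisions), key=lambda pair: phrase in pair[1])
    for present, group in keyed:
        indices = [index for index, _text in group]
        runs.append((present, indices[0], indices[-1]))
    return runs


history = [
    "intro",
    "intro X",
    "intro",
    "intro X",
    "intro X more",
    "intro more",
]
print(last_revision_with(history, "X"))   # 4
for run in presence_runs(history, "X"):
    print(run)
```

Each run that starts with present=True marks a revision where the phrase was added or restored, which also sidesteps the revert problem raised against binary search.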

ayg wrote:

*** Bug 10031 has been marked as a duplicate of this bug. ***

There is an extension [1] that does this now. WONTFIX?

[1] http://www.mediawiki.org/wiki/Extension:Annotation

ayg wrote:

No. This is an important feature for reasonably effective version control and should be in core if at all possible.

inedible_bulk wrote:

I was checking the article on Noah Webster for Americanized words, and noticed that the section on them seemed to incorrectly label American words as British, and vice versa. I wasn't sure where the problem lay (was the section labeling them wrongly, or had they been swapped?), so I checked a slightly older version, which had them right. It took a few clicks on "next" (as I had not realized the change was so recent) to find the culprit:
http://en.wikipedia.org/w/index.php?title=Noah_Webster&diff=prev&oldid=166613821

Some users might have just thought that it was possibly old vandalism and just corrected it by hand. The problem there, as evidenced by the edit I link, is that there was more vandalism than just the section I had noticed it in. The benefit of a blame system shines here, where I can see which revision the edit occurred in and spot additional, previously hidden, edits.

I'm back at my bug, 3 years and 6 dupes later, and I can't really see what the exact status of this bug is. I do like the new partial undo feature though, that is really nice.

ayg wrote:

The most important point for this bug is that it's not at all simple to do with a relational database system. If we had something like git or Bazaar as a backend for revision storage, it would be trivial. The interesting questions at this point seem to be

1a) If someone were to implement version storage for MediaWiki on top of something like git or Bazaar in a manner that doesn't sacrifice existing efficiency, is the Wikimedia Foundation willing to put in the time and effort to transfer the major projects? Or even the minor projects, to start with? (Probably not going to get an unambiguous "yes" here without progress on (2a).)

1b) If so, is anyone willing to do it? (So far, no, and probably not going to be yes unless (1a) is fulfilled.)

2a) Is it possible to implement blame efficiently and scalably on top of an RDBMS? (No evidence for a yes to this that I've seen: Ambush Commander admits that his work is not efficient enough for use right now.)

2b) If so, is anyone willing and able to do it? (So far, no, and definitely not going to be yes unless (2a) is fulfilled.)

The picture is unlikely to change at any time in the foreseeable future, unless we get someone to step forward and put in a lot of work that may or may not end up amounting to something. Put another way, in standard open-source fashion: if you really want it, you're going to have to write it yourself.

*** Bug 13927 has been marked as a duplicate of this bug. ***
*** Bug 18810 has been marked as a duplicate of this bug. ***
demon added a comment. Jul 14 2009, 5:38 PM
*** Bug 18218 has been marked as a duplicate of this bug. ***
Ciencia_Al_Poder added a comment. Edited Apr 21 2011, 6:53 PM

I was also interested in this, but thinking about the day-to-day edits on a wiki, I think that an SVN/CVS-like blame/annotate feature is not feasible in a MediaWiki installation, especially a public one where vandalism is common.

The annotation feature makes sense on a controlled development system where individual changes are not huge. But here at Wikimedia (and other public wikis) where we deal with vandalism, it's common for vandals to blank pages or large sections of a page. That defeats the whole annotation system, since all lines would be marked as changed.

Instead, the idea of Steve Bennett at Comment 26 (posted on T11455) would be more useful here, since it only needs a text or pattern search of every revision's text. That could also be implemented in JavaScript, retrieving every revision's text through the API and doing the search.

T11455 was closed as a duplicate of this one, but I think it's worth reopening it and probably implementing it if this one won't be implemented.
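
A rough sketch of the API-based search suggested in this comment, written in Python rather than JavaScript. It relies on the Action API's prop=revisions query with rvprop=content. The endpoint URL, the User-Agent string and all function names are assumptions made for the example, and the response layout assumed is the formatversion=2 one. The API caps content requests at 50 revisions each, so long histories need many round trips.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"  # assumption: any api.php endpoint


def revisions_query(title, limit=50, cont=None):
    """Parameters for one batch of revisions with content, oldest first."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|user|timestamp|content",
        "rvslots": "main",
        "rvdir": "newer",
        "rvlimit": str(limit),
    }
    if cont:
        params.update(cont)  # continuation values returned by the API
    return params


def matches_in_response(response, phrase):
    """Yield (revid, user, timestamp) for revisions containing phrase."""
    for page in response.get("query", {}).get("pages", []):
        for rev in page.get("revisions", []):
            # Content or user may be absent for hidden (deleted) revisions.
            content = rev.get("slots", {}).get("main", {}).get("content", "")
            if phrase in content:
                yield rev["revid"], rev.get("user"), rev["timestamp"]


def fetch_json(params):
    """Perform one real API request (needs network access)."""
    request = Request(
        API_URL + "?" + urlencode(params),
        headers={"User-Agent": "revision-search-sketch/0.1 (example only)"},
    )
    with urlopen(request) as reply:
        return json.load(reply)


def first_match(title, phrase, fetch=fetch_json):
    """Earliest revision of `title` whose text contains phrase, or None."""
    cont = None
    while True:
        response = fetch(revisions_query(title, cont=cont))
        for match in matches_in_response(response, phrase):
            return match
        cont = response.get("continue")
        if not cont:
            return None
```

The fetch parameter exists only so the search logic can be exercised without network access. A call such as first_match("Noah Webster", "some phrase") would go to the live API.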

psychonaut wrote:

(In reply to comment #28)

There is an extension [1] that does this now. WONTFIX?
[1] http://www.mediawiki.org/wiki/Extension:Annotation

The WikiTrust userscript also has this functionality:
https://de.wikipedia.org/wiki/Benutzer:NetAction/WikiTrust/WikiPraise

epriestley closed this task as Resolved by committing Unknown Object (Diffusion Commit). Mar 4 2015, 8:20 AM
epriestley added a commit: Unknown Object (Diffusion Commit).
Ciencia_Al_Poder reopened this task as Open. Mar 4 2015, 8:43 PM

epriestley closed this task as "Resolved" by committing rPHABeb010b2efc71: Group inline transactions in Pholio.

WTF

Restricted Application added a subscriber: Aklapper. Nov 22 2015, 3:25 PM
Meno25 removed a subscriber: Meno25. Feb 22 2016, 5:48 PM
Quiddity updated the task description. May 31 2016, 4:09 AM
Quiddity set Security to None.
Quiddity added subscribers: Tgr, JEumerus, StudiesWorld, DannyH.
Quiddity updated the task description. May 31 2016, 4:16 AM
This task was proposed in the Community-Wishlist-Survey-2016 and in its current state needs an owner. Wikimedia is participating in Google Summer of Code 2017 and Outreachy Round 14. To the subscribers -- would this task or a portion of it be a good fit for either of these programs? If so, would you be willing to help mentor this project? Remember, each outreach project requires a minimum of one primary mentor and a co-mentor.
devunt added a subscriber: devunt. Aug 5 2017, 5:52 AM
DannyH raised the priority of this task from Low to Normal. Jan 3 2018, 12:39 AM
DannyH added a project: Community-Tech.
DannyH moved this task from Untriaged to To be estimated/discussed on the Community-Tech board.

@tstarling and I are interested in the implementation of this. It would be a good use of the "derived" slots in multicontent revisions. You'd checkpoint the authorship map every N revisions, and store the most recent authorship non-persistently in memcached.
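
One way to read the checkpointing idea, as a toy model only: plain dicts stand in for the persistent "derived" slot storage and for memcached, revisions are numbered consecutively from 1 for simplicity, and the class name and step callback are invented here. The authorship map is persisted every N revisions, and only the newest map is cached. If the cache is lost, a map is rebuilt by replaying at most N - 1 revisions from the nearest checkpoint.

```python
class AuthorshipStore:
    """Toy model of checkpointed authorship maps."""

    def __init__(self, step, interval=100):
        self.step = step            # step(previous_map, rev_number) -> new map
        self.interval = interval    # N: persist every N-th revision
        self.checkpoints = {0: []}  # stand-in for persistent storage
        self.cache = {}             # stand-in for memcached; may vanish

    def save(self, rev_number):
        """Compute and record the authorship map for a new revision."""
        current = self.step(self.get(rev_number - 1), rev_number)
        if rev_number % self.interval == 0:
            self.checkpoints[rev_number] = current
        self.cache = {rev_number: current}  # only the newest map is cached
        return current

    def get(self, rev_number):
        """Authorship map as of rev_number, replaying from a checkpoint
        when the cache does not hold it."""
        if rev_number in self.cache:
            return self.cache[rev_number]
        base = max(n for n in self.checkpoints if n <= rev_number)
        current = self.checkpoints[base]
        for n in range(base + 1, rev_number + 1):
            current = self.step(current, n)
        return current


replayed = []

def toy_step(previous_map, rev_number):
    """Pretend every revision appends one line by a new user."""
    replayed.append(rev_number)
    return previous_map + [f"user{rev_number}"]

store = AuthorshipStore(toy_step, interval=3)
for rev in range(1, 8):
    store.save(rev)

store.cache.clear()      # simulate memcached eviction
replayed.clear()
print(store.get(7))      # rebuilt from the checkpoint at revision 6
print(replayed)          # [7]
```

The step callback must return a new map rather than mutate its argument, since checkpointed maps are shared between calls.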

I don't think the derived slots idea made it into the current MCR plan. @daniel can confirm.

Maybe we could add an API to wikidiff2 to export its operation list to a PHP array, and use that to build blame maps and store them in MySQL. But what should happen if there's revision deletion?

TBolliger renamed this task from Add feature annotate/blame command, to indicate who last changed each line / word to [Epic] Add feature annotate/blame command, to indicate who last changed each line / word. Feb 8 2018, 6:33 PM

@TBolliger I think this could be useful for Anti-Harassment as well, but might require some research. Would it be helpful to know if User:Bananas has edited User:Apples' edit, even though there are many edits in between?

Regardless, perhaps it would be best to add this to the MediaWiki-API so it can be leveraged in other tools (like the interaction timeline, etc.)
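
To make the question concrete: if a per-line authorship map existed for the revision being edited, finding whose text an edit touched would be a short step. The sketch below is purely illustrative. It uses Python's difflib in place of MediaWiki's diff engine, the function name is invented, and it is not a proposed API.

```python
from collections import Counter
from difflib import SequenceMatcher


def authors_touched(old_lines, old_blame, new_lines):
    """Count, per earlier author, how many of their lines the edit
    replaced or deleted. old_blame[i] is the author of old_lines[i]."""
    touched = Counter()
    matcher = SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            touched.update(old_blame[i1:i2])
    return touched


old_lines = ["intro", "apples like pie", "outro"]
old_blame = ["Carol", "Apples", "Carol"]
new_lines = ["intro", "apples hate pie", "outro"]   # an edit by Bananas

print(authors_touched(old_lines, old_blame, new_lines))
```

Here the edit touches one line credited to Apples and none of Carol's, regardless of how many revisions lie between the two users' edits.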

dbarratt updated the task description. Feb 9 2018, 6:54 PM

@TBolliger I think this could be useful for Anti-Harassment as well, but might require some research. Would it be helpful to know if User:Bananas has edited User:Apples' edit, even though there are many edits in between?
Regardless, perhaps it would be best to add this to the MediaWiki-API so it can be leveraged in other tools (like the interaction timeline, etc.)

Something to keep in mind during development. Could definitely be helpful in determining/highlighting edit wars.

Cirdan added a subscriber: Cirdan. May 11 2018, 9:13 AM
Krenair added a subscriber: Krenair. Feb 8 2019, 1:28 AM