
Automatically generated count and list of contributors to an article (authorship tracking)
Closed, Duplicate (Public)

Description

Author: robert_horning

Description:
When trying to determine "authorship" of an article, one possible method would be to "count" the number of edits each user has made to a given article. This is particularly important when trying to determine who the "principal author" of an article might be when giving citations of the article, or for formal copyright registration.

In short, a quick count "tab" or "button" in the history page would then count each user's contributions in a fashion like this:

User1 (20 edits)
User2 (15 edits)
User3 (7 edits)
49.12.24.127 (3 edits)
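
As a rough illustration of the count described above, here is a minimal Python sketch. It assumes the page history is already available as a plain list of usernames, one per saved revision; the function name `count_edits` and the sample data are hypothetical and not part of MediaWiki.

```python
from collections import Counter


def count_edits(revision_users):
    """Return (user, edit_count) pairs, most edits first."""
    counts = Counter(revision_users)
    # most_common() sorts by count, highest first.
    return counts.most_common()


# Hypothetical history: one entry per saved revision of the article.
history = ["User1"] * 20 + ["User2"] * 15 + ["User3"] * 7 + ["49.12.24.127"] * 3

for user, n in count_edits(history):
    print(f"{user} ({n} edits)")
```

Anonymous edits are simply keyed by IP address here, the same way they appear in the list above.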

To get "fancy" you could even try to eliminate counts from reversions (or even reversion wars), especially to eliminate giving credit to vandals. A simple implementation would only require a simple count.
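
One possible way to discount reversions, sketched in Python under stated assumptions: a revision whose text is byte-for-byte identical to an earlier revision is treated as a revert, and both the revert and everything it undid are left out of the count. This is only one heuristic (it misses partial reverts), and the function name and input shape are hypothetical.

```python
import hashlib
from collections import Counter


def count_edits_without_reverts(revisions):
    """Count edits per user, skipping reverts and the edits they undid.

    revisions: list of (user, text) tuples, oldest first.
    """
    seen = {}        # text hash -> index of the latest revision with that text
    excluded = set()
    for i, (_user, text) in enumerate(revisions):
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            # Everything after the restored revision, up to and including
            # the revert itself, is discounted.
            excluded.update(range(seen[digest] + 1, i + 1))
        seen[digest] = i
    return Counter(
        user for i, (user, _text) in enumerate(revisions) if i not in excluded
    )


# Hypothetical history: Vandal's edit is undone by User1 restoring User2's text.
sample = [
    ("User1", "one"),
    ("User2", "one two"),
    ("Vandal", "spam"),
    ("User1", "one two"),
]
counts = count_edits_without_reverts(sample)
```

In this sample neither the vandal nor the act of reverting earns any credit, which matches the intent described above.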

Another further enhancement would be to list the timestamp for the last edit for each author on a particular article.
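
A minimal sketch of that enhancement, assuming each revision is available as a (user, timestamp) pair and that timestamps are ISO 8601 strings in UTC such as "2005-08-01T12:00:00Z". The function name and the sample data are hypothetical.

```python
from datetime import datetime


def last_edit_per_user(revisions):
    """Map each user to the datetime of their most recent edit.

    revisions: iterable of (user, iso_timestamp) pairs, in any order.
    """
    latest = {}
    for user, stamp in revisions:
        when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
        if user not in latest or when > latest[user]:
            latest[user] = when
    return latest


# Hypothetical history for one article.
edits = [
    ("User1", "2005-08-01T12:00:00Z"),
    ("User2", "2005-08-02T09:30:00Z"),
    ("User1", "2005-08-03T18:45:00Z"),
]
last_seen = last_edit_per_user(edits)
```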

The main purpose of this is to extract the names of all authors for a particular article.


Version: unspecified
Severity: enhancement
See Also:

Details

Reference
bz2994

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 8:43 PM
bzimport set Reference to bz2994.
bzimport added a subscriber: Unknown Object (MLST).

rowan.collins wrote:

If the purpose is to extract the names of all authors for the article, why do
you need to count their edits? I would think that that was a terribly poor
statistic. For instance, I make heavy use of the "Preview" button, or even a
temporary page, when performing multiple or substantial modifications on an
article; other users (particularly less experienced ones) tend to save "little
and often", filling the history with multiple small changes. I don't see how any
system could overcome such biases and represent "relative contributions" to the
article.

For just listing the users from a page's history, I think the software already
has this capability, but it's not available on the Wikimedia servers for
security purposes.

For how to cite an article, see
http://en.wikipedia.org/wiki/Wikipedia:Citing_Wikipedia and bug 800

For determining who has "contributed most" to an article, you might be
interested in this piece of IBM research:
http://www.alphaworks.ibm.com/tech/historyflow/

robert_horning wrote:

The #1 reason for getting the edit counts would be to get a simple method to
determine who may have been major contributors to the article as opposed to
minor edits. This is not a totally consistent rule, as you pointed out, Rowan,
but it at least does give a consistent basis in fact. A more comprehensive rule
would be to try and do a word count for each author, although that may take
quite a bit of server time to try and figure out. Some sort of algorithm might
be derived to determine exactly who wrote what words in a given article, but it
may be tricky to do that. I'm just trying to keep things simple here. I agree
that one author with one edit might write 90% of an article with the other 200+
edits only minor rearranging and vandalism with reverts.

The purpose of this is to primarily organize and quickly come up with the names
of all of the authors for an article, or preferably a series of articles (a
whole Wikibook, for instance) that could be used in a legal context to define
who exactly is the author of the article, and to be able to "file" formal
copyright registration. Some have told me that only the top 5-10 authors need
to be referenced in this fashion, so getting the top 10 users by edit count
would help to determine just who should be included in a formal application.
I'm not throwing out the idea that another metric could be used, but at the same
time I don't want to overload the system trying to do a whole series of reverts
to compare just who wrote what part of the article and base "ownership" on
original word count.

If the IBM technology can be distributed with and function with MediaWiki,
perhaps that is what is needed. This request, however, is for something that I
need for Wikimedia projects specifically, and en.wikibooks.org in particular.
It is a general request because I think it could be useful in other
installations of MediaWiki software.

Legally we _have_ to cite each author. The Wikipedia article on citing
wikipedia articles is wrong when it comes to legal registration practices and
formal citations in a legal setting. As far as "recommended" citations for term
papers and such, it is an easy cop-out to simply ignore the authors of the
articles altogether, and not necessarily required.

rowan.collins wrote:

Well, in my view, counting the edits by each user would just be such a poor
indicator of "relative work" as to be a waste of time. If, as you say, 90% of
the article can be owed to one one-off edit, your list of the "top 5" might as
well just be a "random 5". If you want a heuristic for who has contributed most,
look into the IBM research; if all you're after is a cheap list of authors,
either list them all or pick them at random.

I've just checked, and the software does indeed have a feature for listing the
editors of a page - via ...&action=credits - but it appears to be switched off
or otherwise unavailable on Wikipedia. (Well, the test wiki has it, anyway, see
http://test.leuksman.com/index.php?title=Main_Page&action=credits). This would
seem to me to be much what you need - why limit it any further?

You could of course use database dumps to get at this information, or indeed the
Special:Export feature, which can output entire histories (although this
presumably puts large amounts of strain on the server for heavily edited articles).
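
For illustration, a Python sketch of pulling contributor names out of an export file that has already been downloaded. It assumes the export XML wraps each revision's author in a `contributor` element containing either a `username` or an `ip` child; the sample document and the function names are hypothetical, and the XML namespace is ignored so the sketch does not depend on a particular export schema version.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def contributors_from_export(xml_text):
    """Count revisions per contributor in an exported page history."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for elem in root.iter():
        if local_name(elem.tag) != "contributor":
            continue
        for child in elem:
            # Registered users carry a username, anonymous edits an IP.
            if local_name(child.tag) in ("username", "ip") and child.text:
                counts[child.text] += 1
    return counts


# Hypothetical, heavily trimmed export document.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><contributor><username>User1</username></contributor></revision>
    <revision><contributor><ip>49.12.24.127</ip></contributor></revision>
    <revision><contributor><username>User1</username></contributor></revision>
  </page>
</mediawiki>"""

export_counts = contributors_from_export(SAMPLE)
```

This only shifts the work to the client; the server still has to emit the full history, which is the strain mentioned above.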

robert_horning wrote:

Is there any reason why this feature would be "turned off" from Wikipedia? (the
&action=credits feature?) Is it still considered "buggy" or does it take a lot
of server resources to accomplish? There are valid reasons required in the GFDL
where obtaining this information is not only useful but legally required. Also,
who would have the power to "turn on" such a feature for a given Wiki? As in
the typical Developer/Steward/Bureaucrat/Admin hierarchy?

Doing a DB dump seems like a waste of bandwidth, particularly when all you are
trying to do is get the credits for just a few articles. It would take me a
couple of days to download all of en.wikipedia, for instance. That really isn't
a reasonable request or expectation of a typical user.

rowan.collins wrote:

(In reply to comment #4)

Is there any reason why this feature would be "turned off" from Wikipedia? (the
&action=credits feature?)

I'm not sure, tbh - I think I'll ask on wikitech-l. But my guess is that there's
no efficient way of generating/caching this information, so that it takes large
amounts of server resources. Thinking about it, the only way I can think of
would require accessing the metadata (although not now the text, which is stored
separately) for every revision a page has ever had - which is a lot of revisions
on some pages...

Also, who would have the power to "turn on" such a feature for a given Wiki? As
in the typical Developer/Steward/Bureaucrat/Admin hierarchy?

A developer; I imagine it's a variable in LocalSettings.php.

Doing a DB dump seems like a waste of bandwidth, particularly when all you are
trying to do is get the credits for just a few articles.

Well, if you just want a few articles, you can use [[Special:Export]] to dump
just those articles, including their history. But it's still kind of wasteful, I
agree.

artslave wrote:

There is a way to count editor contributions, since this external site by German
user Aka does it:
http://vs.aka-online.de/wppagehiststat/

(See http://de.wikipedia.org/wiki/Benutzer:Aka)

Perhaps his solution could be adapted into MediaWiki, if it's less taxing on the
database than "&action=credits".

robchur wrote:

A straight count of all revisions in an article's history wouldn't be too bad.
Grouping by username, etc. is where the fun comes in, however, since it's a more
complicated and hence longer query; ultimately, performance is affected.
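
To make the comparison concrete, here is a Python sketch of both queries against an in-memory SQLite table. The table is a simplified stand-in for a revision table (the column names `rev_page`, `rev_user_text` and `rev_timestamp` are assumptions for this example, as is the sample data); it says nothing about how the queries would perform on a production database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE revision (rev_page INTEGER, rev_user_text TEXT, rev_timestamp TEXT)"
)
conn.executemany(
    "INSERT INTO revision VALUES (?, ?, ?)",
    [
        (1, "User1", "20050801000000"),
        (1, "User2", "20050802000000"),
        (1, "User1", "20050803000000"),
        (2, "User3", "20050804000000"),  # a different page, ignored below
    ],
)

# The cheap query: a straight count of revisions for one page.
total = conn.execute(
    "SELECT COUNT(*) FROM revision WHERE rev_page = ?", (1,)
).fetchone()[0]

# The costlier query: group by username, with each user's latest edit.
rows = conn.execute(
    """
    SELECT rev_user_text, COUNT(*) AS edits, MAX(rev_timestamp) AS last_edit
    FROM revision
    WHERE rev_page = ?
    GROUP BY rev_user_text
    ORDER BY edits DESC, rev_user_text
    """,
    (1,),
).fetchall()
```

The grouped query has to visit every revision row for the page and aggregate it, which is where the extra cost comes from on pages with long histories.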

jarlet wrote:

I strongly agree that there should be a better way to cite Wikipedia articles and get authorship information. My concern is how Wikipedia is perceived and used in academia. This in turn has implications for the quality of Wikipedia. Suppose someone who is the main author of an article wants to include a reference to the article in his list of publications submitted when applying for, say, tenure or grant money. The current way,

http://en.wikipedia.org/wiki/Wikipedia:Citing_Wikipedia,

is not good enough. There needs to be an easy way to get information on exactly what some particular author has written. I would suggest that a special link for this be made available, perhaps on the history page. The URL could have, for example, the following format

http://en.wikipedia.org/w/index.php?title=Genetics&oldid=225193947&highlight=Jimbo_Wales,

which should produce a standard view of the page Genetics but with user Jimbo_Wales' contributions highlighted in, say, light yellow. This should work on a per-character and not a per-line basis. Authorship should be preserved for moved text, which seems to be possible if this is based on algorithms such as that of

http://en.wikipedia.org/wiki/User:Cacycle/wikEdDiff

I guess handling reverts of old deletions may be trickier (preserving authorship of text which becomes reinserted by someone other than the original author).

I suppose this would require changes in the MediaWiki software so that it keeps track of the authorship of every byte in the source of each article from every version to the next; I don't see that this should lead to a massive increase in computational load if implemented properly (perhaps some downtime would be needed when making the transition...). The above wikEdDiff page mentions "integration into Mediawiki"...

Roman Nosov wrote an interesting blamemap extension last year that does what you describe here.
Guy Van den Broeck is working on a Visual Diff for this year's SoC: http://code.google.com/soc/2008/wikimedia/appinfo.html?csaid=9813DF0473619117

jarlet wrote:

Very nice! A live demo is still up and running at

http://91.186.7.138:9001/wiki/Freebsd?trackchanges=blamemap

sumanah wrote:

Section "3. Page-level change tracking" of the Quality section of the Feature Map here:

https://www.mediawiki.org/wiki/Feature_map#Quality:_Features_that_directly_support_quality_assurance.2C_assessment_and_labeling

mentions and links to WikiBlame, Daniel Kinzler's Contributors Script, PARC's WikiDashboard, and a few other tools that individuals can use to understand who contributed to a wiki article.

Given the current options, what's the best way to move forward? Perhaps researchers who just want to cite an article's authors could use a user gadget that, for any article, generates a simple list of the authors' names and puts it on the revision history page. As for more complicated needs involving highlighting who-wrote-what, I'm not sure what the best option is.

sumanah wrote:

*** Bug 23327 has been marked as a duplicate of this bug. ***

(In reply to comment #11)

As for more complicated needs involving
highlighting who-wrote-what, I'm not sure what the best option is.

I've no idea what's the best option but WikiTrust did that and is seeking a new maintainer: http://lists.wikimedia.org/pipermail/wiki-research-l/2013-September/003068.html

(In reply to Rob Church from comment #7)

A straight count of all revisions in an article's history wouldn't be too bad.
Grouping by username, etc. is where the fun comes in, however, since it's a more
complicated and hence longer query; ultimately, performance is affected.

Rob did this in https://www.mediawiki.org/wiki/Extension:Contributors around 2006; I'm adding Yaron, the current maintainer, to cc. Then we have action=credits in core.

There are many ways to approach this bug, inside or outside core. The two main lines of work in core can be seen here:
https://bugzilla.wikimedia.org/showdependencygraph.cgi?id=39533&showsummary=on&display=tree&rankdir=TB

The most advanced features are unlikely to be implemented in core but it's useful to have a map of existing and possible work; I hope the graph above helps.

This needs an algorithm at least as good as that in T89763#1066043 to avoid the move and blanking issues described in, e.g., http://wikitrust.soe.ucsc.edu/talks-and-papers
Therefore, let's do this in mediawiki-utilities: https://github.com/halfak/MediaWiki-Utilities

At T29629#323647 @demon wrote:

agree that it's probably a good idea to strike vandal accounts (does action=credits even respect RevDel?)

This task was created before the existence of hideuser - I think that covers this.

Scott awarded a token.

Scott merged a task: T29629: Customizable summary of page editors/authors. Sun, Dec 24, 1:11 PM

@Scott: I don't see how solving this task would allow one to properly identify external authors (from outside the current wiki, and not necessarily inside some WMF wiki). Shouldn't T29629 be re-purposed only for the two use cases listed there?

@He7d3r - you're right, sorry; looks like I glossed over that part. Done.

See also T220893: API for listing authors of an article, which would accomplish this task as stated in the description (but not really the stated purpose of determining the principal author).

In T4994#5109283, @Tgr wrote:

See also T220893: API for listing authors of an article, which would accomplish this task as stated in the description (but not really the stated purpose of determining the principal author).

It's not so important to determine *the* (single) main author; this feature request can be considered satisfied if an external reuser is able to "easily" determine which names to credit for copyright purposes given a limited space.

This task seems to be covered by T120738 which has received a bit more traction / attention lately, hence I'm merging this task into T120738.