Author: mshavlovsky
Description:
We propose to implement authorship tracking for the text of Wikipedia. The goal is to annotate every word of Wikipedia content with the revision in which it was inserted and the author who created it.
We have been developing robust and efficient algorithms for computing the authorship information. The algorithms compare each new revision of a page with all previous revisions, and attribute any new content in the latest revision to its earliest plausible match in previous content. In this way, if content is deleted (e.g. by a vandal, or in the course of a dispute) and later re-inserted, it is still correctly attributed to its original author. To be efficient, the algorithm keeps a specially encoded summary of the history of a wiki page. The size of this summary is proportional to the amount of change that the page has undergone; because we drop information on content that has been absent for more than 90 days and more than 100 edits, the summary is on average about 10 times the size of a typical revision. When a user creates a new revision, the algorithm:
1. Reads the page summary.
2. Computes the authorship for the new revision, and stores it.
3. Stores an updated summary of the history, which also includes the new revision.
The process takes about one second of processing time per revision, including the time to serialize and deserialize the summary, which generally dominates.
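To make the per-revision steps concrete, here is a minimal Python sketch of the read / attribute / store cycle. It is not the project's algorithm or its summary encoding: it assumes a JSON summary made of word "chunks", uses difflib.SequenceMatcher for matching, and treats a run of at least two matching words as "plausible". The names process_revision, MIN_MATCH_WORDS, seen_edit and seen_time are hypothetical, and revision ids are assumed to increase over time.

```python
import json
from difflib import SequenceMatcher

MAX_ABSENT_SECONDS = 90 * 24 * 3600  # "90 days" from the description
MAX_ABSENT_EDITS = 100               # "100 edits" from the description
MIN_MATCH_WORDS = 2                  # hypothetical threshold for a "plausible" match


def process_revision(summary_json, rev_id, author, timestamp, text):
    """Return (per-word origins, updated summary JSON) for one new revision."""
    # Step 1: read (deserialize) the page summary; None means a new page.
    if summary_json:
        summary = json.loads(summary_json)
    else:
        summary = {"edits": 0, "chunks": []}
    edit_no = summary["edits"] + 1
    words = text.split()
    origins = [None] * len(words)
    kept = []

    # Step 2: attribute each word to the earliest origin among its matches.
    for chunk in summary["chunks"]:
        old_words = chunk["words"]
        used = [False] * len(old_words)
        matcher = SequenceMatcher(None, old_words, words, autojunk=False)
        for a, b, size in matcher.get_matching_blocks():
            if size < MIN_MATCH_WORDS:
                continue  # too short to count as a plausible match
            for k in range(size):
                used[a + k] = True
                candidate = chunk["origins"][a + k]
                current = origins[b + k]
                if current is None or candidate[0] < current[0]:
                    origins[b + k] = candidate

        # Words absent from the new revision are kept as "dead" runs so that a
        # later re-insertion can still find them, unless they have been absent
        # for more than 90 days and more than 100 edits.
        stale = (edit_no - chunk["seen_edit"] > MAX_ABSENT_EDITS
                 and timestamp - chunk["seen_time"] > MAX_ABSENT_SECONDS)
        if stale:
            continue
        run = None
        for i, was_used in enumerate(used):
            if was_used:
                run = None
                continue
            if run is None:
                run = {"words": [], "origins": [],
                       "seen_edit": chunk["seen_edit"],
                       "seen_time": chunk["seen_time"]}
                kept.append(run)
            run["words"].append(old_words[i])
            run["origins"].append(chunk["origins"][i])

    # Anything still unattributed is new content by the current author.
    origins = [o if o is not None else [rev_id, author] for o in origins]

    # Step 3: store (serialize) the updated summary, including this revision.
    kept.append({"words": words, "origins": origins,
                 "seen_edit": edit_no, "seen_time": timestamp})
    return origins, json.dumps({"edits": edit_no, "chunks": kept})
```

Calling process_revision once per revision, and passing the returned summary into the next call, reproduces the behaviour described above on small inputs: text that is blanked by one edit and restored by a later one is still attributed to the revision that first introduced it.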
The algorithm code is already available and works. What we propose for this Summer of Code project is to make the algorithm run on the actual Wikipedia by integrating it with the production environment, text store, temporary database tables, etc., as required to make it work for as many language editions of Wikipedia as possible or desired.
Detailed Information
https://www.mediawiki.org/wiki/User:Mshavlovsky/Authorship_Tracking
Version: unspecified
Severity: enhancement