This task is done when we have a batch process that generates token persistence data and an API for accessing that data in useful ways.
Token/Word persistence dataset
- Basic schema: (token, character_offset, rev_id, page_id, user_id, revisions_persisted(rev_id, character_offset, user_id)) (see the record sketch after this list)
- Tree structure: user -> page -> edit -> token(s) changed
- Size calculation
- 350 MB * 2000 * (1 GB / 1000 MB) * (1 TB / 1000 GB) = 350 * 2000 / 1,000,000 TB = 0.7 TB
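A minimal sketch of one record in the basic schema above, written as Python dataclasses. The field names are taken directly from the schema bullet; the comments and the nesting of revisions_persisted as a list of sub-records are my reading of it, not a confirmed layout.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Persisted:
    """One later revision in which the token was still present."""
    rev_id: int
    character_offset: int
    user_id: int

@dataclass
class TokenRecord:
    token: str
    character_offset: int    # offset of the token in its originating revision
    rev_id: int              # revision that introduced the token
    page_id: int
    user_id: int             # editor credited with introducing the token
    revisions_persisted: List[Persisted] = field(default_factory=list)
```

Grouping these records by user_id, then page_id, then rev_id yields the user -> page -> edit -> token(s) tree described above.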
Dataset
(generated for 2015-06-02)
Code
- https://github.com/halfak/measuring-edit-productivity
- Basic library: https://github.com/mediawiki-utilities/python-mwpersistence
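The heavy lifting is done by the mwpersistence library linked above. As a rough illustration of what the batch process computes, here is a self-contained sketch that tracks token persistence across one page's revision history using only difflib from the standard library. It is a conceptual stand-in, not the library's API, and it ignores things the real implementation handles (reverts, proper tokenization, character offsets).

```python
import difflib

def track_persistence(revisions):
    """Compute which revisions each token survives, for one page.

    revisions: chronological list of (rev_id, user_id, text) tuples.
    Returns the final token state, one record per surviving token.
    """
    state = []  # current page content as a list of token records
    for rev_id, user_id, text in revisions:
        new_words = text.split()
        old_words = [rec["token"] for rec in state]
        matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
        next_state = []
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == "equal":
                # Tokens that survived this revision get persistence credit.
                for rec in state[i1:i2]:
                    rec["revisions_persisted"].append(rev_id)
                    next_state.append(rec)
            else:
                # "insert"/"replace": new tokens originate in this revision;
                # "delete": the old tokens simply drop out of the state.
                for word in new_words[j1:j2]:
                    next_state.append({
                        "token": word,
                        "rev_id": rev_id,      # revision that introduced the token
                        "user_id": user_id,    # editor credited with the token
                        "revisions_persisted": [],
                    })
        state = next_state
    return state

history = [
    (101, "alice", "apples are red"),
    (102, "bob", "apples are red and sweet"),
    (103, "carol", "oranges are orange and sweet"),
]
for rec in track_persistence(history):
    print(rec["token"], rec["user_id"], rec["revisions_persisted"])
```

In the real pipeline each record would also carry page_id and character offsets, matching the schema above.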
Use cases
- https://meta.wikimedia.org/wiki/Research:WikiCredit
- https://en.wikipedia.org/wiki/Wikipedia:WikiBlame