Host API for token persistence dataset
Open, Normal · Public

Description

This task is done when we have a batch process that generates token persistence data and an API for accessing that data in useful ways.

Token/Word persistence dataset

  • Basic schema: (token, character_offset, rev_id, page_id, user_id, revisions_persisted(rev_id, character_offset, user_id))
  • Tree structure: user -> page -> edit -> token(s) changed
  • Size calculation
    • 350 MB * 2000 * (1 GB / 1000 MB) * (1 TB / 1000 GB) = (350 * 2 / 1000) TB = (700 / 1000) TB = 0.7 TB
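
To make the basic schema and the tree structure above concrete, here is a minimal Python sketch. The class and function names (`TokenRecord`, `PersistedRevision`, `build_tree`) are hypothetical and not part of any existing codebase, and the reading of `revisions_persisted` as "later revisions in which the token was still present" is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class PersistedRevision:
    """A revision in which the token persisted (assumed interpretation)."""
    rev_id: int
    character_offset: int
    user_id: int


@dataclass
class TokenRecord:
    """One row of the basic schema; field names follow the task description."""
    token: str
    character_offset: int
    rev_id: int
    page_id: int
    user_id: int
    revisions_persisted: List[PersistedRevision] = field(default_factory=list)


def build_tree(records):
    """Group records as user -> page -> edit (rev_id) -> token(s) changed."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for rec in records:
        tree[rec.user_id][rec.page_id][rec.rev_id].append(rec)
    return tree
```

With records built from made-up IDs, `build_tree(records)[user_id][page_id][rev_id]` would return the tokens that user changed in that edit, in input order.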

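The size calculation above can be checked with a few lines of Python. This is only a sketch: the task does not say what the factor of 2000 counts, so the parameter name `unit_count` is a placeholder, and decimal units (1000 MB per GB, 1000 GB per TB) are used as in the original arithmetic.

```python
MB_PER_GB = 1000  # decimal units, as in the task description
GB_PER_TB = 1000


def estimate_size_tb(size_per_unit_mb: float, unit_count: int) -> float:
    """Convert (size per unit in MB) * (number of units) into terabytes."""
    total_mb = size_per_unit_mb * unit_count
    return total_mb / MB_PER_GB / GB_PER_TB


# 350 MB * 2000 = 700,000 MB = 700 GB = 0.7 TB
print(estimate_size_tb(350, 2000))
```
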
Dataset

(generated for 2015-06-02)

Code

Use cases

External links

  • TokTrack: A Complete Token Provenance and Change Tracking Dataset for the English Wikipedia
Halfak created this task. May 2 2017, 5:04 PM
Restricted Application added a subscriber: Aklapper. May 2 2017, 5:04 PM
Halfak updated the task description. May 2 2017, 5:04 PM
Halfak updated the task description. May 2 2017, 5:06 PM
Halfak added a project: Analytics.
DarTar renamed this task from Host API for edit productivity dataset to Host API for token persistence dataset. May 2 2017, 10:59 PM
DarTar updated the task description.
DarTar added a subscriber: DarTar.
Halfak updated the task description. May 3 2017, 2:44 PM
Halfak updated the task description.
Nuria triaged this task as Normal priority. May 4 2017, 4:26 PM
Nuria moved this task from Incoming to Backlog (Later) on the Analytics board.
DarTar updated the task description. May 4 2017, 6:05 PM
Nuria moved this task from Backlog (Later) to Dashiki on the Analytics board. Jan 3 2018, 10:50 PM
Milimetric moved this task from Dashiki to Backlog (Later) on the Analytics board. Feb 12 2018, 4:50 PM