
Host API for token persistence dataset
Closed, Declined (Public)

Description

This task is done when we have a batch process that generates token persistence data and an API for accessing that data in useful ways.

Token/Word persistence dataset

  • Basic schema: (token, character_offset, rev_id, page_id, user_id, revisions_persisted(rev_id, character_offset, user_id))
  • Tree structure: user -> page -> edit -> token(s) changed
  • Size calculation
    • 350 MB * 2000 * (1 GB / 1000 MB) * (1 TB / 1000 GB) = 350 * 2 / 1000 TB = 700 / 1000 TB = 0.7 TB
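
The schema, tree structure, and size estimate above could be sketched roughly as follows. This is only an illustration, not the task's actual code: the class and function names (`PersistedRevision`, `TokenRecord`, `build_tree`, `estimated_size_tb`) are hypothetical, the field types are guesses, and reading `revisions_persisted` as "the later revisions in which the token was still present" is an assumption. The task does not say what the 350 MB and 2000 figures refer to, so they appear only as unlabeled default parameters.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class PersistedRevision:
    """One later revision in which the token was still present (assumed meaning)."""
    rev_id: int
    character_offset: int
    user_id: int


@dataclass
class TokenRecord:
    """One token as introduced by a particular edit (field types are assumptions)."""
    token: str
    character_offset: int
    rev_id: int
    page_id: int
    user_id: int
    revisions_persisted: List[PersistedRevision] = field(default_factory=list)


def build_tree(records):
    """Group records as user -> page -> edit (rev_id) -> token(s) changed."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for record in records:
        tree[record.user_id][record.page_id][record.rev_id].append(record)
    return tree


def estimated_size_tb(mb_per_unit: float = 350, units: int = 2000) -> float:
    """Repeat the size calculation with decimal units (1000 MB/GB, 1000 GB/TB)."""
    return mb_per_unit * units / 1000 / 1000


if __name__ == "__main__":
    records = [
        TokenRecord("foo", 0, 10, 1, 7, [PersistedRevision(11, 0, 8)]),
        TokenRecord("bar", 4, 10, 1, 7),
        TokenRecord("baz", 0, 12, 2, 7),
    ]
    tree = build_tree(records)
    # Tokens changed by user 7 on page 1 in edit 10
    print([r.token for r in tree[7][1][10]])
    print(estimated_size_tb())
```

Nested `defaultdict`s keep the grouping code short. A real batch process at this scale would more likely write partitioned files or database rows than hold the tree in memory.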

Dataset

(generated for 2015-06-02)

Code

Use cases

External links

  • TokTrack: A Complete Token Provenance and Change Tracking Dataset for the English Wikipedia

Event Timeline

DarTar renamed this task from Host API for edit productivity dataset to Host API for token persistence dataset. May 2 2017, 10:59 PM
DarTar updated the task description.
DarTar subscribed.
Halfak updated the task description.
Nuria triaged this task as Medium priority. May 4 2017, 4:26 PM
Nuria moved this task from Incoming to Backlog (Later) on the Analytics board.
mforns subscribed.

From team grooming.
Declining, as we haven't had any buy-in for this task.
Please reopen if necessary.