
Improvements to revision culling script
Closed, Declined · Public

Description

Proposed improvements

  • Ideally, we'd have something capable of running as a service that could operate on an arbitrary number of tables from a single invocation, with configurable concurrency (as opposed to spinning up multiple invocations in screen sessions). Concurrency should be flexible: we obviously need to limit overall concurrency, but it would also be useful to adjust per-job concurrency (where a job means the pruning of revisions for one or more connected tables, for a wiki group).
  • In wikimedia/rb-m-t-cassandra/pull/191, the new maintenance/thin_out_parsoid.js script iterates entries in a parsoid.html table, and performs batch deletes from parsoid.html, parsoid.data-parsoid, and parsoid.section.offsets using the same PRIMARY KEY. It might be worth generalizing this so that you could specify a set of tables to operate against, with one of them designated to drive the iteration.
  • The process should persist state. This state can be used to automatically pick up where we left off when the process needs to be restarted, and can provide the means of introspecting overall progress.
  • Should be thoroughly instrumented, with metrics sent to statsd/graphite.
  • Should implement proper logging.
    • with irrecoverable errors logged at an elevated level, including all relevant data (e.g. title/pageState/token info when a wide partition needs to be stepped over)
  • The algorithm used to determine which records to cull should be configurable, or at least easy to alter.
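A minimal sketch of the two-level concurrency idea above, assuming a plain promise-based Node.js process; the `Semaphore` and `runLimited` names are illustrative, not from the actual script:

```javascript
// Two-level concurrency limiter: one global semaphore caps total in-flight
// batches across all jobs, and each job additionally caps its own.
class Semaphore {
    constructor(limit) {
        this.limit = limit;
        this.active = 0;
        this.queue = [];
    }
    async acquire() {
        if (this.active < this.limit) {
            this.active += 1;
            return;
        }
        // At capacity: wait until a release hands us a slot.
        await new Promise((resolve) => this.queue.push(resolve));
    }
    release() {
        const next = this.queue.shift();
        if (next) {
            next();          // hand the slot directly to a waiter
        } else {
            this.active -= 1;
        }
    }
}

// Run `task` under both the overall and the per-job limit.
async function runLimited(overallSem, jobSem, task) {
    await overallSem.acquire();
    await jobSem.acquire();
    try {
        return await task();
    } finally {
        jobSem.release();
        overallSem.release();
    }
}
```

Adjusting per-job concurrency then amounts to constructing each job's semaphore with a different limit, while the shared overall semaphore keeps the cluster-wide cap.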
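One way the table-set generalization could look: a job config naming the set of tables plus the one that drives iteration, from which per-table DELETE statements are derived. The config shape, keyspace name, and key columns below are assumptions for illustration, not the actual interface:

```javascript
// Hypothetical job config: scan rows from `iterateOver`, and delete the
// same PRIMARY KEY from every table in `deleteFrom`.
const jobConfig = {
    keyspace: 'parsoid',                              // illustrative name
    iterateOver: 'html',                              // table driving the scan
    deleteFrom: ['html', 'data-parsoid', 'section.offsets'],
    keyColumns: ['domain', 'key', 'rev', 'tid']       // hypothetical key columns
};

// Build one parameterized DELETE statement per target table.
function buildDeletes(cfg) {
    const where = cfg.keyColumns.map((c) => `${c} = ?`).join(' AND ');
    return cfg.deleteFrom.map(
        (t) => `DELETE FROM "${cfg.keyspace}"."${t}" WHERE ${where}`
    );
}
```

The scan over `iterateOver` would then bind each row's key values against all three statements in a single batch, as thin_out_parsoid.js does today for its hard-coded tables.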
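The persisted-state point could be as simple as a JSON file of per-job resume positions, written atomically so a crash never leaves it half-written. The shape of the state (token/pageState per job) is an assumption, not what the script actually stores:

```javascript
const fs = require('fs');

// Persist resume state atomically: write to a temp file, then rename.
function saveState(path, state) {
    const tmp = path + '.tmp';
    fs.writeFileSync(tmp, JSON.stringify(state, null, 2));
    fs.renameSync(tmp, path);
}

// Load prior state; an absent file means a fresh start.
function loadState(path) {
    try {
        return JSON.parse(fs.readFileSync(path, 'utf8'));
    } catch (e) {
        if (e.code === 'ENOENT') {
            return {};
        }
        throw e;
    }
}
```

The same file doubles as the introspection point: reading it shows how far each job has progressed without attaching to the running process.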
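Making the culling algorithm configurable could mean accepting a pluggable policy: a function that, given the rows of one partition (newest first), returns the rows to delete. The "keep the latest N renders per revision" policy below is illustrative, not necessarily the rule the script applies:

```javascript
// Policy factory: keep the newest `n` renders of each revision,
// cull everything older. Rows are assumed ordered newest-first.
function keepLatestN(n) {
    return (rows) => {
        const seen = new Map();   // rev -> renders kept so far
        return rows.filter((row) => {
            const kept = seen.get(row.rev) || 0;
            seen.set(row.rev, kept + 1);
            return kept >= n;     // beyond the first n renders: cull
        });
    };
}
```

Swapping policies (e.g. an age-based cutoff instead) then requires no change to the scan/delete machinery, only a different function passed in.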

See also:

Event Timeline

Eevans moved this task from Backlog to In-Progress on the Cassandra board.
Eevans updated the task description.
Eevans lowered the priority of this task from High to Medium. Aug 3 2016, 7:23 PM
Eevans updated the task description.
GWicke subscribed.

This won't be needed any more with the move to storing only current and recent revisions.