Proposed improvements
- Ideally, we'd have something capable of running as a service that could operate on an arbitrary number of tables from a single invocation, with configurable concurrency (as opposed to spinning up multiple invocations in screen sessions). Concurrency should be flexible: we obviously need to cap overall concurrency, but it would also be useful to adjust per-job concurrency, where a job means the pruning of revisions for one or more connected tables, for a wiki group (a configuration sketch follows this list).
- In wikimedia/rb-m-t-cassandra/pull/191, the new maintenance/thin_out_parsoid.js script iterates over entries in a parsoid.html table and performs batch deletes from parsoid.html, parsoid.data-parsoid, and parsoid.section.offsets using the same PRIMARY KEY. It might be worth generalizing this so that you could specify a set of tables to operate against, with one of the set designated to drive the iteration (see the scan-and-delete sketch below).
- The process should persist state. This state can be used to automatically pick up where we left off when the process is restarted, and provides a means of introspecting overall progress (see the checkpointing sketch below).
- Should be thoroughly instrumented, with metrics sent to statsd/graphite (see the instrumentation sketch below).
- Should implement proper logging
  - with irrecoverable errors logged at an elevated level, carrying all relevant data (think title/pageState/token info when a wide partition needs to be stepped over); see the logging sketch below
- The algorithm used to determine which records to cull should be configurable, or at least easy to alter (see the pluggable-policy sketch below).
- With that in place, it should be trivial to implement one-offs such as T129431: Delete or rerender all content stored under non-normalised titles.
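A minimal sketch of what flexible concurrency configuration might look like, expressed as a Node.js config object. All keys, names, and values here are hypothetical; nothing like this exists yet:

```js
// Hypothetical configuration: a global cap on in-flight work, plus an
// optional per-job cap. A "job" is the pruning of one or more connected
// tables for a wiki group.
const config = {
    concurrency: 16,            // overall cap on concurrent delete batches
    jobs: [
        {
            name: 'enwiki-parsoid',                 // hypothetical job name
            keyspace: 'enwiki_parsoid',             // hypothetical keyspace
            tables: ['html', 'data-parsoid', 'section.offsets'],
            iterateOver: 'html',                    // table that drives the scan
            concurrency: 4                          // per-job cap (<= global cap)
        }
    ]
};

module.exports = config;
```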
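A sketch of the generalized scan-and-delete loop using the DataStax cassandra-driver for Node.js. The column names (`_domain`, `key`, `rev`, `tid`) are assumed from the parsoid bucket schema, and the sketch assumes the supplied table names are valid CQL identifiers (in practice, logical bucket names map to storage-level table names). `shouldCull` is the pluggable policy sketched further down:

```js
'use strict';

// Stream rows from the designated iteration table; for each row selected for
// culling, delete the same PRIMARY KEY from every table in the set.
function prune(client, keyspace, tables, iterateOver, shouldCull) {
    return new Promise((resolve, reject) => {
        const pending = [];
        const scan = client.stream(
            `SELECT "_domain", key, rev, tid FROM "${keyspace}"."${iterateOver}"`,
            [], { prepare: true, fetchSize: 500 });
        scan.on('readable', () => {
            let row;
            while ((row = scan.read()) !== null) {
                if (!shouldCull(row)) { continue; }
                const deletes = tables.map((table) => ({
                    query: `DELETE FROM "${keyspace}"."${table}" `
                         + 'WHERE "_domain" = ? AND key = ? AND rev = ? AND tid = ?',
                    params: [row._domain, row.key, row.rev, row.tid]
                }));
                pending.push(client.batch(deletes, { prepare: true }));
            }
        });
        scan.on('end', () => Promise.all(pending).then(resolve, reject));
        scan.on('error', reject);
    });
}
```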
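For state persistence, the driver already exposes a resumable cursor: each page of results carries an opaque `pageState` token that can be fed back via the `pageState` query option on the next scan. A minimal checkpointing sketch around that (the file format is hypothetical):

```js
'use strict';
const fs = require('fs');

// Persist the paging state after each completed page so a restart can resume
// the scan where it left off, and so progress can be inspected out-of-band.
function saveCheckpoint(file, job, pageState, rowsProcessed) {
    fs.writeFileSync(file, JSON.stringify({
        job,
        pageState,                      // opaque driver paging token
        rowsProcessed,                  // rough progress indicator
        updated: new Date().toISOString()
    }));
}

function loadCheckpoint(file) {
    try {
        return JSON.parse(fs.readFileSync(file, 'utf8'));
    } catch (e) {
        return null;                    // no prior state: start from the beginning
    }
}
```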
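Instrumentation could be as simple as a statsd client wrapped around the hot paths (node-statsd shown here; the metric names are made up):

```js
'use strict';
const StatsD = require('node-statsd');

const metrics = new StatsD({ prefix: 'revision-pruner.' });

// Count scanned and culled rows and time each delete batch, per job, so
// graphite can chart throughput, cull rate, and latency for every job.
function recordBatch(jobName, scanned, culled, latencyMs) {
    metrics.increment(`${jobName}.rows-scanned`, scanned);
    metrics.increment(`${jobName}.rows-culled`, culled);
    metrics.timing(`${jobName}.batch-latency`, latencyMs);
}
```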
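A structured-logging sketch, here with bunyan as an example logger. The point is that an irrecoverable error carries everything needed to investigate, retry, or resume by hand:

```js
'use strict';
const bunyan = require('bunyan');

const log = bunyan.createLogger({ name: 'revision-pruner' });

// When a wide partition has to be stepped over, log at error level with the
// full context (title, paging state, token) needed to pick it up later.
function reportSkippedPartition(title, pageState, token, err) {
    log.error({ err, title, pageState, token }, 'stepping over wide partition');
}
```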
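If the cull decision is factored out as a predicate over each row, the default algorithm becomes one policy among several, and one-offs like T129431 reduce to writing a new predicate. Everything below (the policy names, the `ctx` helpers) is hypothetical:

```js
'use strict';

// Each policy is a predicate: return true to cull the row. The ctx argument
// carries whatever lookup helpers a given policy needs.
const policies = {
    // default: cull any render that is not the newest tid for its (key, rev)
    'keep-latest-render': (row, ctx) =>
        String(row.tid) !== String(ctx.latestTidFor(row.key, row.rev)),
    // one-off in the spirit of T129431: flag content stored under a
    // non-normalised title
    'non-normalised-titles': (row, ctx) =>
        row.key !== ctx.normalizeTitle(row.key)
};

function shouldCull(policyName, row, ctx) {
    return policies[policyName](row, ctx);
}

module.exports = { shouldCull };
```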
See also: