Proposed improvements
- Ideally, we'd have something capable of running as a service that could operate on an arbitrary number of tables from a single invocation, with configurable concurrency (as opposed to spinning up multiple invocations in screen sessions). Concurrency should be flexible: we obviously need to cap overall concurrency, but it would also be useful to adjust per-job concurrency, where a job means the pruning of revisions for one or more connected tables, for a wiki group (a configuration sketch follows this list).
- In wikimedia/rb-m-t-cassandra/pull/191, the new maintenance/thin_out_parsoid.js script iterates over entries in a parsoid.html table and performs batch deletes from parsoid.html, parsoid.data-parsoid, and parsoid.section.offsets using the same PRIMARY KEY. It might be worth generalizing this so that you could specify a set of tables to operate against, with one of the set designated to drive the iteration (see the scan-and-delete sketch below).
- The process should persist state. This state can be used to automatically pick up where we left off when the process is restarted, and provides a means of introspecting overall progress (see the checkpointing sketch below).
- Should be thoroughly instrumented, with metrics sent to statsd/graphite (see the instrumentation sketch below).
- Should implement proper logging
  - with irrecoverable errors logged at an elevated level, carrying all relevant data (think title/pageState/token info when a wide partition needs to be stepped over); see the logging sketch below
- The algorithm used to determine which records to cull should be configurable, or at least easy to alter (see the pluggable-policy sketch below).
- With that in place, it should be trivial to implement one-offs such as T129431: Delete or rerender all content stored under non-normalised titles.
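A minimal sketch of what flexible concurrency configuration might look like, expressed as a Node.js config object. All keys, names, and values here are hypothetical; nothing like this exists yet:

```js
// Hypothetical configuration: a global cap on in-flight work, plus an
// optional per-job cap. A "job" is the pruning of one or more connected
// tables for a wiki group.
const config = {
    concurrency: 16,            // overall cap on concurrent delete batches
    jobs: [
        {
            name: 'enwiki-parsoid',                 // hypothetical job name
            keyspace: 'enwiki_parsoid',             // hypothetical keyspace
            tables: ['html', 'data-parsoid', 'section.offsets'],
            iterateOver: 'html',                    // table that drives the scan
            concurrency: 4                          // per-job cap (<= global cap)
        }
    ]
};

module.exports = config;
```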
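A sketch of the generalized scan-and-delete loop using the DataStax cassandra-driver for Node.js. The column names (`_domain`, `key`, `rev`, `tid`) are assumed from the parsoid bucket schema, and the sketch assumes the supplied table names are valid CQL identifiers (in practice, logical bucket names map to storage-level table names). `shouldCull` is the pluggable policy sketched further down:

```js
'use strict';

// Stream rows from the designated iteration table; for each row selected for
// culling, delete the same PRIMARY KEY from every table in the set.
function prune(client, keyspace, tables, iterateOver, shouldCull) {
    return new Promise((resolve, reject) => {
        const pending = [];
        const scan = client.stream(
            `SELECT "_domain", key, rev, tid FROM "${keyspace}"."${iterateOver}"`,
            [], { prepare: true, fetchSize: 500 });
        scan.on('readable', () => {
            let row;
            while ((row = scan.read()) !== null) {
                if (!shouldCull(row)) { continue; }
                const deletes = tables.map((table) => ({
                    query: `DELETE FROM "${keyspace}"."${table}" `
                         + 'WHERE "_domain" = ? AND key = ? AND rev = ? AND tid = ?',
                    params: [row._domain, row.key, row.rev, row.tid]
                }));
                pending.push(client.batch(deletes, { prepare: true }));
            }
        });
        scan.on('end', () => Promise.all(pending).then(resolve, reject));
        scan.on('error', reject);
    });
}
```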
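For state persistence, the driver already exposes a resumable cursor: each page of results carries an opaque `pageState` token that can be fed back via the `pageState` query option on the next scan. A minimal checkpointing sketch around that (the file format is hypothetical):

```js
'use strict';
const fs = require('fs');

// Persist the paging state after each completed page so a restart can resume
// the scan where it left off, and so progress can be inspected out-of-band.
function saveCheckpoint(file, job, pageState, rowsProcessed) {
    fs.writeFileSync(file, JSON.stringify({
        job,
        pageState,                      // opaque driver paging token
        rowsProcessed,                  // rough progress indicator
        updated: new Date().toISOString()
    }));
}

function loadCheckpoint(file) {
    try {
        return JSON.parse(fs.readFileSync(file, 'utf8'));
    } catch (e) {
        return null;                    // no prior state: start from the beginning
    }
}
```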
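Instrumentation could be as simple as a statsd client wrapped around the hot paths (node-statsd shown here; the metric names are made up):

```js
'use strict';
const StatsD = require('node-statsd');

const metrics = new StatsD({ prefix: 'revision-pruner.' });

// Count scanned and culled rows and time each delete batch, per job, so
// graphite can chart throughput, cull rate, and latency for every job.
function recordBatch(jobName, scanned, culled, latencyMs) {
    metrics.increment(`${jobName}.rows-scanned`, scanned);
    metrics.increment(`${jobName}.rows-culled`, culled);
    metrics.timing(`${jobName}.batch-latency`, latencyMs);
}
```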
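A structured-logging sketch, here with bunyan as an example logger. The point is that an irrecoverable error carries everything needed to investigate, retry, or resume by hand:

```js
'use strict';
const bunyan = require('bunyan');

const log = bunyan.createLogger({ name: 'revision-pruner' });

// When a wide partition has to be stepped over, log at error level with the
// full context (title, paging state, token) needed to pick it up later.
function reportSkippedPartition(title, pageState, token, err) {
    log.error({ err, title, pageState, token }, 'stepping over wide partition');
}
```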
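If the cull decision is factored out as a predicate over each row, the default algorithm becomes one policy among several, and one-offs like T129431 reduce to writing a new predicate. Everything below (the policy names, the `ctx` helpers) is hypothetical:

```js
'use strict';

// Each policy is a predicate: return true to cull the row. The ctx argument
// carries whatever lookup helpers a given policy needs.
const policies = {
    // default: cull any render that is not the newest tid for its (key, rev)
    'keep-latest-render': (row, ctx) =>
        String(row.tid) !== String(ctx.latestTidFor(row.key, row.rev)),
    // one-off in the spirit of T129431: flag content stored under a
    // non-normalised title
    'non-normalised-titles': (row, ctx) =>
        row.key !== ctx.normalizeTitle(row.key)
};

function shouldCull(policyName, row, ctx) {
    return policies[policyName](row, ctx);
}

module.exports = { shouldCull };
```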
See also: