https://www.mediawiki.org/wiki/Requests_for_comment/Parallel_maintenance_scripts
Many maintenance scripts that process a long series of independent pages or other items can benefit from parallelizing the work across multiple CPUs. This lets CPU-bound work scale across cores, and service-bound work wait on more requests in parallel.
Proposed change:
- core change adding ParallelMaintenance and MediaWiki\Parallel\* helpers and porting several maint scripts: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/451099
- TimedMediaHandler change using ParallelMaintenance in a test script: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TimedMediaHandler/+/232214
(Major update 2018-09-07)
MediaWiki\Parallel\ForkStreamController is a refactoring of the existing OrderedStreamingForkController used by some CirrusSearch scripts. It takes input provided by a dispatch-logic callback, passes the data as JSON to child processes where it is run through named worker callbacks, then collects the JSON-encoded results back in the parent process and hands them to a result-processing callback.
ParallelMaintenance uses this interface to extend the Maintenance base class with a dispatch method that sets up the workers according to the --threads=N setting, if present. Queued data items are routed to named methods on the Maintenance script's instance in a child process, and results are returned to a callback in the parent scope.
Multiple worker methods may be used; they are discovered via reflection -- public methods ending with 'Worker' are mapped to event names with the suffix removed. Multiple dispatch calls may also be made in a script if you want to run several separate dispatches in series and need to confirm that one set has finished before the next begins.
use MediaWiki\Parallel\IController;

class Foo extends ParallelMaintenance {
	//
	// Use the execute method like any other maintenance script.
	//
	public function execute() {
		$this->output( "Starting processing...\n" );

		//
		// dispatch() sets up forked processes or an in-process controller,
		// depending on the "--threads" parameter. The controller and
		// any child processes are set up at this point, and torn down
		// at the end, guaranteeing all work items were run.
		//
		$this->dispatch( function ( IController $controller ) {
			//
			// Handle any input on the parent thread, and
			// pass any data in JSON-serializable form into
			// the queue() method, where it gets funneled into
			// a child process.
			//
			for ( $i = 0; $i < 1000; $i++ ) {
				//
				// Any public method ending with 'Worker' is transformed into
				// an event name. Here 'repeat' maps to $this->repeatWorker().
				//
				$controller->queue( 'repeat', $i, function ( $result ) {
					//
					// On the parent thread, receives repeatWorker()'s return value
					// via JSON encode/decode. Here it's a string.
					//
					$this->output( $result . "\n" );
				} );
			}
		} );

		$this->output( "All done!\n" );
	}

	//
	// On the child process, receives the queued value
	// via JSON encode/decode. Here it's a number.
	//
	// If using any global or instance data, beware that
	// you might be running within the parent or the child
	// process, so state may change from other calls.
	//
	// Returned data is routed back to the callback given
	// on $controller->queue().
	//
	public function repeatWorker( $count ) {
		return str_repeat( '*', $count );
	}
}
The script gains a --threads=N option; if a thread count is provided, it will automatically fork separate worker processes, otherwise it will run the work callbacks in-process.
Notes on connections and data availability:
- creating child processes with pcntl_fork is Unix-only (e.g. Linux, macOS); it is not currently supported on Windows hosts, but they can still run the work single-threaded in-process.
- general MediaWiki setup state remains in memory in the child processes, but once forked each child is an independent process, so later state changes are not shared with the parent or siblings.
- when a child process is forked via MediaWiki\Parallel\ForkStreamController, open connections are closed, so DB connections will be reset; they should automatically reconnect on use.
- each child process is created once at the beginning, and will process 0 or more items during its lifetime
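The notes above can be illustrated with the underlying fork-and-pipe pattern. This is a hedged sketch, not the actual ForkStreamController implementation, and it is written in Python (using POSIX fork, the same mechanism as PHP's pcntl_fork) rather than PHP for brevity; `run_worker` is a hypothetical helper showing how a parent can fork a child once, stream newline-delimited JSON work items over one pipe, and collect JSON results over another until the input is exhausted.

```python
import json
import os


def run_worker(work_items, worker):
    """Fork one child; stream JSON items to it and collect JSON results.

    Illustrative only: the real ForkStreamController manages multiple
    children and MediaWiki state; this shows just the pipe protocol.
    """
    to_child_r, to_child_w = os.pipe()
    to_parent_r, to_parent_w = os.pipe()

    pid = os.fork()
    if pid == 0:
        # Child: read newline-delimited JSON items, apply the worker
        # callback, write JSON results back, then exit.
        os.close(to_child_w)
        os.close(to_parent_r)
        with os.fdopen(to_child_r) as inp, os.fdopen(to_parent_w, 'w') as out:
            for line in inp:
                result = worker(json.loads(line))
                out.write(json.dumps(result) + '\n')
        os._exit(0)

    # Parent: send every item, then close the write end so the child
    # sees EOF, then read the results back in order.
    os.close(to_child_r)
    os.close(to_parent_w)
    with os.fdopen(to_child_w, 'w') as out:
        for item in work_items:
            out.write(json.dumps(item) + '\n')
    with os.fdopen(to_parent_r) as inp:
        results = [json.loads(line) for line in inp]
    os.waitpid(pid, 0)
    return results


print(run_worker([1, 2, 3], lambda n: '*' * n))
```

As in the notes above, the child is forked once and then handles zero or more items during its lifetime; closing the parent's write end is what tells it to finish.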
Some possible alternative implementations for parallel processing:
- using pthreads instead of pcntl_fork would be more compatible with Windows, but the pthreads extension for PHP doesn't seem to be well packaged and doesn't share global state, which would complicate threading setup.
- launching sub-processes through proc_open() and piping over stdin/stdout would also work on Windows, but again doesn't share global state, so the parent would have to launch a script that bootstraps the environment and routes to the right class.
- MediaWiki\Parallel\ExecStreamController provides the equivalent interface over proc_open, requiring the called script to manually launch a MediaWiki\Parallel\StreamWorker. Not yet exercised but can be added with a maintenance script as a 'router'.
- better tooling around the job queue could be useful if it can be relied on for speed; TimedMediaHandler's requeueTranscodes.php, for instance, does manual throttling to keep from flooding the queue.
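The proc_open()-style alternative in the list above can run the same JSON stream protocol over an explicitly launched subprocess, which also works on Windows. This Python sketch is illustrative only: `run_exec_worker` and the inline worker script are hypothetical stand-ins for ExecStreamController and a StreamWorker 'router' script, not the real classes.

```python
import json
import subprocess
import sys

# The 'router' script a proc_open-style controller would launch: it
# reads newline-delimited JSON work items on stdin and writes JSON
# results to stdout. In MediaWiki this role would be played by a
# maintenance script wrapping MediaWiki\Parallel\StreamWorker.
WORKER_SCRIPT = r"""
import json, sys
for line in sys.stdin:
    n = json.loads(line)
    print(json.dumps('*' * n))
"""


def run_exec_worker(work_items):
    # Launch the worker as a separate process over stdin/stdout pipes;
    # no state is shared, so everything travels as JSON.
    proc = subprocess.run(
        [sys.executable, '-c', WORKER_SCRIPT],
        input=''.join(json.dumps(i) + '\n' for i in work_items),
        capture_output=True, text=True, check=True,
    )
    return [json.loads(line) for line in proc.stdout.splitlines()]


print(run_exec_worker([1, 2, 3]))
```

Because the child is a fresh process rather than a fork, it inherits no in-memory setup state, which is why this mode needs an explicit script to bootstrap and route work to the right class.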
Open questions:
- should this share more with the job queue infrastructure for job -> class routing and serialization?
- should this be expanded to be able to send jobs to the job queue?
- bikeshed any remaining naming issues with the MediaWiki\Parallel\* classes, interfaces, and methods?
Still todo in the patch revisions:
- handle exceptions cleanly
- test the exec mode
- hook for closing connections on fork