RfC: ParallelMaintenance helper class for multi-process maintenance scripts
Open, Needs Triage · Public

Description

https://www.mediawiki.org/wiki/Requests_for_comment/Parallel_maintenance_scripts

Many maintenance scripts that process a long series of independent pages or other items can benefit from parallelizing the work across multiple CPUs. This lets CPU-bound work scale across cores, and lets service-bound work wait on more requests in parallel.

Proposed change:

(Major update 2018-09-07)

MediaWiki\Parallel\ForkStreamController is a refactoring of the existing OrderedStreamingForkController used by some CirrusSearch scripts. It takes input provided by a dispatch-logic callback, passes the data as JSON to child processes, which run it through named worker callbacks, and then returns the results, again as JSON, to a result-processing callback in the parent process.
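
Because both the queued data and the results cross the process boundary as JSON, only JSON-serializable values survive the trip. The following is a minimal sketch of that constraint, not code from the patch; jsonRoundTrip() is a hypothetical helper, and decoding to associative arrays is an assumption about how the controller decodes.

```php
<?php
// Hypothetical helper that mimics one trip across the process boundary.
// Assumption: data is decoded as associative arrays (second argument of
// json_decode set to true); the actual patch may decode differently.
function jsonRoundTrip( $value ) {
	return json_decode( json_encode( $value ), true );
}

// Scalars and plain arrays come back unchanged.
var_dump( jsonRoundTrip( 42 ) === 42 ); // bool(true)
var_dump(
	jsonRoundTrip( [ 'title' => 'Main_Page', 'id' => 1 ] )
		=== [ 'title' => 'Main_Page', 'id' => 1 ]
); // bool(true)

// Objects lose their class: only public properties return, as an array.
$item = new stdClass();
$item->id = 1;
var_dump( jsonRoundTrip( $item ) === [ 'id' => 1 ] ); // bool(true)
```

In practice this suggests queueing identifiers (page IDs, titles) rather than rich objects, and reconstructing the objects inside the worker.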

ParallelMaintenance uses this interface to extend the Maintenance base class with a dispatch() method that sets up the workers according to the --threads=N option, if present. Queued data items are routed to named methods on the Maintenance script's child process instance, and results are returned to a callback in the parent scope.

Multiple methods may be used, and are set up via reflection -- methods ending with Worker are mapped to suitable names with the suffix removed. Multiple dispatch calls may be made in a script if you want to do several separate dispatches in series and need to confirm that one set was finished before the next begins.

use MediaWiki\Parallel\IController;

class Foo extends ParallelMaintenance {

  //
  // Use the execute method like any other maintenance script.
  //
  public function execute() {
    $this->output( "Starting processing...\n" );

    //
    // dispatch() sets up forked processes or in-process controller,
    // depending on the "--threads" parameter. The controller and
    // any child processes are set up at this point, and torn down
    // at the end guaranteeing all work items were run.
    //
    $this->dispatch( function ( IController $controller ) {
      //
      // Handle any input on the parent thread, and
      // pass any data as JSON-serializable form into
      // the queue() method, where it gets funneled into
      // a child process.
      //
      for ( $i = 0; $i < 1000; $i++ ) {
        //
        // Any public method ending with 'Worker' is transformed into
        // an event name. Here 'repeat' maps to $this->repeatWorker().
        //
        $controller->queue( 'repeat', $i, function ( $result ) {
          //
          // On the parent thread, receives repeatWorker()'s return value
          // via JSON encode/decode. Here it's a string.
          //
          $this->output( $result . "\n" );
        } );
      }
    } );

    $this->output( "All done!\n" );
  }

  //
  // On the child process, receives the queued value
  // via JSON encode/decode. Here it's a number.
  //
  // If using any global or instance data, beware that
  // you might be running within the parent or the child
  // process, so state may change from other calls.
  //
  // Returned data is routed back to the callback given
  // on $controller->queue().
  //
  public function repeatWorker( $count ) {
    return str_repeat( '*', $count );
  }
}
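
The 'Worker' suffix mapping used in the example could be implemented with reflection roughly as follows. This is a sketch only: discoverWorkers() and the Demo class are hypothetical names, and the patch may do the discovery differently.

```php
<?php
// Hypothetical helper, not part of the proposed patch: builds a map of
// event name => callable from public methods named "<event>Worker".
function discoverWorkers( object $script ): array {
	$workers = [];
	$class = new ReflectionClass( $script );
	foreach ( $class->getMethods( ReflectionMethod::IS_PUBLIC ) as $method ) {
		$name = $method->getName();
		// Require a non-empty event name before the "Worker" suffix.
		if ( preg_match( '/^(.+)Worker$/', $name, $m ) ) {
			$workers[$m[1]] = [ $script, $name ];
		}
	}
	return $workers;
}

// Illustrative class standing in for a maintenance script.
class Demo {
	public function repeatWorker( int $count ): string {
		return str_repeat( '*', $count );
	}

	// Not a worker: no "Worker" suffix.
	public function execute(): void {
	}

	// Not a worker: not public.
	protected function hiddenWorker(): void {
	}
}

$workers = discoverWorkers( new Demo() );
echo implode( ',', array_keys( $workers ) ), "\n"; // repeat
echo call_user_func( $workers['repeat'], 3 ), "\n"; // ***
```

Restricting discovery to public methods keeps helper methods from being exposed as events by accident.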

The script gains a --threads=N option. If a thread count is provided, it automatically forks separate processes; otherwise it runs the work callback in-process.

Notes on connections and data availability:

  • creating child processes with pcntl_fork is a Unix-only thing (e.g. Mac/Linux); this is not currently supported on Windows hosts, but they can run a single thread in-process.
  • general MediaWiki setup state remains in memory in the child processes, but once they're forked each is an independent process with its own copy of that state
  • when a child process is forked via MediaWiki\Parallel\ForkStreamController, it closes off connections, so DB connections will be reset. They should automatically reconnect on use.
  • each child process is created once at the beginning, and will process 0 or more items during its lifetime
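
The notes above can be illustrated with a stripped-down sketch: one long-lived child, created once, that handles zero or more JSON-encoded items, with an in-process fallback when pcntl_fork is unavailable. runItems() is a hypothetical function, not the ForkStreamController API, and it uses a single child and a blocking round trip per item purely to keep the example short.

```php
<?php
// Hypothetical sketch, not the patch's implementation.
function runItems( array $items, callable $worker, bool $fork ): array {
	if ( !$fork || !function_exists( 'pcntl_fork' ) ) {
		// In-process fallback (e.g. on Windows): same results, no parallelism.
		return array_map( $worker, $items );
	}

	// A bidirectional channel between parent ($pair[0]) and child ($pair[1]).
	$pair = stream_socket_pair( STREAM_PF_UNIX, STREAM_SOCK_STREAM, STREAM_IPPROTO_IP );
	$pid = $pair === false ? -1 : pcntl_fork();
	if ( $pid === -1 ) {
		throw new RuntimeException( 'Could not set up the child process' );
	}

	if ( $pid === 0 ) {
		// Child: inherits the parent's in-memory state, then loops over
		// queued items until the parent closes its end of the channel.
		fclose( $pair[0] );
		while ( ( $line = fgets( $pair[1] ) ) !== false ) {
			$result = $worker( json_decode( $line, true ) );
			fwrite( $pair[1], json_encode( $result ) . "\n" );
		}
		exit( 0 );
	}

	// Parent: send each item as one JSON line, read one JSON line back.
	fclose( $pair[1] );
	$results = [];
	foreach ( $items as $item ) {
		fwrite( $pair[0], json_encode( $item ) . "\n" );
		$results[] = json_decode( fgets( $pair[0] ), true );
	}
	fclose( $pair[0] );
	pcntl_waitpid( $pid, $status );
	return $results;
}

$stars = function ( $n ) {
	return str_repeat( '*', $n );
};
echo implode( ' ', runItems( [ 1, 2, 3 ], $stars, true ) ), "\n"; // * ** ***
```

A real controller would fork several children and multiplex over them; anything holding a connection (such as a database handle) would also need to be closed or reset in the child right after the fork, as the notes describe.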

Some possible alternative implementations for parallel processing:

  • using pthreads instead of pcntl_fork would be more compatible with Windows, but the pthreads extension for PHP doesn't seem to be well packaged and doesn't share global state, which would complicate threading setup.
  • launching sub-processes through proc_open() and piping over stdin/out would also work on Windows, but again doesn't share global state, so would have to be able to launch a script that launches the right class.
    • MediaWiki\Parallel\ExecStreamController provides the equivalent interface over proc_open, requiring the called script to manually launch a MediaWiki\Parallel\StreamWorker. Not yet exercised but can be added with a maintenance script as a 'router'.
  • better tools for using the job queue, if you can rely on it for speed, could be useful; TimedMediaHandler's requeueTranscodes.php does manual throttling to keep from flooding the queue for instance.
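
For comparison, the proc_open() alternative might look like the sketch below. It is not ExecStreamController: the child program is passed inline via php -r only to keep the example self-contained, whereas a real setup would launch a script that boots MediaWiki and starts the worker. The array form of the command requires PHP 7.4 or later.

```php
<?php
// Inline child program (illustrative): reads one JSON value per line
// from stdin and writes one JSON result per line to stdout.
$childCode = <<<'PHP'
while ( ( $line = fgets( STDIN ) ) !== false ) {
	echo json_encode( str_repeat( '*', json_decode( $line, true ) ) ), "\n";
}
PHP;

$proc = proc_open(
	[ PHP_BINARY, '-r', $childCode ],
	[ 0 => [ 'pipe', 'r' ], 1 => [ 'pipe', 'w' ] ],
	$pipes
);
if ( !is_resource( $proc ) ) {
	throw new RuntimeException( 'Could not launch the child process' );
}

$results = [];
foreach ( [ 1, 2, 3 ] as $n ) {
	// The child shares no state with the parent, so everything it needs
	// must travel over the pipe.
	fwrite( $pipes[0], json_encode( $n ) . "\n" );
	fflush( $pipes[0] );
	$results[] = json_decode( fgets( $pipes[1] ), true );
}

// Closing stdin signals end of input, so the child's loop terminates.
fclose( $pipes[0] );
fclose( $pipes[1] );
proc_close( $proc );

echo implode( ' ', $results ), "\n"; // * ** ***
```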

Open questions:

  • should this share more with the job queue infrastructure for job -> class routing and serialization?
  • should this be expanded to be able to send jobs to the job queue?
  • bikeshed remaining naming/yak-shaving issues with the MediaWiki\Parallel\* classes and interfaces, or method naming?

Still todo in the patch revisions:

  • handle exceptions cleanly
  • test the exec mode
  • hook for closing connections on fork

Event Timeline

brion created this task. · Aug 14 2018, 10:58 PM
Restricted Application added a subscriber: Aklapper. · Aug 14 2018, 10:58 PM

Change 451099 had a related patch set uploaded (by Brion VIBBER; owner: Brion VIBBER):
[mediawiki/core@master] ParallelMaintenance, QueueingForkController

https://gerrit.wikimedia.org/r/451099

brion updated the task description. · Aug 20 2018, 8:48 PM
brion added a project: TechCom-RFC.

Adding to techcom-rfc board to make sure we don't forget to discuss this, if cross-cutting issues are foreseen.

brion renamed this task from ParallelMaintenance helper class for multi-process maintenance scripts to RfC: ParallelMaintenance helper class for multi-process maintenance scripts. · Aug 21 2018, 2:47 AM
brion updated the task description.
daniel moved this task from Inbox to Under discussion on the TechCom-RFC board. · Aug 22 2018, 8:41 PM

The ability to communicate results back from the children is an important thing that has not been possible before, e.g. rebuildLocalisationCache.php doesn't print the number of languages rebuilt when using --threads.

TechCom is hosting an IRC meeting next week on 5 September 2pm PST (21:00 UTC, 23:00 CET) in #wikimedia-office

brion added a comment. · Edited · Sep 5 2018, 8:26 PM

Note via Timo -- currently we run maint scripts in production through hhvm, which sets processor affinity on all child processes. :P

Should work fine still, but won't actually parallelize until we move to zend php7. \o/

brion added a comment. · Sep 7 2018, 10:13 PM

I'm retooling the proposal based on feedback. Key things:

  • switching from loop to dispatch to avoid confusion about inner loop body vs outer
  • removing exposure of the drain() method; instead, you can set up runner contexts and run additional stuff after they're done.
  • switching from interfaces with fixed method names to callbacks, allowing multiple work types
  • using regular execute() method, with a helper fork() method (bikeshed that name!) that creates the controller, a dispatch callback, and the work/result callbacks

Considering also first-class support for multiple event types within a single go to avoid having to manually add it.

brion added a comment. · Sep 7 2018, 10:17 PM

using regular execute() method, with a helper fork() method (bikeshed that name!) that creates the controller, a dispatch callback, and the work/result callbacks

Oh! I should call *that* $this->dispatch().

brion updated the task description. · Sep 8 2018, 2:15 AM

Updated summary with the reworked API for ParallelMaintenance.

brion added a comment. · Sep 8 2018, 2:33 AM

(Still have to see if the exec mode can be got working, do a little debugging, and add a hook for closing extensions.)

brion updated the task description. · Sep 8 2018, 3:06 PM
brion added a comment. · Sep 8 2018, 3:22 PM

Feedback question -- is the method name mapping from 'somethingWorker' to 'something' too clever? Should it just let you pass any method name to $controller->queue()?

Krinkle moved this task from Under discussion to Inbox on the TechCom-RFC board. · Edited · Oct 16 2019, 10:10 PM
Krinkle added a subscriber: Krinkle.

Problem statement is clear. Proposal has been formed. No objections from TechCom and no stakeholders were identified beyond CPT and the various developers that have already participated so far, including during the IRC meeting.

Moving to our Inbox for next week. I will propose then that TechCom put this on Last Call with intent to approve 3 weeks from now.

Milimetric moved this task from Inbox to Last Call on the TechCom-RFC board. · Oct 23 2019, 8:39 PM
Milimetric added a subscriber: Milimetric.

Moving to last call per decision in today's TechCom meeting, looking to approve in 3 weeks.

Feedback question -- is the method name mapping from 'somethingWorker' to 'something' too clever? Should it just let you pass any method name to $controller->queue()?

My preference would be the latter. PHP doesn't have a refactoring-friendly way of referencing methods (unless you count [ Foo::class, 'methodname' ]), but being able to search for the exact method name to see if it is used is a desirable property. Apart from that, the interface seems good.

Another thought I had was whether this should be an interface (or a trait) instead of a class, in case subclassing is already used for something else -- but that seems quite rare.

+1. Looks like we'll have a 1-1 relation between ForkStreamController and ParallelMaintenance objects, as such we could presumably use the method name directly as event name without them being separate strings or concepts. That way the correlation is more obvious, e.g. $controller->queue( 'repeatWorker', 42 ); using the task description's example.

I do like the idea of enforcing the naming convention, though. In particular because it makes it easier to recognise in code that this method is "special" in that it doesn't share state with the parent and is meant to be called only by the Controller. Using a strictly-enforced naming convention is one way to do that. A few other ways that might work:

  1. Require the user to bind the method explicitly.

Procedural
public function execute() {
  $this->dispatch( function ( $controller ) {
    $controller->bindWorker( "repeat", "onRepeat" );
    $controller->queue( "repeat", 42 );
  } );
}
public function onRepeat( int $len ) : string {}

  2. Or, require the method to have a certain annotation.

Declarative
public function execute() {
  $this->dispatch( function ( $controller ) {
    $controller->queue( "repeat", 42 );
  } );
}

/** @workerEvent repeat */
public function repeatWorker( int $len ) : string {}

If this makes the overall boilerplate too long, we could potentially reduce that in turn from the other end. For example by making execute() already implemented and making the dispatch closure a pre-bound method, like so:

Shortened (procedural)
public function dispatcher( $controller ) { // called from final public ParallelMaintenance::execute
  $controller->bindWorker( "repeat", "onRepeat" );
  $controller->queue( "repeat", 42 );
}

public function onRepeat( int $len ) : string {}

Shortened (declarative)
public function dispatcher( $controller ) {
  $controller->queue( "repeat", 42 );
}

/** @workerEvent repeat */
public function repeatWorker( int $len ) : string {}
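
As a rough illustration of the declarative variant, the @workerEvent annotation could be read from method doc comments via reflection. This is a hypothetical sketch: discoverAnnotatedWorkers() and AnnotatedDemo are made-up names, and it assumes doc comments are retained at runtime (they are dropped when opcache.save_comments is disabled).

```php
<?php
// Hypothetical helper: maps "@workerEvent <name>" annotations on public
// methods to callables. Not part of the patch.
function discoverAnnotatedWorkers( object $script ): array {
	$workers = [];
	$class = new ReflectionObject( $script );
	foreach ( $class->getMethods( ReflectionMethod::IS_PUBLIC ) as $method ) {
		$doc = $method->getDocComment();
		if ( $doc !== false && preg_match( '/@workerEvent\s+(\w+)/', $doc, $m ) ) {
			$workers[$m[1]] = [ $script, $method->getName() ];
		}
	}
	return $workers;
}

// Illustrative class standing in for a maintenance script.
class AnnotatedDemo {
	/** @workerEvent repeat */
	public function repeatWorker( int $len ): string {
		return str_repeat( '*', $len );
	}

	/** No annotation here, so this method is not exposed as an event. */
	public function helper(): void {
	}
}

$workers = discoverAnnotatedWorkers( new AnnotatedDemo() );
echo implode( ',', array_keys( $workers ) ), "\n"; // repeat
echo call_user_func( $workers['repeat'], 4 ), "\n"; // ****
```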
daniel added a subscriber: daniel.

This RFC is approved after Last Call per today's TechCom meeting.