Support posting Jobs to EventBus simultaneously with normal job processing
Closed, Resolved (Public)

Description

To develop, test, debug and stabilize the new Kafka-based job pipeline without disrupting the currently working job processing, and to be able to switch jobs one by one to the new pipeline, starting from the least complex/important jobs, we need to be able to post jobs to the Event-Platform simultaneously with posting them to the Redis-based JobQueue. It's a debugging feature that will be used during the transition period, and it can go away when the transition is complete.

There are several options that we can use:

Option 1: Create an AfterJobPush hook

This is the easiest way to achieve the result right now. The hook will be executed after the job has been posted to JobQueueRedis and will receive an IJobSpecification as a parameter. The Event-Platform extension will subscribe to the hook and try posting the job to Event-Platform, wrapping the serialization code in try/catch and logging any serialization errors. Since the new hook will only be useful during the transition in WMF production, it will be marked as deprecated right from the beginning, clearly documented as a temporary feature, and removed later.
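The flow described above can be sketched roughly as follows. This is a language-neutral illustration in Python, not MediaWiki code: `RedisQueueStub`, `after_push_handlers` and `make_event_poster` are hypothetical names, the job is modelled as a plain dict rather than an IJobSpecification, and "posting to Event-Platform" is reduced to appending a JSON string to a list.

```python
import json
import logging

logger = logging.getLogger("jobqueue.sketch")


class RedisQueueStub:
    """Stand-in for JobQueueRedis; real storage is replaced by a list."""

    def __init__(self):
        self.jobs = []
        self.after_push_handlers = []  # hypothetical hook registry

    def push(self, job_spec):
        self.jobs.append(job_spec)  # the normal push always happens first
        for handler in self.after_push_handlers:
            handler(job_spec)  # then every subscriber is notified


def make_event_poster(sink):
    """Build a hook handler that serializes a job and appends it to `sink`."""

    def on_after_job_push(job_spec):
        try:
            event = json.dumps(job_spec)
        except (TypeError, ValueError):
            # A serialization failure is logged and never breaks the push.
            logger.exception(
                "Could not serialize job of type %r", job_spec.get("type")
            )
            return
        sink.append(event)

    return on_after_job_push
```

Because the handler swallows and logs serialization errors, a job that cannot be serialized still lands in the primary queue; only the mirrored event is lost.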

Pros:

  • Less code right now, lets us start testing the serialization sooner
  • No hard dependency on EventBus extension being installed
  • The serialization/posting code can easily be transferred to the JobQueueEventBus implementation without changes
  • No need for mediawiki-config changes
  • Less likely to break existing pipeline

Cons:

  • Doesn't test the JobQueueEventBus right away, only tests serialization and event posting
  • The hook is public by default, so someone might start using it for other purposes even though it will be documented as a temporary feature

Option 2: Create a JobQueue delegate that posts to 2 separate JobQueues

Create a delegate that would construct two JobQueue implementations: the primary and the secondary one. All method invocations on the delegate would be passed to the primary instance, while posting would also be delegated to the secondary queue. Since this feature is needed only during the transition period, it will be marked as deprecated right away and documented as a temporary feature.
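A minimal sketch of such a delegate, in Python rather than MediaWiki's PHP, might look like this. `InMemoryQueue` and `DualPushQueue` are hypothetical names, and swallowing failures of the secondary push is an assumption made here so that the debugging queue cannot disrupt the primary one.

```python
import logging

logger = logging.getLogger("jobqueue.sketch")


class InMemoryQueue:
    """Toy queue used as a stand-in for both backends."""

    def __init__(self):
        self.jobs = []

    def push(self, job):
        self.jobs.append(job)

    def pop(self):
        return self.jobs.pop(0) if self.jobs else None

    def size(self):
        return len(self.jobs)


class DualPushQueue:
    """Delegates everything to `primary`; mirrors only push() to `secondary`."""

    def __init__(self, primary, secondary):
        self._primary = primary
        self._secondary = secondary

    def push(self, job):
        self._primary.push(job)
        try:
            self._secondary.push(job)
        except Exception:
            # Assumption: a failing secondary push is logged, not propagated.
            logger.exception("Secondary queue push failed")

    def __getattr__(self, name):
        # Any method not defined here (pop, size, ...) goes to the primary.
        return getattr(self._primary, name)
```

The catch-all `__getattr__` keeps the sketch short. Explicit per-method delegation, which a typed interface would require, is where the "not properly delegated" risk listed under Cons comes from.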

Pros:

  • Makes us create and examine the JobQueueEventBus implementation earlier
  • Don't need changes to the existing EventBus extension that should be reverted afterwards
  • Less likely that someone would use the temporary feature

Cons:

  • More non-trivial code is needed
  • More likely to disrupt the current job pipeline if any of the methods is not properly delegated
  • Requires changes to mediawiki-config
  • Creates a hard dependency on EventBus extension being installed

Event Timeline

Pchelolo renamed this task from Support posting to multiple JobQueues simultaneously to Create a hook when job is posted to JobQueue. Jul 11 2017, 5:23 PM
Pchelolo updated the task description.

IIRC the other option you looked into before was to create a "wrapper" JobQueue class that delegates to Redis *and* the new backend class instances?

Yes, but that is actually more heavyweight than this solution. Here we need 2 LOC to invoke the hook, and there we need a pretty big wrapper that delegates all the calls properly, a change to the global mediawiki-config, and a smart way to pass parameters to the individual JobQueue implementations, so that one is way more complicated.

Also, creating a wrapper creates a hard dependency on the EventBus extension being installed on the wiki, which we don't really want for now, I think. Not all production wikis have it, AFAIK.

Heh, this solution is easier but the other one is more correct. In the end, we will end up with MW posting jobs directly to the EventBus, so the hard dependency will be there no matter what (which is not a real concern since the EventBus ext is enabled on all wikis except wikitech and friends).

Creating a wrapper entails having the JobQueue classes that work with EventBus in place, which is a plus, since we can make sure they work even before we actually start processing things on our side of things. It seems to me that going with the hook approach would actually be more work, since once we switch all of the jobs to EventBus we will need to get rid of that hook at which point we will need to have the new code ready and battle-tested. Also, adding a temporary hook is dangerous :) On the other hand, it would allow us to start examining the queues sooner.

Pchelolo renamed this task from Create a hook when job is posted to JobQueue to Support posting Jobs to EventBus simultaneously with normal job processing. Jul 11 2017, 7:22 PM
Pchelolo updated the task description.

I've heavily edited the task description to incorporate input from @GWicke and @mobrovac

@aaron @Tgr @daniel @Joe Do you have any opinion on which option to go with from the MediaWiki perspective? Any input would be much appreciated.

Option 1: Create an AfterJobPush hook

It should actually be a BeforeJobPush hook because once we start switching jobs, we don't want them to be pushed to both entities. Having a Before hook and checking its return value can inform JobQueueRedis whether a particular job needs to be pushed to Redis or not. One more disadvantage I see in this option is the fact that this hook will be somewhat non-standard as it would exist only in the specific JobQueueRedis implementation rather than in the general JobQueue interface, but that's a small detail given that it's a temporary solution.
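To make the suggestion concrete, here is a rough Python sketch of a Before-style hook whose return value decides whether the Redis push still happens. All names (`RedisQueueStub`, `before_push_handlers`, `make_switch_handler`) are hypothetical. Treating a `False` return as "skip the Redis push" is an assumption for this sketch, as is mirroring every job to the new pipeline during the transition.

```python
class RedisQueueStub:
    """Stand-in for JobQueueRedis with a hypothetical before-push hook."""

    def __init__(self):
        self.jobs = []
        self.before_push_handlers = []

    def push(self, job_spec):
        for handler in self.before_push_handlers:
            if handler(job_spec) is False:
                return  # a handler took the job over; skip Redis
        self.jobs.append(job_spec)


def make_switch_handler(switched_types, sink):
    """Mirror every job to `sink`; claim the ones already switched over."""

    def on_before_job_push(job_spec):
        sink.append(job_spec)  # every job is posted to the new pipeline
        # Returning False tells the queue not to push this job to Redis.
        return job_spec["type"] not in switched_types

    return on_before_job_push
```

With this shape, moving a job type to the new pipeline amounts to adding it to `switched_types`; jobs of other types keep going to both destinations.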

What's the desired end state? I imagine you won't use JobQueue::pop / JobQueue::ack as using a hook wouldn't make sense then. So JobQueueEventBus will push to kafka on push, everything else would be a noop, and something else would be responsible for actually running the jobs?

Yes. Both options would only affect the push method, none of the consumer-side methods of the JobQueue will be implemented.
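As an illustration of that end state, a push-only queue could look roughly like the Python sketch below. `PushOnlyQueue` and its `producer` callable are hypothetical, and making the consumer-side methods silent no-ops (rather than, say, raising an error) is an assumption; the linked WIP patch is the authority on the actual behaviour.

```python
import json


class PushOnlyQueue:
    """Producer-side queue: push() emits an event, consumption happens elsewhere."""

    def __init__(self, producer):
        # `producer` is any callable that accepts one serialized event.
        self._producer = producer

    def push(self, job_spec):
        self._producer(json.dumps(job_spec))

    def pop(self):
        # Nothing is consumed through this class.
        return None

    def ack(self, job):
        # Acknowledgement is the job runner's concern, not the queue's.
        pass
```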

Eventually, neither option will be needed; it's only required during the transition/testing/stabilizing period. Here's a very WIP implementation of the JobQueueEventBus for reference: https://gerrit.wikimedia.org/r/#/c/349015/

There's a short outline of the overall solution in T157088 and a presentation from the developer summit https://commons.wikimedia.org/wiki/File:Asynchronous_processing_on_the_WMF_cluster.pdf

Change 364898 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/core@master] WIP: JobQueue: Create an debugging queue

https://gerrit.wikimedia.org/r/364898

Change 364898 merged by jenkins-bot:
[mediawiki/core@master] JobQueue: Create a debugging queue

https://gerrit.wikimedia.org/r/364898

Change 368201 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/core@wmf/1.30.0-wmf.11] JobQueue: Create a debugging queue

https://gerrit.wikimedia.org/r/368201

Change 368206 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] Labs: Enable the debug queue to test JobQueueEventBus.

https://gerrit.wikimedia.org/r/368206

Change 368201 merged by jenkins-bot:
[mediawiki/core@wmf/1.30.0-wmf.11] JobQueue: Create a debugging queue

https://gerrit.wikimedia.org/r/368201

Ok, the code is live on all wikis; now it's time to start gradually enabling it. First step: enable it in labs. I've created the following config change for that: https://gerrit.wikimedia.org/r/#/c/368206/

Change 368206 merged by jenkins-bot:
[operations/mediawiki-config@master] Labs: Enable the debug queue to test JobQueueEventBus.

https://gerrit.wikimedia.org/r/368206

Change 368258 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] JobQueueEventBus: Enable job events in group0 wikis.

https://gerrit.wikimedia.org/r/368258

Change 368258 merged by jenkins-bot:
[operations/mediawiki-config@master] JobQueueEventBus: Enable job events in group0 wikis.

https://gerrit.wikimedia.org/r/368258

Mentioned in SAL (#wikimedia-operations) [2017-08-02T18:11:58Z] <thcipriani@tin> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:368258|JobQueueEventBus: Enable job events in group0 wikis]] T163380 Part I (duration: 00m 47s)

Mentioned in SAL (#wikimedia-operations) [2017-08-02T18:13:05Z] <thcipriani@tin> Synchronized wmf-config/jobqueue.php: SWAT: [[gerrit:368258|JobQueueEventBus: Enable job events in group0 wikis]] T163380 Part II (duration: 00m 47s)

Change 370064 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] JobQueueEventBus: Enable on group1 wikis

https://gerrit.wikimedia.org/r/370064

Change 370064 merged by jenkins-bot:
[operations/mediawiki-config@master] JobQueueEventBus: Enable on group1 wikis

https://gerrit.wikimedia.org/r/370064

Change 370862 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/mediawiki-config@master] Revert "JobQueueEventBus: Enable on group1 wikis"

https://gerrit.wikimedia.org/r/370862

Change 370862 merged by jenkins-bot:
[operations/mediawiki-config@master] Revert "JobQueueEventBus: Enable on group1 wikis"

https://gerrit.wikimedia.org/r/370862

Change 370975 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] JobQueueEventBus: Enable group1 - wikidata.

https://gerrit.wikimedia.org/r/370975

Change 370975 merged by jenkins-bot:
[operations/mediawiki-config@master] JobQueueEventBus: Enable group1.

https://gerrit.wikimedia.org/r/370975

Mentioned in SAL (#wikimedia-operations) [2017-08-16T18:20:13Z] <thcipriani@tin> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:370975|JobQueueEventBus: Enable group1]] T163380 (duration: 00m 54s)

All the jobs are being posted on all wikis in production that support Event-Platform. Resolving.