
Allow easy tuning of the jobqueue concurrency.
Open, Low, Public

Description

There are several cases in which some event (an edit to a very popular template, for instance) triggers a huge number of jobs to be enqueued for one specific type and one wiki. In those cases, with the current jobqueue we're completely unable to react by raising the number of workers for that specific job type/wiki.

We want the new transport to be smarter, and in fact I know changeprop already has better handles for concurrency. What I would like to have is the ability to change concurrency quickly in reaction to some event, without going through the cycle of puppet patch/review/merge/apply/restart that we need with the current jobrunner (which isn't able to raise the concurrency for a specific wiki either).

So ideally operations folks would like to have what follows:

  • We have a global concurrency for all changeprop requests, be it truly global (across the cluster) or local to a specific instance. This will allow us to fine-tune the number of running jobs on either side - have the same number of requests as the number of hhvm workers we globally dedicate to this duty.
  • Each job type *can* have a weight, with the default being a weight of 1. Each job type i will then have a maximum concurrency of max(w_i / sum_j(w_j) * global concurrency, 1)
  • We should be able to change the weight of a job type without a code review or a full restart of the service
  • Ideally, we should be able to modify the weight for a specific wiki too, but this is just a nice to have in my opinion.
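
To make the weight formula above concrete, here is a minimal sketch of how the per-type limits could be derived. The function name, the job type names used in the usage note and the choice to round down are all assumptions for illustration; the proposal itself does not say how fractional results should be rounded.

```python
def max_concurrency(weights, global_concurrency):
    """Per-job-type limit: max(w_i / sum_j(w_j) * global_concurrency, 1).

    `weights` maps job type -> weight. Rounding down to an integer is an
    assumption; the proposal does not specify a rounding rule.
    """
    total = sum(weights.values())
    return {
        job_type: max(int(global_concurrency * weight / total), 1)
        for job_type, weight in weights.items()
    }
```

For example, `max_concurrency({"refreshLinks": 8, "htmlCacheUpdate": 1, "other": 1}, 100)` gives refreshLinks 80 slots and the other two 10 each. Note that the floor of 1 means the per-type limits can add up to slightly more than the global concurrency when there are many low-weight job types.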

To this aim, given service-runner has the ability to reload config files, it could be enough to have a dedicated file including the concurrency setting generated from etcd via confd, and then send a signal to changeprop in order for it to re-read the configs.
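
A rough sketch of the reload-on-signal idea, written in Python purely for illustration (changeprop itself runs on service-runner, whose actual reload mechanism is not shown here). The JSON file format, the class name and the use of SIGHUP are assumptions; the file would be whatever confd renders from etcd.

```python
import json
import signal


class ConcurrencyConfig:
    """Holds job weights read from a file and re-reads them on demand."""

    def __init__(self, path):
        self.path = path
        self.weights = {}
        self.reload()

    def reload(self, *_signal_args):
        """Re-read the file; keep the previous weights if it is unreadable."""
        try:
            with open(self.path, encoding="utf-8") as fh:
                new_weights = json.load(fh)
        except (OSError, ValueError):
            return False  # missing or malformed file: do not touch the old config
        self.weights = new_weights
        return True

    def install_signal_handler(self):
        # Assumption: SIGHUP is the reload signal (Unix only, main thread only).
        signal.signal(signal.SIGHUP, self.reload)
```

Keeping the old weights when the new file cannot be parsed means a bad confd render would not take concurrency limits away from a running service.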

Event Timeline

Joe created this task. Sep 13 2017, 8:29 AM
Restricted Application added a subscriber: Aklapper. Sep 13 2017, 8:29 AM

Currently ChangeProp indeed only supports concurrency per rule (per job type), and it's hard-coded in the config, so although you don't need a puppet patch to change it, you still need a full deploy of the service.

Internally it's implemented as a cap on the number of promises allowed to be pending at any given moment: basically a fixed-length list of slots where we push a promise when we begin processing and remove it when we finish. So changing the concurrency dynamically is not a big deal.
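
As an illustration of the slot idea (not ChangeProp's actual code, which is JavaScript and promise-based), here is a small asyncio sketch in which the limit can be changed while work is in flight. The class and method names are made up for this example.

```python
import asyncio


class SlotLimiter:
    """Allows at most `limit` tasks to be pending; the limit can change at runtime."""

    def __init__(self, limit):
        self.limit = limit
        self.peak = 0          # highest number of pending tasks seen, for inspection
        self._pending = set()  # the "slots": tasks that have started but not finished

    def set_limit(self, limit):
        # Takes effect on the next submit(); running tasks are never cancelled.
        self.limit = max(1, limit)

    async def submit(self, coro):
        # Block the producer until a slot is free.
        while len(self._pending) >= self.limit:
            await asyncio.wait(self._pending, return_when=asyncio.FIRST_COMPLETED)
        task = asyncio.ensure_future(coro)
        self._pending.add(task)
        task.add_done_callback(self._pending.discard)  # free the slot when done
        self.peak = max(self.peak, len(self._pending))
        return task

    async def drain(self):
        if self._pending:
            await asyncio.wait(self._pending)


async def demo():
    limiter = SlotLimiter(2)
    for _ in range(6):
        await limiter.submit(asyncio.sleep(0.01))
    peak_before = limiter.peak
    limiter.set_limit(4)  # raise concurrency without restarting anything
    for _ in range(8):
        await limiter.submit(asyncio.sleep(0.01))
    await limiter.drain()
    return peak_before, limiter.peak
```

Lowering the limit works the same way: no new work starts until enough pending tasks have finished to get back under the new cap.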

Doing per-wiki concurrency is a bit trickier - we don't partition Kafka topics by wiki, so events for different wikis are interleaved in each topic. If we want to support per-wiki concurrency, we'd probably need per-wiki blocking queues, but then it's not clear what to do once, let's say, you fetch an event for enwiki and fill up the enwiki queue. If you stop fetching events, you might not reach the max concurrency for other wikis; if you continue and the next event happens to be for enwiki as well, you have overfetched. Fetching everything and storing it in memory is not an option either - the backlogs might be so big that we simply run out of memory. Also, by the nature of Kafka, per-wiki concurrency doesn't make a lot of sense unless we partition topics by wiki - we won't be able to use it to clean up the queue, as the offset commits can't skip anything. Whether we should partition by wiki (group of wikis?) is I think a separate question.
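
The offset-commit constraint can be shown with a toy model. This is a simplification of Kafka's semantics (a single partition, and "committed" here means the last offset that is fully done rather than the next offset to read); the function name and the sample topic are invented for the example.

```python
def committable_offset(last_committed, done):
    """Highest offset such that every event up to and including it is finished.

    `done` is the set of offsets already processed. The commit can only
    advance over a contiguous run, so one unfinished event blocks it.
    """
    offset = last_committed
    while offset + 1 in done:
        offset += 1
    return offset


# One unpartitioned topic with events for several wikis interleaved.
topic = ["enwiki", "dewiki", "enwiki", "frwiki", "enwiki", "dewiki"]

# Suppose enwiki is throttled and we only process the other wikis.
done_without_enwiki = {i for i, wiki in enumerate(topic) if wiki != "enwiki"}
```

With nothing committed yet (`last_committed = -1`), `committable_offset(-1, done_without_enwiki)` stays at -1: the very first event is an enwiki one, so skipping ahead to serve other wikis does not shrink the backlog as far as the committed offset is concerned.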

As for fetching from etcd - I think we should add this feature to the service-runner config generically: just declare ETCD("bla") and have it fetch the data when the config is loaded. We already have the ability to do the same with environment variables, so this shouldn't be much more complex.
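
A sketch of what such a declaration could look like when the config is loaded, in Python for illustration only. This is not an existing service-runner feature: the placeholder syntax handling, the `resolve` function and the injected `fetch` callable (standing in for a real etcd client) are all hypothetical.

```python
import re

# Matches a whole-string placeholder of the form ETCD("some/key").
_PLACEHOLDER = re.compile(r'^ETCD\("(.+)"\)$')


def resolve(config, fetch):
    """Return a copy of `config` with ETCD("key") strings replaced by fetch(key).

    `fetch` is any callable mapping a key to a value; a real implementation
    would query etcd here, which this sketch deliberately does not do.
    """
    if isinstance(config, dict):
        return {key: resolve(value, fetch) for key, value in config.items()}
    if isinstance(config, list):
        return [resolve(value, fetch) for value in config]
    if isinstance(config, str):
        match = _PLACEHOLDER.match(config)
        if match:
            return fetch(match.group(1))
    return config
```

With a dict standing in for etcd, `resolve({"concurrency": 'ETCD("bla")'}, {"bla": 30}.get)` returns `{"concurrency": 30}`; values without a placeholder pass through unchanged.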

GWicke lowered the priority of this task from Normal to Low. Sep 13 2017, 5:46 PM
GWicke edited projects, added Services (designing); removed Services.

We briefly discussed this during today's sync meeting. While there are ways to set up targeted processing priorities for specific jobs (by wiki, type, or other criteria), we realized that there will likely be less of a need for this in the new setup. The Redis job queue divides processing throughput evenly between projects. This makes it relatively likely for individual projects to accumulate large backlogs, which would then need manual intervention (re-prioritization) to address.

In the new system, jobs of a specific type are all handled in the same FIFO queue. This avoids building up a long backlog in specific projects at the cost of slowing down the processing of a specific job across all projects.

Overall, we agreed to put this on the backburner for now, and revisit once we have established a need for this in the new system.

fdans moved this task from Incoming to Radar on the Analytics board. Sep 21 2017, 4:33 PM