Page MenuHomePhabricator

Implement concurrency limiting on change propagation & enforce conservative limits and delays
Closed, ResolvedPublic

Description

Kafka client fetches the messages as fast as possible, so the only limit of the execution concurrency is CPU. The problem is that most of the messages in the topic might not match the rule, so processing them is incredibly fast. The commit function in the kafka client implicitly commits the offset of the latest consumed message, thus making the whole commit feature pointless, as by the time an actual execution of some message will finish (or crash), it's almost guaranteed that some newer non-matching messages were already consumed and newer offset was committed.

Instead, we propose to implement the task queue. The task queue internally would be an object, mapping from a message offset to a promise of a task execution.

  1. When the message is received, an executor task is pushed to the queue using a message offset as a key. If the queue size is over the limit, the consumer is paused. If the message doesn't match, we do not push it to the queue, but also do not invoke a commit afterwards.
  2. When execution finishes, and appropriate action was successfully made, either the update was run or a retry was scheduled, or a fatal error sent to the queue, the lowest offset in the queue - 1 is committed (if it was not already committed) and the consumer is resumed.

If, however, the worker crashed without taking an appropriate action, after restart it will continue from the committed offset, which would be the lowest offset of the successfully executed tasks. This will imply some double-processing, but since worker crashes are not really expected, it should be fine.

Event Timeline

We should also significantly increase the retry delays & limit retries in production. Currently, the config uses a delay of only 500ms, and up to five retries (six requests total).

Change 287148 had a related patch set uploaded (by GWicke):
Set conservative retry limits & delays

https://gerrit.wikimedia.org/r/287148

@Pchelolo, the back-off in that patch is nice, but the base delay is still way too low in my opinion. These background updates aren't that time-sensitive, so I think it's okay (and a lot safer) to delay the first retry by at least a minute.

Another patch to make the defaults more conservative is now available at https://github.com/wikimedia/change-propagation/pull/26.

GWicke renamed this task from Implement concurrency limiting on change propagation to Implement concurrency limiting on change propagation & enforce conservative limits and delays.May 5 2016, 11:02 PM

PR #28 partially deals with this issue by implementing concurrency limiting in Change Propagation.

Gerrit 288375 brings a wealth of improvements, such as better handling of retries and more conservative delays for them, no automatic redirects and concurrency limiting.

Change 288376 had a related patch set uploaded (by Mobrovac):
Change-Prop: Limit the number of concurrent tasks

https://gerrit.wikimedia.org/r/288376

Change 288376 merged by Gehel:
Change-Prop: Limit the number of concurrent tasks

https://gerrit.wikimedia.org/r/288376

Change 288580 had a related patch set uploaded (by Mobrovac):
Change Prop: Increase heap limit to 750 MB

https://gerrit.wikimedia.org/r/288580

Change 288580 merged by Giuseppe Lavagetto:
Change Prop: Increase heap limit to 750 MB

https://gerrit.wikimedia.org/r/288580

PR #29 implements the second part, which is committing the offsets of executed messages, not read ones.

Change 288942 had a related patch set uploaded (by Mobrovac):
Update change-propagation to 0571b9e

https://gerrit.wikimedia.org/r/288942

Change 288942 merged by Mobrovac:
Update change-propagation to 0571b9e

https://gerrit.wikimedia.org/r/288942

The feature has been running in production so resolving. Let's track any issues and improvements in separate tasks.

Change 287148 abandoned by Hashar:

[operations/puppet@production] Set conservative retry limits & delays

Reason:

Last update in 2016 and T134456 got marked resolved a few days after this proposed change.

https://gerrit.wikimedia.org/r/287148