
The consumer job of the SUP does not achieve its expected throughput
Closed, ResolvedPublic8 Estimated Story Points

Description

The consumer job (indexer) is not able to keep up with the current update rate despite having the AsyncIO operator configured with a capacity of 100 (i.e., 100 concurrent requests).
Here are the actions taken so far and to be taken (please add more):

  • rule out the elasticsearch sink as the bottleneck using a nullsink
    • writing to /dev/null did not help throughput
  • fix the async http client config to be aligned with the asyncIO capacity
    • helped a bit, but the number of pending requests was still not capped at 25 (the new reduced capacity)
  • switching to UNORDERED did seem to help throughput
  • bumping from 25 to 100 keeping ORDERED did help
  • separating the fetch operators with an ORDERED one for rev_based updates and an UNORDERED one for re-render
    • did not seem to have the expected outcome: throughput remained identical, but a more granular streamGraph shows that the backpressure is caused by rerenders
    • actually this fix was not properly made and ended up with two ORDERED async operators
    • second attempt at: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/93
    • throughput is much higher after deploying the patch above, processing between 600 and 800 records per sec with a capacity of 200 (2 task managers × 100), split 0.85/0.15 between rerenders and rev-based updates
  • bump k8s resources (mem: 2G->3G, cpu: 1->2); there was some suspicion of k8s doing CPU throttling and the JVM doing excessive young GC
    • did not seem to have an impact: CPU throttling and young GC times were reduced, but throughput remained roughly the same
  • It seems plausible that Flink defaults are tuned for relatively small event sizes, but our events are tens of KB, up to several MB. Ran a test of the full pipeline skipping the content merge to see if it affects throughput; the result was that it ran basically the same with the change.
  • better understand how the async http client creates its threadpool (it seems to rely on Runtime.getRuntime().availableProcessors(), which might be incorrect on k8s) and figure out if manually tuning IOReactorConfig.Builder#setIoThreadCount could help.
  • rule out envoy as the bottleneck
    • envoy is configured to have practically no limits (50k concurrent requests allowed)
    • container seems to be heavily cpu throttled (see T353460#9407621)
    • de-activating envoy (talking directly to the mw-api-int endpoint) we saw greater throughput with far fewer retries (envoy being heavily CPU throttled), suggesting that envoy might need some tuning to handle the throughput we expect.
    • see T354517 for further tweaking
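The ORDERED vs UNORDERED distinction above can be sketched with a small, purely illustrative Python model (not the actual Flink operator): under ordered emission, a slow request at the head of the line delays every record behind it, while unordered emission lets fast records through as soon as they complete.

```python
# Hypothetical illustration of ORDERED vs UNORDERED async emission.
# complete[i] is the time at which request i finishes; capacity is
# ignored here for simplicity (assume all requests start at t=0).

def ordered_emit_times(complete):
    # A record can only be emitted once every earlier record has been
    # emitted, so emission time is the running max of completion times.
    emit, running_max = [], 0.0
    for t in complete:
        running_max = max(running_max, t)
        emit.append(running_max)
    return emit

def unordered_emit_times(complete):
    # Records are emitted as soon as their own request completes.
    return list(complete)

# One slow "rerender" fetch (5s) at the head, then fast ones (0.1s):
latencies = [5.0, 0.1, 0.1, 0.1, 0.1]
print(ordered_emit_times(latencies))    # [5.0, 5.0, 5.0, 5.0, 5.0]
print(unordered_emit_times(latencies))  # [5.0, 0.1, 0.1, 0.1, 0.1]
```

This head-of-line blocking is consistent with the observation that the backpressure was caused by the (slower, initially ORDERED) rerender fetches.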

With a capacity of 100 for the async operator, split as 15% for revision-based updates and the rest for rerenders, we should expect to see ~160 rps made to mw-api-async-ro (assuming an avg latency of 0.5s), but we currently only see ~70.
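The expectation above is essentially Little's law (rate ≈ in-flight requests / average latency). A quick sanity check using only the numbers from this task (the inversion below is an assumption: it only holds if the operator is actually saturated at its capacity):

```python
def expected_rps(capacity, avg_latency_s):
    # Little's law: L = lambda * W  =>  lambda = L / W
    return capacity / avg_latency_s

def implied_latency(capacity, observed_rps):
    # Inverting Little's law: if `capacity` requests really are in
    # flight, the average latency consistent with the observed rate
    # is capacity / rate.
    return capacity / observed_rps

print(expected_rps(100, 0.5))    # 200.0 rps ceiling at full capacity
print(implied_latency(100, 70))  # ~1.43 s
```

Read backwards, a saturated capacity of 100 at only ~70 rps would imply an effective average latency of ~1.4s rather than 0.5s, so either latency is higher than assumed or the operator never reaches its nominal capacity.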

AC:

  • throughput limitations are better understood
  • the SUP should have the proper knobs to effectively tune its throughput.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change 983208 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/deployment-charts@master] cirrus-streaming-updater: bump envoy resources

https://gerrit.wikimedia.org/r/983208

Change 983208 merged by jenkins-bot:

[operations/deployment-charts@master] cirrus-streaming-updater: bump envoy resources

https://gerrit.wikimedia.org/r/983208

Bumping envoy resources did help a bit; CPU throttling is reduced (though still somewhat present):

image.png (453×1 px, 125 KB)

RPS can rise to ~160, but it is hard to tell whether this change helped: latencies dropped drastically 30 minutes before the change was deployed:

Capture d’écran du 2023-12-14 18-57-32.png (1×3 px, 550 KB)

Gehel set the point value for this task to 8.Dec 18 2023, 4:26 PM

Change 987973 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[operations/deployment-charts@master] Search update pipeline: bump version

https://gerrit.wikimedia.org/r/987973

Change 987973 merged by jenkins-bot:

[operations/deployment-charts@master] Search update pipeline: bump version

https://gerrit.wikimedia.org/r/987973

@pfischer sent me here with the results from your consumer-devnull tests. We have not done extensive testing with this, but it might help a lot to reduce the concurrency for envoy (which defaults to the number of CPUs on the node; https://wikitech.wikimedia.org/wiki/Kubernetes/Resource_requests_and_limits#envoy).

I would start by setting .Values.mesh.concurrency to 2 (max(ceil(<cpu-limit-in-whole-cpus>), 2)). We've not done explicit testing with that but this actually seems like a good occasion to do so.
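For reference, the suggested sizing rule can be written out as a trivial sketch (`cpu_limit` is the pod's envoy CPU limit in CPUs; the function name here is made up):

```python
import math

def envoy_concurrency(cpu_limit):
    # max(ceil(<cpu-limit-in-whole-cpus>), 2), per the suggestion above
    return max(math.ceil(cpu_limit), 2)

print(envoy_concurrency(1))    # 2
print(envoy_concurrency(2.5))  # 3
print(envoy_concurrency(4))    # 4
```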

Change 988017 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[operations/deployment-charts@master] Search update pipeline: bump version

https://gerrit.wikimedia.org/r/988017

Change 988017 merged by jenkins-bot:

[operations/deployment-charts@master] Search update pipeline: bump version

https://gerrit.wikimedia.org/r/988017

@JMeybohm, thanks! That brought down the throttling.

As of now, it looks like discarding the order of events for re-render updates resolves the throughput issue.

better understand how the async http creates its threadpool

We should at least log the JVM point of view on how many CPUs it sees, to make sure this has the expected value (= 1).
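The JVM call in question is Runtime.getRuntime().availableProcessors(); as a purely illustrative analog (Python rather than the JVM), these are the equivalent probes a process can log. Note that CPU *quotas* (as opposed to cpusets) may not show up in either value, which is why logging from inside the container is worthwhile.

```python
import os

# Total CPUs the OS reports on the node.
print("os.cpu_count():", os.cpu_count())

# CPUs this process may actually be scheduled on (Linux only).
# Under k8s CPU quotas (rather than cpusets) this can still report
# every node CPU, which is exactly the trap described above.
if hasattr(os, "sched_getaffinity"):
    print("sched_getaffinity:", len(os.sched_getaffinity(0)))
```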

Change 988675 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[operations/deployment-charts@master] Search update pipeline: bump version

https://gerrit.wikimedia.org/r/988675

Change 988675 merged by jenkins-bot:

[operations/deployment-charts@master] Search update pipeline: bump version

https://gerrit.wikimedia.org/r/988675

Regarding IOReactorConfig

consumer-devnull (no resource request/limit overrides) correctly gets ioThreadCount=1, see logs:

Async HTTP client I/O config [selectInterval=1 SECONDS, ioThreadCount=1, soTimeout=0 MILLISECONDS, soReuseAddress=false, soLinger=-1 SECONDS, soKeepAlive=false, tcpNoDelay=true, trafficClass=0, sndBufSize=0, rcvBufSize=0, backlogSize=0, socksProxyAddress=null]

consumer-search (resource request/limit overrides app.taskManager.resource.cpu: 2) correctly gets ioThreadCount=2, see logs:

Async HTTP client I/O config [selectInterval=1 SECONDS, ioThreadCount=2, soTimeout=0 MILLISECONDS, soReuseAddress=false, soLinger=-1 SECONDS, soKeepAlive=false, tcpNoDelay=true, trafficClass=0, sndBufSize=0, rcvBufSize=0, backlogSize=0, socksProxyAddress=null]
Gehel claimed this task.