The consumer job (indexer) is not able to keep up with the current update rate despite having the AsyncIO operator configured with a capacity of 100 (i.e. 100 concurrent requests).
Here are the actions taken so far and to be taken (please add more):
- rule out the Elasticsearch sink as the bottleneck using a null sink
  - writing to /dev/null did not improve throughput
- align the async HTTP client config with the AsyncIO capacity
  - helped a bit, but the number of pending requests was still not capped at 25 (the new, reduced capacity)
- switching to UNORDERED did seem to help throughput
- bumping the capacity from 25 to 100 while keeping ORDERED did help
- separating the fetch operators: an ORDERED one for rev-based updates and an UNORDERED one for re-renders
  - did not have the expected outcome: throughput remained identical, but the more granular streamGraph showed that the backpressure is caused by re-renders
  - actually this fix was not properly made and ended up with two ORDERED async operators
  - second attempt: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/93
  - throughput is much higher after deploying the patch above: between 600 and 800 records/sec with a capacity of 200 (2 task managers × 100, split 0.85/0.15 between re-renders and rev-based updates)
- bump k8s resources (mem: 2G→3G, cpu: 1→2); there was some suspicion of k8s CPU throttling and of the JVM doing excessive young GC
  - did not seem to have an impact: CPU throttling and young GC times both decreased, but throughput remained roughly the same
- it seems plausible that Flink defaults are tuned for relatively small event sizes, whereas our events are tens of KB, up to several MB; ran a test of the full pipeline skipping the content merge to see whether it affects throughput
  - test result: the pipeline ran basically the same with the change
- better understand how the async HTTP client creates its thread pool (it seems to rely on Runtime.getRuntime().availableProcessors(), which might be incorrect on k8s) and figure out whether manually tuning IOReactorConfig.Builder#setIoThreadCount could help
  - the JVM correctly picks up cgroup boundaries, see comment
- rule out envoy as the bottleneck
  - envoy is configured with practically no limits (50k concurrent requests allowed)
  - the container seems to be heavily CPU throttled (see T353460#9407621)
  - after de-activating envoy (talking directly to the mw-api-int endpoint) we saw greater throughput with far fewer retries (envoy being heavily CPU throttled); this suggests envoy might need some tuning to handle the throughput we expect
  - see T354517 for further tweaking
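One concrete way to follow up on the thread-pool question above: async HTTP client I/O reactors typically default their thread count to `Runtime.getRuntime().availableProcessors()`, which on a container-aware JVM should reflect the cgroup CPU quota rather than the host's core count. A minimal sketch (pure JDK, no Flink/HttpCore dependency) to verify what the JVM actually sees inside the pod:

```java
public class CpuCheck {
    public static void main(String[] args) {
        // On a container-aware JVM (UseContainerSupport, on by default since
        // JDK 10 / 8u191), this reflects the cgroup CPU quota, not host cores.
        int procs = Runtime.getRuntime().availableProcessors();
        System.out.println("availableProcessors = " + procs);
        // The async HTTP client's I/O reactor defaults its thread count to a
        // value derived from this; IOReactorConfig.Builder#setIoThreadCount
        // can pin it explicitly if the default is wrong for the pod's limit.
    }
}
```

Running this inside the indexer pod and comparing against the k8s CPU limit would confirm (or refute) the "incorrect on k8s" suspicion directly.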
With the async operator's capacity of 100, split 15% for revision-based updates and the rest for re-renders, we should expect to see ~160 rps made to mw-api-async-ro (assuming an avg latency of 0.5s), but we currently only see ~70.
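The expected-rate estimate above is Little's law: sustained request rate = in-flight requests / average latency. A back-of-the-envelope sketch with the numbers from this task (capacity 100, 0.5s average latency, ~70 rps observed) shows how far below the ceiling we are and how many capacity slots are effectively in use:

```java
public class ThroughputCheck {
    public static void main(String[] args) {
        double capacity = 100;      // async operator capacity (in-flight slots)
        double avgLatencySec = 0.5; // assumed average request latency
        double observedRps = 70;    // currently observed request rate

        // Little's law: rate = concurrency / latency
        double ceilingRps = capacity / avgLatencySec;
        // Slots actually kept busy at the observed rate
        double effectiveConcurrency = observedRps * avgLatencySec;

        System.out.printf("theoretical ceiling: %.0f rps%n", ceilingRps);           // 200 rps
        System.out.printf("effective concurrency: %.0f of %.0f slots%n",
                effectiveConcurrency, capacity);                                    // 35 of 100
    }
}
```

If these assumed numbers hold, only ~35 of the 100 permitted slots are busy on average, i.e. the operator is not saturating its allowed concurrency — consistent with a bottleneck upstream of the HTTP call rather than the capacity setting itself.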
AC:
- throughput limitations are better understood
- the SUP should have the proper knobs to effectively tune its throughput.
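For reference on the ORDERED vs UNORDERED observations in the notes above: with ORDERED, one slow request blocks emission of every later record that completes earlier (head-of-line blocking), while UNORDERED emits each result as soon as it completes. A small stand-alone simulation (illustrative only, not Flink code; the latencies are made up):

```java
import java.util.Arrays;

public class EmissionOrder {
    // completionTimes[i] = when the async call for record i finishes (ms)
    static long[] orderedEmitTimes(long[] completionTimes) {
        long[] emit = new long[completionTimes.length];
        long blockedUntil = 0;
        for (int i = 0; i < completionTimes.length; i++) {
            // ORDERED: record i cannot be emitted before all earlier records
            blockedUntil = Math.max(blockedUntil, completionTimes[i]);
            emit[i] = blockedUntil;
        }
        return emit;
    }

    static long[] unorderedEmitTimes(long[] completionTimes) {
        // UNORDERED: each result is emitted the moment its call completes
        return completionTimes.clone();
    }

    public static void main(String[] args) {
        // one slow re-render (800ms) followed by fast rev-based fetches
        long[] completion = {800, 100, 150, 200};
        System.out.println("ordered:   " + Arrays.toString(orderedEmitTimes(completion)));
        // → [800, 800, 800, 800]: everything waits behind the slow call
        System.out.println("unordered: " + Arrays.toString(unorderedEmitTimes(completion)));
        // → [800, 100, 150, 200]: fast results flow through immediately
    }
}
```

This matches the observation that re-renders (the slow, heavy fetches) were the ones causing backpressure when they shared an ORDERED operator with fast rev-based updates.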

