While trying to backfill the articlecountry weighted_tags (T385970) I noticed that the consumer-search flink job was struggling to keep up with the update rate, burning all of our update lag budget.
Looking at the various graphs it appeared that the elasticsearch-sink was the bottleneck causing much of the back-pressure.
While I'm not 100% sure we can trust such graphs, there are other hints that the sink was the bottleneck when looking at envoy telemetry:
The p90 suddenly rises when the backfill is pushing weighted tags.
My understanding is that the elastic sink, being tuned for large payloads, might pack a lot of small weighted tag updates into each bulk request, causing the bulk request time to increase drastically.
I tried capping the number of bulk actions at 100 with a manual deploy:
helmfile -e eqiad --selector name=consumer-search -i apply --set app.config_files.app\\.config\\.yaml.elasticsearch-bulk-flush-max-actions=100
This reduced the request time but sadly did not really improve the throughput.
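A toy model may help explain why capping the bulk size shortens requests without raising throughput. This is only a sketch: it assumes each bulk request costs a fixed overhead plus a per-action cost, and that a fixed number of requests are in flight at once. The function names and all numbers (`overhead_ms`, `per_action_ms`, `in_flight`) are made up for illustration and are not measurements from the SUP or the cluster.

```python
def bulk_request_time_ms(actions: int, overhead_ms: float, per_action_ms: float) -> float:
    """Duration of one bulk request under a linear cost model (assumption)."""
    return overhead_ms + actions * per_action_ms


def throughput_per_sec(actions: int, overhead_ms: float, per_action_ms: float,
                       in_flight: int = 1) -> float:
    """Actions indexed per second with `in_flight` concurrent bulk requests."""
    request_ms = bulk_request_time_ms(actions, overhead_ms, per_action_ms)
    return in_flight * actions * 1000.0 / request_ms


if __name__ == "__main__":
    # Hypothetical costs, chosen only to illustrate the shape of the trade-off.
    overhead_ms, per_action_ms = 50.0, 5.0
    for actions in (1000, 100):
        print(
            f"max-actions={actions}: "
            f"request={bulk_request_time_ms(actions, overhead_ms, per_action_ms):.0f} ms, "
            f"throughput={throughput_per_sec(actions, overhead_ms, per_action_ms):.0f} actions/sec"
        )
```

Under these assumptions the request time shrinks roughly in proportion to the bulk size, while throughput stays bounded by the per-action cost. In such a model only more concurrent requests or a cheaper per-action cost would raise throughput.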
My impression is that we should be able to push more: the mjolnir bulk loader seems to be able to import up to 1k tags/sec. It might be faster since it creates bulk operations on a single index, but still, an additional throughput of 40 tags/sec caused the SUP to lag behind.
One possible improvement (not really related to the elastic sink): weighted tag updates appear to flow through the ordered AsyncIO operator, so we might gain some fluidity by letting them skip the fetch operator.
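To illustrate why the ordered AsyncIO operator could hold weighted tag updates back: an ordered async operator emits results in input order, so an element that needs no fetch still waits for every slower element ahead of it. The sketch below is a simplified simulation, not the SUP code. The event names and latencies are hypothetical, and it ignores operator capacity and timeouts.

```python
from itertools import accumulate


def ordered_emit_times(completion_ms):
    """Emission time per element when results must leave in arrival order.

    An element cannot be emitted before every earlier element has completed,
    so its emission time is the running maximum of completion times.
    """
    return list(accumulate(completion_ms, max))


def bypass_emit_times(events):
    """Emission times if weighted tag updates skip the fetch operator.

    Only elements that really need a fetch go through the ordered operator;
    tag updates are emitted as soon as they arrive (time 0 in this sketch).
    """
    fetch_times = iter(ordered_emit_times(
        [ms for kind, ms in events if kind != "weighted_tag"]
    ))
    return [0 if kind == "weighted_tag" else next(fetch_times) for kind, _ in events]


if __name__ == "__main__":
    # Hypothetical stream: (event kind, async completion time in ms).
    events = [
        ("rev_based_update", 120),
        ("weighted_tag", 0),
        ("weighted_tag", 0),
        ("rev_based_update", 80),
        ("weighted_tag", 0),
    ]
    print("ordered:", ordered_emit_times([ms for _, ms in events]))
    print("bypass: ", bypass_emit_times(events))
```

In this simulation every tag update behind a slow fetch inherits that fetch's latency when ordered, and leaves immediately when bypassing. Whether reordering tag updates relative to other updates is acceptable for the pipeline is a separate question this sketch does not address.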
We could also try to increase the parallelism of the job. We currently run with a parallelism of 2; perhaps this is not enough?
AC:
- try to understand where the bottleneck is and improve the throughput of the pipeline


