Scale up the SUP
Closed, Resolved · Public · 3 Estimated Story Points

Description

When trying to backfill the articlecountry weighted_tags (T385970), I noticed that the consumer-search Flink job was struggling to keep up with the update rate, burning all of our update lag budget.

image.png (558×1 px, 65 KB)

Looking at the various graphs it appeared that the elasticsearch-sink was the bottleneck causing much of the back-pressure.
image.png (891×1 px, 642 KB)

While I'm not 100% sure we can trust such a graph, there are other hints that the sink was the bottleneck when looking at the envoy telemetry:
image.png (547×3 px, 290 KB)

The p90 suddenly rises when the backfill is pushing weighted tags.
My understanding is that the elastic sink, being tuned for large payloads, might send a lot of small weighted tag updates in a single bulk, causing the bulk request time to increase drastically.
I did try to force a limit on the number of bulk actions to 100 with a manual deploy:
helmfile -e eqiad --selector name=consumer-search -i apply --set app.config_files.app\\.config\\.yaml.elasticsearch-bulk-flush-max-actions=100
This had an impact on the request time but not really on the throughput sadly.
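
To illustrate what capping the number of bulk actions changes, here is a minimal sketch of a buffer that flushes on either a payload-size threshold or an action count. The class, the function, and the 200-byte action size and 100 KB size limit are hypothetical and are not taken from the SUP or from the Flink Elasticsearch connector; only the cap of 100 mirrors the value tried above.

```python
from dataclasses import dataclass, field


@dataclass
class BulkBuffer:
    """Hypothetical bulk buffer that flushes on action count or payload size."""

    max_actions: int
    max_bytes: int
    _pending: list = field(default_factory=list)
    _pending_bytes: int = 0
    flushed: list = field(default_factory=list)  # number of actions per bulk request

    def add(self, action: bytes) -> None:
        self._pending.append(action)
        self._pending_bytes += len(action)
        # Flush as soon as either limit is reached.
        if len(self._pending) >= self.max_actions or self._pending_bytes >= self.max_bytes:
            self.flush()

    def flush(self) -> None:
        if self._pending:
            self.flushed.append(len(self._pending))
            self._pending = []
            self._pending_bytes = 0


def bulk_sizes(n_actions: int, action_size: int, max_actions: int, max_bytes: int) -> list:
    """Return the number of actions in each bulk request that would be sent."""
    buf = BulkBuffer(max_actions=max_actions, max_bytes=max_bytes)
    for _ in range(n_actions):
        buf.add(b"x" * action_size)
    buf.flush()  # send whatever is left
    return buf.flushed


# Assumed figures: 1000 small updates of 200 bytes each, 100 KB size limit.
size_only = bulk_sizes(1000, 200, max_actions=10**9, max_bytes=100_000)
capped = bulk_sizes(1000, 200, max_actions=100, max_bytes=100_000)
```

Under these assumed figures, a buffer limited only by size packs 500 small actions into each bulk, while the cap of 100 produces ten smaller bulks. The total number of actions sent is unchanged, which would be consistent with the request time improving while the throughput does not.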

My impression is that we should be able to push more: the mjolnir bulk loader seems to be able to import up to 1k tags/sec. It might be faster since it creates bulk operations on a single index, but still, an additional throughput of 40 tags/sec caused the SUP to lag behind.

One possible improvement (not really related to the elastic sink): weighted tag updates appear to flow through the ordered AsyncIO operator, so we might gain some fluidity by having them skip the fetch operator.
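
A rough way to see why an ordered async operator can hurt here: results are emitted in input order, so a fast record queued behind a slow fetch has to wait for it. The sketch below is a simplified model, not the SUP's code. It assumes all requests start at time 0 with unlimited in-flight capacity, and the latencies are made-up numbers (one slow page fetch followed by three tag updates that need almost no work).

```python
from itertools import accumulate


def emit_times_ordered(latencies):
    """Emission time per record when results must leave in input order.

    A record can only be emitted once every earlier request has completed,
    so its emission time is the running maximum of the latencies so far.
    """
    return list(accumulate(latencies, max))


def emit_times_unordered(latencies):
    """Emission time per record when results leave as soon as they complete."""
    return list(latencies)


# Hypothetical latencies in ms: one slow fetch, then three cheap tag updates.
latencies_ms = [50, 1, 1, 1]
ordered = emit_times_ordered(latencies_ms)
unordered = emit_times_unordered(latencies_ms)
```

In this model the three cheap updates are held until the slow fetch completes when ordering is preserved, whereas they would leave almost immediately if they bypassed the ordered operator.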

We could also try to increase the parallelism of the job. We currently run with a parallelism of 2; perhaps this is not enough?

AC:

  • try to understand where the bottleneck is and improve the throughput of the pipeline

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change #1121621 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/deployment-charts@master] cirrus-streaming-updater: scale up the consumer-search job

https://gerrit.wikimedia.org/r/1121621

Change #1121621 merged by jenkins-bot:

[operations/deployment-charts@master] cirrus-streaming-updater: scale up the consumer-search job

https://gerrit.wikimedia.org/r/1121621

Gehel set the point value for this task to 3.

Change #1124166 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/deployment-charts@master] cirrus-streaming-updater: scale up consumer-cloudelastic

https://gerrit.wikimedia.org/r/1124166

Change #1124166 merged by jenkins-bot:

[operations/deployment-charts@master] cirrus-streaming-updater: scale up consumer-cloudelastic

https://gerrit.wikimedia.org/r/1124166

Can we consider this done? All three consumers are showing 3 taskmanagers, but I'm not sure we've verified that this resolves the backlogging problem.