Page MenuHomePhabricator

HTML Pipeline - Performance improvements
Open, Needs TriagePublic

Description

After some discussions, we have decided to implement the following improvements to our HTML pipeline:

  • Avoid enriching non html wiki content models.
  • Avoid envoy retries (there's a header)
  • Async for main and parent calls (and avoid 2 Flink process functions)
    • This helped.
  • Use asyncio, Don't use requests
  • Try: https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/deployment/config/#pipeline-object-reuse
    • tried, but having issues with it.
    • We could do some work in eventutilities-python to make sure the correct RowTypeInfo is being used, but on second thought, pipeline-object-reuse might not help pyflink much anyway.
  • Avoid too many task managers (slower checkpoints, more buffers, everything more complicated).
    • This helped. We still have a lot of TMs, but reducing them and increasing numberOfTaskSlots does seem to help.
  • Try buffer debloating
    • This helped.
  • Try unaligned checkpointing.
    • We think this helped, but aren't sure. We don't keep (much) state.
    • However, see T421216#11834556. Do unaligned checkpoints even work if the process function is blocking?
  • Try pyflink thread mode ?
  • Try JEMALLOC again. On Java 17 maybe better?
    • Tried, but not sure if it was better. Update: We ended up enabling jemalloc, it fixes a memory leak issue.
  • Try sync mode with python bundle size = 1.
    • This helped a lot. Applying backpressure as early as possible and allowing process functions to finish as quickly as possible (not blocking on the full batch) allowed checkpoints to progress better.
  • Use async ordered batch.
  • Implement async batch ordering for keys only
    • E.g. group each batch by key, and if all keys have been resolved, the results can be yielded asap without waiting for the entire batch.
  • Use fully unordered async
    • We'd prefer not to do this.
  • Reduce checkpoint intervals
    • This helped a lot. The more checkpoints, the more aligning on checkpoint barriers the subtasks have to do, the slower the pipeline gets. These seems to add up over time.

Event Timeline

Change #1270548 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] mw-page-html-content-change-enrich-next - try sync mode

https://gerrit.wikimedia.org/r/1270548

Change #1270548 merged by Ottomata:

[operations/deployment-charts@master] mw-page-html-content-change-enrich-next - try sync mode

https://gerrit.wikimedia.org/r/1270548

Avoid envoy retries (there's a header)

Quickly reading T354517: Search Update Pipeline: HTTP client/proxy config and https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/router_filter#id5,

It looks like a combo of x-envoy-retry-on=retriable-status-codes and x-envoy-retriable-status-codes without 504 will avoid envoy retrying on 504s?

We might want to check with @JMeybohm.

Avoid envoy retries (there's a header)

Quickly reading T354517: Search Update Pipeline: HTTP client/proxy config and https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/router_filter#id5,

It looks like a combo of x-envoy-retry-on=retriable-status-codes and x-envoy-retriable-status-codes without 504 will avoid envoy retrying on 504s?

We might want to check with @JMeybohm.

From what I recall there are certain limitations on the headers that can currently be used to reconfigure envoy's behavior from client side since T354853: Service mesh envoy does not treat incoming connections as local. As a workaround an additional service-mesh listener with the expected retry policy could be created/used.

From what I recall there are certain limitations on the headers that can currently be used to reconfigure envoy's behavior from client side since T354853: Service mesh envoy does not treat incoming connections as local. As a workaround an additional service-mesh listener with the expected retry policy could be created/used.

Indeed I believe that we had to hack the request with x-forwarded-for: 127.0.0.1 to make this work as expected.
Aligning the timeout with what's expected by the client might be interesting as well, see the link above with the x-envoy-upstream-rq-timeout-ms header.

Change #1272843 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] html-enrich - bump to v1.51.0 and apply some flink tuning

https://gerrit.wikimedia.org/r/1272843

Change #1272843 merged by Ottomata:

[operations/deployment-charts@master] html-enrich - bump to v1.51.0 and apply some flink tuning

https://gerrit.wikimedia.org/r/1272843

Thanks @JMeybohm and @dcausse. I'd like to try this!

So, IIUC we should:

  • set x-envoy-max-retries: 0
  • set x-envoy-upstream-rq-timeout-ms to our client timeout - 100ms (ENVOY_OVERHEAD_MS) ?
  • set x-forwarded-for: 127.0.0.1

Is that correct?

Is there a way to the effect? Somewhere and grafana that shows envoy retries for our service? It'd be nice to see the effect explicitly rather than just hoping. :)

(QQ: which way is upstream and downstream? I would expect that requests that our service makes would be downstream, no?)

This comment was removed by Ottomata.
This comment was removed by Ottomata.

Hm, well, it isn't really better, but it isn't worse? I'm going to just reduce batch size to 2 just to try it. I'll go back up to task slots = 4 so parallelism = 80.

helmfile apply -e dse-k8s-eqiad \
  --set app.version=v1.52.0 \
  --set app.taskManager.replicas=20 \
  --set app.taskManager.resource.memory=6000m \
  --set app.flinkConfiguration.taskmanager\\.numberOfTaskSlots=4 \
  --set app.flinkConfiguration.execution\\.checkpointing\\.timeout=10m \
  --set app.flinkConfiguration.execution\\.checkpointing\\.interval=10m \
  --set app.flinkConfiguration.execution\\.checkpointing\\.interval-during-backlog=10m \
  --set app.flinkConfiguration.execution\\.checkpointing\\.min-pause=30000 \
  --set app.flinkConfiguration.execution\\.checkpointing\\.tolerable-failed-checkpoints=10 \
  --set app.flinkConfiguration.kubernetes\\.operator\\.periodic\\.savepoint\\.interval=0 \
  --set app.flinkConfiguration.python\\.fn-execution\\.bundle\\.size=1 \
  --set app.config_files.app\\.config\\.yaml.stream_manager.process_async_enabled_default=true \
    --set app.config_files.app\\.config\\.yaml.stream_manager.process_max_workers_default=2 \
  --set app.config_files.app\\.config\\.yaml.stream_manager.process_batch_size_default=2

Ah, I have been posting my recent updates on the wrong ticket! I meant to post them on T421216: HTML Enrichment - Tuning & Backfilling configuration. I'll move them there.

Thanks @JMeybohm and @dcausse. I'd like to try this!

So, IIUC we should:

  • set x-envoy-max-retries: 0
  • set x-envoy-upstream-rq-timeout-ms to our client timeout - 100ms (ENVOY_OVERHEAD_MS) ?
  • set x-forwarded-for: 127.0.0.1

Is that correct?

Is there a way to the effect? Somewhere and grafana that shows envoy retries for our service? It'd be nice to see the effect explicitly rather than just hoping. :)

For the retries I would expect to only see "limit_exceeded" in the upstream request retry panel from the envoy telemetry, regarding timeouts I'm not so sure but possibly you should rarely see requests timeouts from your app code but only 504 upstream response timeout?

(QQ: which way is upstream and downstream? I would expect that requests that our service makes would be downstream, no?)

See: https://www.envoyproxy.io/docs/envoy/latest/intro/life_of_a_request#terminology

The pipeline is finally working fine, and it is able to backfill data if needed.

Do you think we can close this task @Ottomata ? I know we could still try other improvements, but I don't think we are going to spend time on them.

Hm.

Yeah, I'd like to try the keyed async batch idea. It may really help throughput during backfills. It also might help reduce overall resource usage.

OTOH, if we wait long enough we might just be able to use pyflink async functions in Flink 2.2, so maybe that is a better thing to spend our future time on.

I also still wonder if we need to implement the overall timeout idea, rather than just relying on the envoy timeout. I suppose we've accomplished approximately the same thing: envoy timeout is 60 seconds, and we don't retry on 504 timeout.

Okay, yes we can close it.