After some discussions, we have decided to implement the following improvements to our HTML pipeline:
- Avoid enriching non-HTML wiki content models.
- Avoid Envoy retries (there's a header for this).
- Use async for the main and parent calls (and avoid having 2 Flink process functions).
  - This helped.
- Use asyncio; don't use requests.
- Try pipeline object reuse: https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/deployment/config/#pipeline-object-reuse
  - Tried, but we are having issues with it.
  - We could do some work in eventutilities-python to make sure the correct RowTypeInfo is being used, but on second thought, pipeline-object-reuse might not help pyflink much anyway.
- Avoid too many task managers (slower checkpoints, more buffers, everything more complicated).
  - This helped. We still have a lot of TMs, but reducing them and increasing numberOfTaskSlots does seem to help.
- Try buffer debloating
  - This helped.
- Try unaligned checkpointing.
  - We think this helped, but aren't sure. We don't keep (much) state.
  - However, see T421216#11834556. Do unaligned checkpoints even work if the process function is blocking?
- Try pyflink thread mode?
  - Tried, but encountered what may be a bug: T421216#11833754
- Try jemalloc again. It may work better on Java 17.
- Try sync mode with python bundle size = 1.
  - This helped a lot. Applying backpressure as early as possible and allowing process functions to finish as quickly as possible (not blocking on the full batch) allowed checkpoints to progress better.
- Use async ordered batch.
  - This only seems to make things slightly worse.
- Implement async batch ordering for keys only
  - E.g. group each batch by key; once all items for a key have been resolved, their results can be yielded right away, without waiting for the entire batch.
- Use fully unordered async
  - We'd prefer not to do this.
- Reduce checkpoint frequency (i.e. use longer checkpoint intervals)
  - This helped a lot. The more checkpoints there are, the more aligning on checkpoint barriers the subtasks have to do, and the slower the pipeline gets. This seems to add up over time.
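
For the "async batch ordering for keys only" idea above, here is a minimal sketch in plain asyncio. It is not a pyflink API and not what the pipeline currently does. The names `yield_per_key`, `key_of`, `enrich` and `fake_enrich` are hypothetical, and it assumes "resolved" means every item sharing a key has finished its async call. Order is kept within a key; key groups are emitted in whatever order they complete.

```python
import asyncio
from collections import defaultdict
from typing import AsyncIterator, Awaitable, Callable, Hashable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def yield_per_key(
    batch: Iterable[T],
    key_of: Callable[[T], Hashable],
    enrich: Callable[[T], Awaitable[R]],
) -> AsyncIterator[R]:
    """Yield enriched results one key group at a time, as soon as a group is done.

    Hypothetical helper: input order is preserved within a key, but there is
    no ordering guarantee across different keys.
    """
    # Group the batch by key, keeping the input order inside each group.
    groups = defaultdict(list)
    for item in batch:
        groups[key_of(item)].append(item)

    async def resolve_group(items):
        # gather() runs the calls concurrently and returns results in
        # argument order, so per-key order survives.
        return await asyncio.gather(*(enrich(item) for item in items))

    # One task per key group; all groups run concurrently.
    tasks = [asyncio.create_task(resolve_group(items)) for items in groups.values()]

    # Emit each group as soon as it completes, without waiting for the
    # slowest key in the batch.
    for finished in asyncio.as_completed(tasks):
        for result in await finished:
            yield result


async def fake_enrich(event: dict) -> tuple:
    # Stand-in for the real async HTTP call (assumption for illustration).
    await asyncio.sleep(event["delay"])
    return (event["page_id"], event["rev"])
```

A caller would consume it with `async for result in yield_per_key(batch, key_of=lambda e: e["page_id"], enrich=fake_enrich)`. Compared to the fully ordered batch, one slow key only delays its own results; compared to fully unordered async, events for the same key still come out in input order.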