Our HTML enrichment pipeline works well with the regular streaming data flow, but it does not behave as expected when it has been stopped for a few days or when we need to reprocess old data. It processes messages only a bit faster than current traffic arrives, so backfilling a few days of data might itself take days.
The regular deployment runs a single TaskManager with 6 GB of memory and 2 CPUs.
We have tried several other configurations and ran into a different issue with each.
Tests done so far:
- 20 TaskManagers, 6 GB and 2 CPUs each:
  - Result: Works a bit faster than the regular deployment, but not 20 times faster.
- 75 TaskManagers, 1.5 GB and 1 CPU each:
  - Result: Works a few times faster than the 20-TaskManager run, but eventually fails, probably due to OOM.
- 55 TaskManagers, 2 GB and 1 CPU each:
  - Result: Works faster than the 20-TaskManager run and a bit slower than the 75-TaskManager run, but eventually it also fails, probably due to OOM (not confirmed).
- Emulated 'sync' mode for the HTTP API enrichment by setting batch_size=1 and max_workers=1, while increasing parallelism and memory:
  - Result: Works better, but eventually fails too. We are not yet sure why.
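To make the last test easier to reason about, here is a minimal sketch of how batch_size and max_workers could interact. It assumes the enrichment step collects messages into batches and fetches each batch through a thread pool; that is an assumption about the implementation, and `batched`, `enrich_all` and `fetch` are hypothetical names, not the pipeline's real functions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most batch_size items."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def enrich_all(
    items: Iterable[str],
    fetch: Callable[[str], str],  # hypothetical stand-in for the HTTP call
    batch_size: int,
    max_workers: int,
) -> List[str]:
    """Enrich items batch by batch, with at most max_workers calls in flight."""
    results: List[str] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for batch in batched(items, batch_size):
            # pool.map keeps input order; consuming it waits for the whole batch.
            results.extend(pool.map(fetch, batch))
    return results
```

Under this assumption, batch_size=1 and max_workers=1 means each operator instance has exactly one request in flight, so all concurrency has to come from Flink parallelism, and each instance holds at most one message payload at a time.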
We know that many HTML API requests might take 10 to 15 seconds, so a lot of the processing time is spent waiting on I/O. We also know that some messages are large: more than 10 MB, and in some cases close to 20 MB.
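These two facts can be combined into a rough back-of-the-envelope estimate. The sketch below uses Little's law (requests in flight = throughput × latency) and a worst-case payload size. The target throughput of 10 messages per second is a made-up number for illustration, not a measurement; only the 15-second latency and the 20 MB message size come from the observations above.

```python
import math

MB = 1024 * 1024


def required_concurrency(target_msgs_per_s: int, latency_s: int) -> int:
    """Little's law: requests in flight = throughput * latency."""
    return math.ceil(target_msgs_per_s * latency_s)


def worst_case_inflight_bytes(concurrency: int, max_msg_bytes: int) -> int:
    """Upper bound on payload bytes held while requests are in flight."""
    return concurrency * max_msg_bytes


# Hypothetical backfill target: 10 msg/s. Latency and size are the worst cases above.
concurrency = required_concurrency(target_msgs_per_s=10, latency_s=15)
payload_bytes = worst_case_inflight_bytes(concurrency, 20 * MB)

print(f"requests in flight: {concurrency}")
print(f"worst-case payload in flight: {payload_bytes // MB} MB")
```

This counts a single copy of each payload only. Serialization buffers and the enriched output would add to it, and only part of a TaskManager's configured memory is available as heap, so the real requirement is likely higher. An estimate like this may help explain why many small TaskManagers fail with OOM even though they run faster.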
We should test other configurations to find the best one for backfilling data when needed.







