Goal
To improve throughput of real-time pipelines, we should consider making enrichment function asynchronous. Initial tuning work (https://phabricator.wikimedia.org/project/view/6387/) suggests that:
- aiohttp/asyncaio co-courtines seem to be non pickable.
- Requests-futures (concurrent.futures based) also does not play nice with beam’s ThreadPool/pickling either.
TNG suggested to perform manual microbatching/windowing, then dispatching the entire batch/window together into the chosen async construction and forward the results once everything is collected.
The open method in the python MapFunction to create an operator-local ThreadPool and submit in the .flat_map() the http call to the ThreadPool; block until the call finishes and use the plain result (not the future) as output. This avoids the issue of pickling since this happens all locally in the operator without serde.
In the simplest form, where we do a microbatch and wait for all HTTP reqs on the microbatch to finish, the checkpoint marker only proceeds once the batch is done. This is safe (no data lost on crash) and retries (at least) the current microbatch.
Success criteria
- we should validate the approach works as expected.
- mediawiki-event-enrichment has been tuned and works reliably on YARN and DSE k8s
- changes specific to mediawiki-event-enrichment have been upstreamed to eventutilities-python and documented in the design doc.
- mediawiki_page_content_change enrichment is deployed and running in DSE