A design document is in progress, and many decisions need to be made about the next Dumps architecture. Some of those decisions are hard to make without experimenting a little with Flink to see what it can and can't do in terms of scale and developer friendliness.
Update
The Flink prototype has been stuck for a while on a confusing pom configuration issue involving Java service providers and registering everything needed for the Kafka -> Iceberg pipeline. Code from a few different iterations is available on Dan's GitHub. Breaking this down into some more exploration to try to solve the problem by walking around it:
- write the prototype in Spark Streaming (already filed as a separate task: T322326: Prototype Spark Streaming Job for Content Dumps)
- try the Maven wrapper plugin and get advice from Guillaume
- try writing from Kafka to plain HDFS (already done, but playing with it more may shed some light)
- try writing to Iceberg backed by the local filesystem (probably the same issue of including the fs.impl)
- talk more with the Flink Slack community (follow up on their suggestions)
- try Gradle instead of Maven (this example is the closest I've found to what we're facing, and they use Gradle) - canceled (got advice that it would hit the same issue as Maven)
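For the record, the usual culprit with Java service providers in a shaded fat jar is that the `META-INF/services` files from different jars overwrite each other, so only one factory survives. A sketch of the standard maven-shade-plugin workaround, in case it helps whoever picks this up (version number is illustrative; this is the generic fix, not a confirmed solution to our specific pom):

```xml
<!-- Sketch of a pom.xml build plugin entry. The ServicesResourceTransformer
     merges META-INF/services entries from all dependency jars instead of
     letting one jar's file clobber another's, which is how connector and
     FileSystem factories usually go missing in a shaded jar. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.4.1</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <!-- concatenate service provider files rather than overwrite -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
```

If the fs.impl problem in the local-filesystem bullet above is the same class of failure, this transformer (or its Gradle Shadow equivalent, `mergeServiceFiles()`) is worth ruling out first.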