As a maintainer of the WDQS streaming updater, I want to understand why the checkpoint _metadata file has grown to 70 MB (which requires bumping Flink memory limits) so that I can prevent it from happening again.
The Flink application died around 2021-07-17T13:20:00, the time at which a switch in row A (codfw) failed.
It is unclear whether the growth of the _metadata file is related to the failure, or whether it only prevented the pipeline from restarting (the default akka.framesize being too small).
A copy of the checkpoint _metadata file has been kept in stat1004:/home/dcausse/flink-1.12.1-wdqs/wdqs_streaming_updater/checkpoints/b4d1cd3eb1ab4002a63b7c229a8c3542/chk-140815
The pipeline was able to restart after raising akka.framesize to 100 MB and giving Flink more heap.
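
For illustration, here is a rough sketch of why the restart may have needed the larger frame size. It assumes (unconfirmed) that the whole _metadata payload has to fit in a single akka frame, that "70m" means roughly 70 MiB, and that the default akka.framesize is the documented 10485760b. The helper names `fits_in_frame` and `flink_conf_line` are hypothetical and not part of Flink.

```python
# Sketch only: compares the observed _metadata size with akka.framesize values.
# Assumption: the metadata must fit in one akka frame (not confirmed in this task).

MIB = 1024 * 1024

DEFAULT_FRAMESIZE = 10 * MIB   # Flink's documented default, 10485760b
TUNED_FRAMESIZE = 100 * MIB    # value used to get the pipeline restarted
METADATA_SIZE = 70 * MIB       # approximate size of the checkpoint _metadata


def fits_in_frame(payload_bytes: int, framesize_bytes: int) -> bool:
    """Return True if a payload of this size fits within the given frame size."""
    return payload_bytes <= framesize_bytes


def flink_conf_line(framesize_bytes: int) -> str:
    """Format a frame size as a flink-conf.yaml entry (value in bytes, 'b' suffix)."""
    return f"akka.framesize: {framesize_bytes}b"


if __name__ == "__main__":
    for label, size in (("default", DEFAULT_FRAMESIZE), ("tuned", TUNED_FRAMESIZE)):
        verdict = "fits" if fits_in_frame(METADATA_SIZE, size) else "does not fit"
        print(f"{label}: {flink_conf_line(size)} -> 70 MiB metadata {verdict}")
```

Under these assumptions the 70 MiB file exceeds the default frame size but fits within the tuned one, which matches the observed behaviour. It does not explain why the file grew in the first place.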
A savepoint was then taken, but it produced an even bigger _metadata file (800 MB). It is available at swift://updater.thanos-swift/wdqs_streaming_updater/savepoints/savepoint-f6a960-fdd300f4e05b.
AC:
- understand what caused the growth of the _metadata file
- fix the underlying issue