Checkpoint _metadata has grown up to 70Mb
Closed, Declined | Public | 5 Estimated Story Points

Description

As a maintainer of the wdqs streaming updater, I want to understand why the checkpoint _metadata file has grown to 70Mb (which requires bumping flink memory limits) so that I can prevent it from happening again.

The flink application died around 2021-07-17T13:20:00, at which time a switch in row A (codfw) died.

It is unclear whether the growth of the _metadata file is related to the failure, or whether it prevented the restart of the pipeline (default akka.framesize too small).

A copy of the checkpoint _metadata file has been kept in stat1004:/home/dcausse/flink-1.12.1-wdqs/wdqs_streaming_updater/checkpoints/b4d1cd3eb1ab4002a63b7c229a8c3542/chk-140815

The pipeline was able to restart after tuning akka.framesize to 100Mb and giving more heap.
A savepoint was then taken, but it produced an even bigger metadata file (800Mb). It's available at swift://updater.thanos-swift/wdqs_streaming_updater/savepoints/savepoint-f6a960-fdd300f4e05b.

AC:

  • understand what caused the growth of the _metadata file
  • fix the underlying issue

Event Timeline

Change 721812 had a related patch set uploaded (by DCausse; author: DCausse):

[wikidata/query/rdf@master] cleanup: Remove needless file sink outputs

https://gerrit.wikimedia.org/r/721812

I analyzed the large _metadata file: it has 3 operators with very large states, especially max-part-counter, which is owned by StreamingFileSink. This state is cleared when org.apache.flink.streaming.api.checkpoint.CheckpointedFunction#snapshotState is called, and that call is triggered on a SinkFunction only when it is marked with this interface. The sole file sinks we used were the side outputs, back when we stored them in hdfs. Prior to https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/695295 we wrapped the file sink and the kafka sink with a generic SinkFunction to unify the serialization; this caused code relying on instanceof CheckpointedFunction to not work properly. It is very likely that this broken _metadata was generated because of this.
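To illustrate the suspected mechanism, here is a minimal sketch in plain Java. It does not use Flink: SinkFunction and CheckpointedFunction below are simplified, hypothetical stand-ins for the real interfaces (the real snapshotState takes a context argument), and FileLikeSink, PlainWrapper, ForwardingWrapper and checkpoint are names invented for this example. It only shows how a wrapper that implements SinkFunction alone hides the inner sink from an instanceof CheckpointedFunction check, so the inner state is never cleared.

```java
import java.util.ArrayList;
import java.util.List;

class WrappedSinkDemo {
    // Simplified stand-ins for the Flink interfaces (assumption: not the real signatures).
    interface SinkFunction<T> { void invoke(T value); }
    interface CheckpointedFunction { void snapshotState(); }

    // A sink that accumulates state and clears it only when snapshotState is called.
    static class FileLikeSink implements SinkFunction<String>, CheckpointedFunction {
        final List<String> trackedState = new ArrayList<>();
        @Override public void invoke(String value) { trackedState.add(value); }
        @Override public void snapshotState() { trackedState.clear(); }
    }

    // Generic wrapper that only implements SinkFunction: the checkpoint hook is lost.
    static class PlainWrapper<T> implements SinkFunction<T> {
        private final SinkFunction<T> delegate;
        PlainWrapper(SinkFunction<T> delegate) { this.delegate = delegate; }
        @Override public void invoke(T value) { delegate.invoke(value); }
    }

    // Wrapper that also implements CheckpointedFunction and forwards the call.
    static class ForwardingWrapper<T> implements SinkFunction<T>, CheckpointedFunction {
        private final SinkFunction<T> delegate;
        ForwardingWrapper(SinkFunction<T> delegate) { this.delegate = delegate; }
        @Override public void invoke(T value) { delegate.invoke(value); }
        @Override public void snapshotState() {
            if (delegate instanceof CheckpointedFunction) {
                ((CheckpointedFunction) delegate).snapshotState();
            }
        }
    }

    // Mimics a runtime that snapshots only functions marked with CheckpointedFunction.
    static void checkpoint(SinkFunction<?> sink) {
        if (sink instanceof CheckpointedFunction) {
            ((CheckpointedFunction) sink).snapshotState();
        }
    }

    // Writes `records` values through the chosen wrapper, runs one checkpoint,
    // and returns how many entries remain in the inner sink's state.
    static int stateSizeAfterCheckpoint(boolean forwarding, int records) {
        FileLikeSink inner = new FileLikeSink();
        SinkFunction<String> sink;
        if (forwarding) {
            sink = new ForwardingWrapper<String>(inner);
        } else {
            sink = new PlainWrapper<String>(inner);
        }
        for (int i = 0; i < records; i++) {
            sink.invoke("record-" + i);
        }
        checkpoint(sink);
        return inner.trackedState.size();
    }

    public static void main(String[] args) {
        System.out.println("plain wrapper: " + stateSizeAfterCheckpoint(false, 3));
        System.out.println("forwarding wrapper: " + stateSizeAfterCheckpoint(true, 3));
    }
}
```

With the plain wrapper the inner state keeps all 3 entries after the checkpoint; with the forwarding wrapper it is emptied. This is only a model of the instanceof behaviour, not a reproduction of what StreamingFileSink actually stores in max-part-counter.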

I propose to decline this and clean up the code to remove all the file sinks, since we do not need them anymore (even if I think they no longer cause issues).

Change 721812 merged by jenkins-bot:

[wikidata/query/rdf@master] cleanup: Remove needless file sink outputs

https://gerrit.wikimedia.org/r/721812