Checkpoint _metadata has grown up to 70Mb
Closed, Declined · Public · 5 Estimated Story Points
Assigned To

Authored By

	dcausse
	Jul 19 2021, 9:51 AM

Description

As a maintainer of the WDQS streaming updater, I want to understand why the checkpoint _metadata file has grown to 70 MB (which requires bumping Flink memory limits) so that I can prevent it from happening again.

The Flink application died around 2021-07-17T13:20:00, at the time a switch in row A (codfw) died.

It is unclear whether the growth of the _metadata file is related to the failure, or whether it prevented the pipeline from restarting (default akka.framesize too small).

A copy of the checkpoint _metadata file has been kept at stat1004:/home/dcausse/flink-1.12.1-wdqs/wdqs_streaming_updater/checkpoints/b4d1cd3eb1ab4002a63b7c229a8c3542/chk-140815

The pipeline was able to restart after raising akka.framesize to 100 MB and giving it more heap.
A savepoint was then taken, but it produced an even bigger metadata file (800 MB). It is available at swift://updater.thanos-swift/wdqs_streaming_updater/savepoints/savepoint-f6a960-fdd300f4e05b.

AC:

understand what caused the growth to the _metadata file
fix the underlying issue

Details

	Subject	Repo	Branch	Lines +/-
	cleanup: Remove needless file sink outputs	wikidata/query/rdf	master	+9 -86


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Gehel	T244590 [Epic] Rework the WDQS updater as an event driven application
		Declined		dcausse	T286890 Checkpoint _metadata has grown up to 70Mb

Event Timeline

dcausse created this task. · Jul 19 2021, 9:51 AM

Restricted Application added a subscriber: Aklapper. · Jul 19 2021, 9:51 AM

Maintenance_bot added a project: Wikidata. · Jul 19 2021, 10:45 AM

MPhamWMF moved this task from Incoming to Current work on the Wikidata-Query-Service board. · Jul 19 2021, 3:28 PM

MPhamWMF added a project: Discovery-Search (Current work).

Gehel added a parent task: T244590: [Epic] Rework the WDQS updater as an event driven application. · Jul 19 2021, 3:28 PM

MPhamWMF set the point value for this task to 5. · Jul 19 2021, 3:32 PM

MPhamWMF moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board.

dcausse claimed this task. · Sep 16 2021, 6:48 AM

dcausse moved this task from Ready for Dev -- SWE to Incoming on the Discovery-Search (Current work) board.

dcausse moved this task from Incoming to In Progress on the Discovery-Search (Current work) board. · Sep 16 2021, 7:40 AM

Change 721812 had a related patch set uploaded (by DCausse; author: DCausse):

[wikidata/query/rdf@master] cleanup: Remove needless file sink outputs

https://gerrit.wikimedia.org/r/721812

gerritbot added a project: Patch-For-Review. · Sep 17 2021, 1:15 PM

I analyzed the large _metadata file: it contains 3 operators with very large states, especially max-part-counter, which is owned by StreamingFileSink. This state is cleared when org.apache.flink.streaming.api.checkpoint.CheckpointedFunction#snapshotState is called, which only happens on a SinkFunction that implements that interface. The only file sinks we used were the side outputs, back when we stored them in HDFS. Prior to https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/695295 we wrapped the file sink and the Kafka sink in a generic SinkFunction to unify the serialization; this caused code relying on instanceof CheckpointedFunction to not work properly. It is very likely that the broken _metadata file was generated because of this.
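
To illustrate the suspected mechanism, here is a minimal sketch of how a generic wrapper can hide a marker interface from an instanceof check. It does not use Flink: Sink, Checkpointed, FileLikeSink, SerializingWrapper, CheckpointAwareWrapper and MiniRuntime are all hypothetical stand-ins invented for this example, and the real StreamingFileSink state handling is more involved than a list that gets cleared.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for SinkFunction and CheckpointedFunction (not the Flink types).
interface Sink<T> { void invoke(T value); }
interface Checkpointed { void snapshotState(); }

// A sink whose internal state is only cleared when snapshotState() is called.
class FileLikeSink implements Sink<String>, Checkpointed {
    final List<String> pending = new ArrayList<>();
    public void invoke(String value) { pending.add(value); }
    public void snapshotState() { pending.clear(); }
}

// Generic wrapper that implements Sink only: the delegate's Checkpointed
// interface is no longer visible to whoever holds the wrapper.
class SerializingWrapper<T> implements Sink<T> {
    private final Sink<T> delegate;
    SerializingWrapper(Sink<T> delegate) { this.delegate = delegate; }
    public void invoke(T value) { delegate.invoke(value); }
}

// One possible remedy (an assumption, not the merged patch): the wrapper
// implements Checkpointed itself and forwards the call to its delegate.
class CheckpointAwareWrapper<T> implements Sink<T>, Checkpointed {
    private final Sink<T> delegate;
    CheckpointAwareWrapper(Sink<T> delegate) { this.delegate = delegate; }
    public void invoke(T value) { delegate.invoke(value); }
    public void snapshotState() {
        if (delegate instanceof Checkpointed) {
            ((Checkpointed) delegate).snapshotState();
        }
    }
}

// Mimics a runtime that snapshots a sink only if it is marked with the interface.
class MiniRuntime {
    static boolean checkpoint(Sink<?> sink) {
        if (sink instanceof Checkpointed) {
            ((Checkpointed) sink).snapshotState();
            return true;
        }
        return false;
    }
}

class WrapperDemo {
    public static void main(String[] args) {
        FileLikeSink inner = new FileLikeSink();
        Sink<String> wrapped = new SerializingWrapper<>(inner);
        wrapped.invoke("record");
        // The instanceof check fails on the wrapper, so the inner state is never cleared.
        System.out.println("snapshot called: " + MiniRuntime.checkpoint(wrapped));
        System.out.println("pending entries: " + inner.pending.size());
    }
}
```

In this toy model the state of the wrapped sink keeps growing across checkpoints because snapshotState is never reached, which is consistent with the hypothesis above but does not prove it for the real pipeline.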

I propose to decline this and clean up the code to remove all the file sinks, since we no longer need them (even though I think they no longer cause issues).

Change 721812 merged by jenkins-bot: