
Investigate EOFException when performing the first checkpoint after restoring from a savepoint
Closed, ResolvedPublic

Description

While deploying a new version of the streaming-updater (0.3.103), Flink failed with:

java.io.EOFException
	at java.base/java.io.DataInputStream.readFully(DataInputStream.java:202)
	at java.base/java.io.DataInputStream.readFully(DataInputStream.java:170)
	at org.apache.flink.api.common.typeutils.base.array.BytePrimitiveArraySerializer.deserialize(BytePrimitiveArraySerializer.java:82)
	at org.apache.flink.contrib.streaming.state.restore.RocksDBFullRestoreOperation.restoreKVStateData(RocksDBFullRestoreOperation.java:229)
	at org.apache.flink.contrib.streaming.state.restore.RocksDBFullRestoreOperation.restoreKeyGroupsInStateHandle(RocksDBFullRestoreOperation.java:158)
	at org.apache.flink.contrib.streaming.state.restore.RocksDBFullRestoreOperation.restore(RocksDBFullRestoreOperation.java:142)
	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:284)
	at org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createKeyedStateBackend(RocksDBStateBackend.java:587)
	at org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createKeyedStateBackend(RocksDBStateBackend.java:93)
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:328)
	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:168)
	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:345)
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:163)
	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:272)
	at org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:425)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$2(StreamTask.java:535)
	at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:525)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:565)
	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:755)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:570)
	at java.base/java.lang.Thread.run(Thread.java:834)
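The EOFException at the top of the trace comes from DataInputStream.readFully, which throws when the underlying stream ends before the requested number of bytes is available. BytePrimitiveArraySerializer deserializes a byte[] by reading a 4-byte length prefix and then calling readFully for the payload, so a savepoint stream that is cut short mid-record fails in exactly this way. A minimal standalone sketch of that failure mode (plain JDK, no Flink; the record layout mirrors the serializer, the byte values are made up):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

public class ReadFullyDemo {
    // Reads a length-prefixed byte[] record the way Flink's
    // BytePrimitiveArraySerializer does: a 4-byte length, then the payload.
    static byte[] readRecord(byte[] raw) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw));
        int len = in.readInt();      // length prefix
        byte[] buf = new byte[len];
        in.readFully(buf);           // throws EOFException if the stream ends early
        return buf;
    }

    public static void main(String[] args) throws IOException {
        // Intact record: prefix says 3 bytes, and 3 bytes follow.
        byte[] ok = {0, 0, 0, 3, 1, 2, 3};
        System.out.println("intact record: " + readRecord(ok).length + " bytes");

        // Truncated record: prefix says 8 bytes but only 3 follow,
        // mimicking a savepoint stream that was closed mid-read.
        byte[] truncated = {0, 0, 0, 8, 1, 2, 3};
        try {
            readRecord(truncated);
        } catch (EOFException e) {
            System.out.println("truncated record: EOFException");
        }
    }
}
```

This is consistent with the Swift log line further down: if the connection to the object store is closed mid-stream, the restore path sees a truncated stream rather than an explicit I/O error.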

Notes:

  • the problem occurs with two different savepoints:
    • (thanos) rdf-streaming-updater-codfw/commons/savepoints/deploy_0_3_103/savepoint-c4b021-9c4cd6541ec7
    • (thanos) rdf-streaming-updater-codfw/commons/savepoints/savepoint-c4b021-818dd669a47e/
  • the exception was initially only visible on a taskmanager pod running on kubernetes2003 (logs), but the problem was later triggered on another node as well
  • the problem also occurs when loading the savepoints with the previous version, 0.3.99
  • the deploy worked fine on staging for both wdqs and wcqs, and also on wcqs at codfw
  • the system was able to resume using 0.3.99 and a previous checkpoint: rdf-streaming-updater-codfw/wikidata/checkpoints/e245dd1e76d56d9ded351b27cf2d4c2a/chk-415014
  • the flink app works fine in YARN using one of the savepoints

AC:

  • understand the root cause of the failure

Event Timeline

Root cause seems to be Swift-related:

Saw this in taskmanager logs:

Received IOException while reading 'swift://rdf-streaming-updater-codfw.thanos-swift/wikidata/savepoints/savepoint-0d1c37-86ed4cb29023/bc3ce8ed-70b2-4e91-a81b-07f585dd0f1f', attempting to reopen: org.apache.flink.fs.openstackhadoop.shaded.org.apache.hadoop.fs.swift.exceptions.SwiftConnectionClosedException: Connection to Swift service has been closed: read() -all data consumed

logged by org.apache.flink.fs.openstackhadoop.shaded.org.apache.hadoop.fs.swift.snative.SwiftNativeInputStream

Switching to the S3 client fixed this issue: all jobs are now running streaming-updater 0.3.104 and use the S3 client to persist their durable state.
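For reference, pointing Flink's durable state at an S3-compatible endpoint instead of swift:// follows the pattern from the Flink S3 filesystem documentation. A sketch of the relevant flink-conf.yaml keys; the endpoint, bucket/path, and credentials below are placeholders, not the production values:

```yaml
# Persist checkpoints/savepoints via the S3 filesystem instead of swift://.
# Requires the flink-s3-fs-presto (or flink-s3-fs-hadoop) jar in plugins/.
state.checkpoints.dir: s3://rdf-streaming-updater-codfw/wikidata/checkpoints  # placeholder bucket/path
state.savepoints.dir: s3://rdf-streaming-updater-codfw/wikidata/savepoints    # placeholder bucket/path
s3.endpoint: https://thanos-swift.example.org   # placeholder endpoint
s3.path.style.access: true                      # bucket in the path, not the hostname
s3.access-key: <access-key>
s3.secret-key: <secret-key>
```

Thanos Swift exposes an S3-compatible API, which is what makes the client swap possible without moving the data.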

Gehel claimed this task.