
Investigate EOFException when performing the first checkpoint after restoring from a savepoint
Closed, ResolvedPublic

Description

While deploying a new version of the streaming-updater (0.3.103), Flink failed with:

java.io.EOFException
	at java.base/java.io.DataInputStream.readFully(DataInputStream.java:202)
	at java.base/java.io.DataInputStream.readFully(DataInputStream.java:170)
	at org.apache.flink.api.common.typeutils.base.array.BytePrimitiveArraySerializer.deserialize(BytePrimitiveArraySerializer.java:82)
	at org.apache.flink.contrib.streaming.state.restore.RocksDBFullRestoreOperation.restoreKVStateData(RocksDBFullRestoreOperation.java:229)
	at org.apache.flink.contrib.streaming.state.restore.RocksDBFullRestoreOperation.restoreKeyGroupsInStateHandle(RocksDBFullRestoreOperation.java:158)
	at org.apache.flink.contrib.streaming.state.restore.RocksDBFullRestoreOperation.restore(RocksDBFullRestoreOperation.java:142)
	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:284)
	at org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createKeyedStateBackend(RocksDBStateBackend.java:587)
	at org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createKeyedStateBackend(RocksDBStateBackend.java:93)
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:328)
	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:168)
	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:345)
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:163)
	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:272)
	at org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:425)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$2(StreamTask.java:535)
	at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:525)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:565)
	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:755)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:570)
	at java.base/java.lang.Thread.run(Thread.java:834)
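The EOFException at the top of the trace comes from DataInputStream.readFully, which throws when the underlying stream ends before the requested number of bytes is available. BytePrimitiveArraySerializer deserializes a byte[] by reading a 4-byte length prefix and then calling readFully for the payload, so a savepoint stream that is cut short mid-record fails in exactly this way. A minimal standalone sketch of that failure mode (plain JDK, no Flink; the record layout mirrors the serializer, the byte values are made up):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

public class ReadFullyDemo {
    // Reads a length-prefixed byte[] record the way Flink's
    // BytePrimitiveArraySerializer does: a 4-byte length, then the payload.
    static byte[] readRecord(byte[] raw) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw));
        int len = in.readInt();      // length prefix
        byte[] buf = new byte[len];
        in.readFully(buf);           // throws EOFException if the stream ends early
        return buf;
    }

    public static void main(String[] args) throws IOException {
        // Intact record: prefix says 3 bytes, and 3 bytes follow.
        byte[] ok = {0, 0, 0, 3, 1, 2, 3};
        System.out.println("intact record: " + readRecord(ok).length + " bytes");

        // Truncated record: prefix says 8 bytes but only 3 follow,
        // mimicking a savepoint stream that was closed mid-read.
        byte[] truncated = {0, 0, 0, 8, 1, 2, 3};
        try {
            readRecord(truncated);
        } catch (EOFException e) {
            System.out.println("truncated record: EOFException");
        }
    }
}
```

This is consistent with the Swift log line further down: if the connection to the object store is closed mid-stream, the restore path sees a truncated stream rather than an explicit I/O error.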

Notes:

  • the problem occurs with two different savepoints:
    • (thanos) rdf-streaming-updater-codfw/commons/savepoints/deploy_0_3_103/savepoint-c4b021-9c4cd6541ec7
    • (thanos) rdf-streaming-updater-codfw/commons/savepoints/savepoint-c4b021-818dd669a47e/
  • the exception was initially only visible on a taskmanager pod running on kubernetes2003 (logs), but the problem was later triggered on another node as well
  • the problem also occurs when loading the savepoints with the previous version, 0.3.99
  • the deploy worked fine on staging for both wdqs and wcqs, and also on wcqs at codfw
  • the system was able to resume using 0.3.99 and a previous checkpoint: rdf-streaming-updater-codfw/wikidata/checkpoints/e245dd1e76d56d9ded351b27cf2d4c2a/chk-415014
  • the flink app works fine in YARN using one of the savepoints

AC:

  • understand the root cause of the failure

Event Timeline

Root cause seems to be Swift-related:

Saw this in taskmanager logs:

Received IOException while reading 'swift://rdf-streaming-updater-codfw.thanos-swift/wikidata/savepoints/savepoint-0d1c37-86ed4cb29023/bc3ce8ed-70b2-4e91-a81b-07f585dd0f1f', attempting to reopen: org.apache.flink.fs.openstackhadoop.shaded.org.apache.hadoop.fs.swift.exceptions.SwiftConnectionClosedException: Connection to Swift service has been closed: read() -all data consumed

logged by org.apache.flink.fs.openstackhadoop.shaded.org.apache.hadoop.fs.swift.snative.SwiftNativeInputStream

Switching to the S3 client fixed this issue: all jobs are now running streaming-updater 0.3.104 and use the S3 client to persist their durable state.
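For reference, pointing Flink's durable state at an S3-compatible endpoint instead of swift:// follows the pattern from the Flink S3 filesystem documentation. A sketch of the relevant flink-conf.yaml keys; the endpoint, bucket/path, and credentials below are placeholders, not the production values:

```yaml
# Persist checkpoints/savepoints via the S3 filesystem instead of swift://.
# Requires the flink-s3-fs-presto (or flink-s3-fs-hadoop) jar in plugins/.
state.checkpoints.dir: s3://rdf-streaming-updater-codfw/wikidata/checkpoints  # placeholder bucket/path
state.savepoints.dir: s3://rdf-streaming-updater-codfw/wikidata/savepoints    # placeholder bucket/path
s3.endpoint: https://thanos-swift.example.org   # placeholder endpoint
s3.path.style.access: true                      # bucket in the path, not the hostname
s3.access-key: <access-key>
s3.secret-key: <secret-key>
```

Thanos Swift exposes an S3-compatible API, which is what makes the client swap possible without moving the data.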

Gehel claimed this task.