Streaming Updater should still make forward progress when one index has problems
Closed, Resolved | Public | 5 Estimated Story Points

Assigned To

	pfischer

Authored By

	EBernhardson
	Feb 7 2024, 10:26 PM

Description

The wikidatawiki_content index went red in cloudelastic today. That's not great, but it's only one index and we have recovery procedures. However, the single red index appears to have caused the streaming updater to stop operating entirely: the taskmanager failed in the sink writer and could not make forward progress until the index was restored. The specific stack trace here suggests that the timeouts are not aligned between Elasticsearch and our HTTP library, but ideally we should test this scenario directly.

Stack Trace:

org.apache.flink.util.FlinkRuntimeException: Complete bulk has failed.
    at org.apache.flink.connector.elasticsearch.sink.ElasticsearchWriter$BulkListener.lambda$afterBulk$1(ElasticsearchWriter.java:239)
    at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
    at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMail(MailboxProcessor.java:398)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsNonBlocking(MailboxProcessor.java:383)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:345)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:229)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:839)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:788)
    at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952)
    at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:931)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-5 [ACTIVE]
    at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:387)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39)
    at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:261)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:502)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:211)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
    ... 1 more
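
One way to read the trace above: the HTTP client gave up after 30,000 ms, while the bulk request was presumably still waiting server-side on the red index's shards (Elasticsearch documents a default bulk `timeout` of 1m). If the server-side wait is shorter than the client socket timeout, Elasticsearch gets a chance to answer with per-item failures before the connection is dropped. The sketch below only illustrates that arithmetic; `BulkTimeouts`, its methods, and the 5 s headroom are hypothetical and are not taken from the updater's code.

```java
import java.time.Duration;

// Hypothetical helper for illustration only; not part of the streaming updater.
final class BulkTimeouts {
    private BulkTimeouts() {}

    /**
     * Derives a server-side bulk timeout that expires before the client's
     * socket timeout, leaving `headroom` for the response to travel back.
     */
    static Duration serverTimeoutFor(Duration clientSocketTimeout, Duration headroom) {
        if (headroom.isNegative() || headroom.isZero()) {
            throw new IllegalArgumentException("headroom must be positive");
        }
        Duration server = clientSocketTimeout.minus(headroom);
        if (server.isNegative() || server.isZero()) {
            throw new IllegalArgumentException("headroom must be smaller than the socket timeout");
        }
        return server;
    }

    /** True when the server is guaranteed to give up before the client does. */
    static boolean isAligned(Duration clientSocketTimeout, Duration serverTimeout) {
        return serverTimeout.compareTo(clientSocketTimeout) < 0;
    }

    /** Formats a duration as a millisecond time value, e.g. "25000ms". */
    static String asTimeValue(Duration d) {
        return d.toMillis() + "ms";
    }
}
```

With the 30 s socket timeout from the trace and 5 s of headroom, `asTimeValue(serverTimeoutFor(...))` gives `25000ms`, whereas a 1m server-side wait against a 30 s socket timeout is reported as not aligned.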

AC:

If some indices are unwritable, the writes to those indices should be marked as failed and writes to all other indices should continue.
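
The acceptance criterion can be sketched as per-item handling of a bulk response: instead of failing the whole batch, each item is classified by its own status, so writes to healthy indices are acknowledged while writes to the broken index are set aside as failed. This is a minimal sketch under assumptions: `PartialBulkHandling`, `ItemResult`, and `Outcome` are hypothetical types, and treating any status of 400 or above as a failure is an illustrative choice, not the sink's actual policy.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types for illustration; not Flink's or the updater's classes.
final class PartialBulkHandling {
    private PartialBulkHandling() {}

    /** Per-item result of a bulk request: target index, document id, HTTP-style status. */
    static final class ItemResult {
        final String index;
        final String docId;
        final int status;

        ItemResult(String index, String docId, int status) {
            this.index = index;
            this.docId = docId;
            this.status = status;
        }
    }

    /** Items split by outcome, so failures can be reported without blocking the rest. */
    static final class Outcome {
        final List<ItemResult> succeeded = new ArrayList<>();
        final List<ItemResult> failed = new ArrayList<>();
    }

    /** Classifies each item on its own status instead of failing the whole batch. */
    static Outcome classify(List<ItemResult> results) {
        Outcome outcome = new Outcome();
        for (ItemResult item : results) {
            if (item.status >= 400) {
                outcome.failed.add(item);    // assumed failure threshold, e.g. 503 for a red index
            } else {
                outcome.succeeded.add(item); // the write went through; nothing further to do
            }
        }
        return outcome;
    }
}
```

In this sketch a batch that mixes one item for an unavailable index with items for healthy indices yields one entry in `failed` and the rest in `succeeded`, so the caller can record the failure and keep processing.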

Details

Subject	Repo	Branch	Lines +/-
Search update pipeline: bump version	operations/deployment-charts	master	+5 -2
Search update pipeline: bump version	operations/deployment-charts	master	+1 -1
Search update pipeline: bump version	operations/deployment-charts	master	+1 -1
Search update pipeline: bump version	operations/deployment-charts	master	+1 -1

Title	Reference	Author	Source Branch	Dest Branch
Retry on 503 from ES	repos/search-platform/cirrus-streaming-updater!115	pfischer	retry-503	main
Allow overriding max. bulk request size	repos/search-platform/cirrus-streaming-updater!112	pfischer	es-sink-configuration	main
Do not fail over leftover bytes during deserialization	repos/search-platform/cirrus-streaming-updater!111	pfischer	fix-es-writer-deserialization	main
Handle faulty indices, by synchronizing timeouts	repos/search-platform/cirrus-streaming-updater!107	pfischer	es-response-timeout	main

Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T317045 [Epic] Re-architect the Search Update Pipeline
		Resolved		pfischer	T356933 Streaming Updater should still make forward progress when one index has problems