
Streaming Updater should still make forward progress when one index has problems
Closed, Resolved | Public | 5 Estimated Story Points

Description

The wikidatawiki_content index went red in cloudelastic today. That's not great, but it's only one index and we have recovery procedures. However, the single red index seems to have caused the streaming updater to stop operating completely: the taskmanager failed in the sink writer and could not make forward progress until the index was restored. The stack trace below suggests that the timeouts are not aligned between Elasticsearch and our HTTP library, but ideally we should test this scenario directly.

Stack Trace:

org.apache.flink.util.FlinkRuntimeException: Complete bulk has failed.
    at org.apache.flink.connector.elasticsearch.sink.ElasticsearchWriter$BulkListener.lambda$afterBulk$1(ElasticsearchWriter.java:239)
    at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
    at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMail(MailboxProcessor.java:398)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsNonBlocking(MailboxProcessor.java:383)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:345)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:229)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:839)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:788)
    at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952)
    at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:931)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-5 [ACTIVE]
    at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:387)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39)
    at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:261)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:502)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:211)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
    ... 1 more
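
The trace above shows the HTTP client giving up after 30,000 ms while the bulk request was still in flight. One way to read "unaligned timeouts" is that the client's socket timeout is not longer than the time Elasticsearch itself is allowed to wait on the bulk request, so the client aborts before the server can report per-item failures. The following is a minimal sketch of that invariant in plain Java. The class `TimeoutAlignment`, its method names, and the margin value are hypothetical and are not taken from the streaming updater's code; the actual configured values are not stated in this task.

```java
import java.time.Duration;

/** Hypothetical helper: derives a client-side socket timeout from the server-side bulk timeout. */
final class TimeoutAlignment {
    private TimeoutAlignment() {}

    /**
     * Returns a socket timeout strictly longer than the timeout the server applies to the
     * bulk request, so that the server gets the chance to answer before the client gives up.
     */
    static Duration socketTimeoutFor(Duration bulkTimeout, Duration margin) {
        if (bulkTimeout.isNegative() || bulkTimeout.isZero()) {
            throw new IllegalArgumentException("bulkTimeout must be positive");
        }
        if (margin.isNegative() || margin.isZero()) {
            throw new IllegalArgumentException("margin must be positive");
        }
        return bulkTimeout.plus(margin);
    }

    /** True when the client waits longer than the server is allowed to take. */
    static boolean isAligned(Duration bulkTimeout, Duration socketTimeout) {
        return socketTimeout.compareTo(bulkTimeout) > 0;
    }
}
```

Under these assumptions, a 30-second socket timeout paired with a server-side wait of one minute would count as unaligned, whereas a 20-second server-side wait plus a 10-second margin would not.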

AC:

  • If some indices are unwritable, the writes to those indices should be marked as failed and other writes should continue.
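
The acceptance criterion above can be sketched as a per-item treatment of a bulk response: rather than failing the whole bulk, the failed items are grouped by index and the remaining items count as written. This is an illustration only, assuming each bulk item carries its target index and an HTTP-style status code. `BulkOutcomeSketch`, `ItemResult`, and the index name `enwiki_content` used in the usage note are hypothetical; this is not the sink writer's actual implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: separate failed bulk items per index without failing the whole bulk. */
final class BulkOutcomeSketch {
    private BulkOutcomeSketch() {}

    /** One item of a bulk response (hypothetical type, not an Elasticsearch or Flink class). */
    static final class ItemResult {
        final String index;
        final String docId;
        final int status;

        ItemResult(String index, String docId, int status) {
            this.index = index;
            this.docId = docId;
            this.status = status;
        }

        /** Treats any 4xx or 5xx status as a failed write. */
        boolean failed() {
            return status >= 400;
        }
    }

    /** Groups failed items by index, preserving the order in which indices were first seen. */
    static Map<String, List<ItemResult>> failuresByIndex(List<ItemResult> items) {
        Map<String, List<ItemResult>> failures = new LinkedHashMap<>();
        for (ItemResult item : items) {
            if (item.failed()) {
                failures.computeIfAbsent(item.index, k -> new ArrayList<>()).add(item);
            }
        }
        return failures;
    }

    /** Counts the items that were written successfully. */
    static long countSucceeded(List<ItemResult> items) {
        long succeeded = 0;
        for (ItemResult item : items) {
            if (!item.failed()) {
                succeeded++;
            }
        }
        return succeeded;
    }
}
```

For example, a bulk with two items for wikidatawiki_content that return 503 and one item for enwiki_content that returns 200 would yield a single failure group of size two and one successful write, so the pipeline could record the failures and keep going.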

Details

Title | Reference | Author | Source Branch | Dest Branch
Retry on 503 from ES | repos/search-platform/cirrus-streaming-updater!115 | pfischer | retry-503 | main
Allow overriding max. bulk request size | repos/search-platform/cirrus-streaming-updater!112 | pfischer | es-sink-configuration | main
Do not fail over leftover bytes during deserialization | repos/search-platform/cirrus-streaming-updater!111 | pfischer | fix-es-writer-deserialization | main
Handle faulty indices, by synchronizing timeouts | repos/search-platform/cirrus-streaming-updater!107 | pfischer | es-response-timeout | main

Event Timeline

Gehel set the point value for this task to 5. (Feb 12 2024, 4:26 PM)

Change #1015276 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[operations/deployment-charts@master] Search update pipeline: bump version

https://gerrit.wikimedia.org/r/1015276

Change #1015276 merged by jenkins-bot:

[operations/deployment-charts@master] Search update pipeline: bump version

https://gerrit.wikimedia.org/r/1015276

Change #1015337 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[operations/deployment-charts@master] Search update pipeline: bump version

https://gerrit.wikimedia.org/r/1015337

Change #1015337 merged by jenkins-bot:

[operations/deployment-charts@master] Search update pipeline: bump version

https://gerrit.wikimedia.org/r/1015337

Change #1016359 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[operations/deployment-charts@master] Search update pipeline: bump version

https://gerrit.wikimedia.org/r/1016359

Change #1016359 merged by jenkins-bot:

[operations/deployment-charts@master] Search update pipeline: bump version

https://gerrit.wikimedia.org/r/1016359

Change #1016861 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[operations/deployment-charts@master] Search update pipeline: bump version

https://gerrit.wikimedia.org/r/1016861

Change #1016861 merged by jenkins-bot:

[operations/deployment-charts@master] Search update pipeline: bump version

https://gerrit.wikimedia.org/r/1016861