ElasticSearch Not enough active copies to meet write consistency
Closed, Resolved · Public · PRODUCTION ERROR

Description

There are a lot of errors of the form: "Search backend error during sending {numBulk} documents to the {indexType} index after {took}: {message}".

Example:

UnavailableShardsException[[ptwiki_content_first][4] Not enough active copies to meet write consistency of [QUORUM] (have 1, needed 2).
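
For context, the "(have 1, needed 2)" figures can be reproduced with a small sketch of the quorum arithmetic. It assumes the pre-5.0 Elasticsearch rule that a quorum is only enforced when a shard has more than two copies, and it assumes two replicas per shard, which this task does not state. `required_active_copies` and `write_allowed` are hypothetical helpers, not Elasticsearch APIs.

```python
def required_active_copies(number_of_replicas: int, consistency: str = "quorum") -> int:
    """Active shard copies needed before a write is accepted (sketch)."""
    total_copies = 1 + number_of_replicas  # primary plus its replicas
    if consistency == "all":
        return total_copies
    if consistency == "quorum" and total_copies > 2:
        return total_copies // 2 + 1
    # "one", or quorum with at most two copies in total
    return 1


def write_allowed(active_copies: int, number_of_replicas: int,
                  consistency: str = "quorum") -> bool:
    """True when enough copies are active to satisfy the consistency level."""
    return active_copies >= required_active_copies(number_of_replicas, consistency)


# Assumed scenario: 2 replicas (3 copies in total), only the primary is active.
print(required_active_copies(2))  # 2
print(write_allowed(1, 2))        # False
```

Under these assumptions a shard with only its primary assigned rejects writes until at least one replica has recovered, which matches the errors seen while the cluster was recovering.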

This happens on all wikis and on both wmf branches.

https://logstash.wikimedia.org/#dashboard/temp/AVRYZCfBO3D718AOMkHH

Event Timeline

Restricted Application added a subscriber: Aklapper.

During deployment of T110236 we started to see erratic behaviour of the codfw elasticsearch cluster. All traffic has been routed to eqiad so that the codfw cluster can undergo a full restart. Cluster recovery after the restart is taking longer than expected. During that time, updates to the codfw elasticsearch cluster are expected to fail. A reindex of lost updates will be needed after recovery.

Mentioned in SAL [2016-04-27T16:58:22Z] <gehel> increase throttling limit and concurrency on recoveries for elasticsearch codfw cluster (T133784)
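
A minimal sketch of what raising those limits could look like as a transient cluster settings update. The SAL entry does not record which settings or values were changed, so everything here is an assumption: the setting names are the recovery-related ones from Elasticsearch 1.x/2.x and should be checked against the version in use, the values are placeholders, and `build_recovery_settings` is a hypothetical helper.

```python
import json


def build_recovery_settings(max_bytes_per_sec: str,
                            node_concurrent_recoveries: int) -> dict:
    """Build a request body for a transient cluster settings update (sketch)."""
    return {
        "transient": {
            # Per-node throttle on recovery traffic (assumed setting name).
            "indices.recovery.max_bytes_per_sec": max_bytes_per_sec,
            # Shard recoveries allowed in parallel per node (assumed setting name).
            "cluster.routing.allocation.node_concurrent_recoveries":
                node_concurrent_recoveries,
        }
    }


# Placeholder values, not the ones used during this incident.
body = build_recovery_settings("200mb", 4)
print(json.dumps(body, indent=2))
```

A body like this would be sent with a PUT request to the cluster's `/_cluster/settings` endpoint. Transient settings do not survive a full cluster restart, so they revert on their own once the recovery work is done and the cluster is next restarted.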

Change 285698 had a related patch set uploaded (by EBernhardson):
Stop pushing elasticsearch writes to codfw

https://gerrit.wikimedia.org/r/285698

Change 285698 merged by jenkins-bot:
Stop pushing elasticsearch writes to codfw

https://gerrit.wikimedia.org/r/285698

Mentioned in SAL [2016-04-27T18:55:09Z] <ebernhardson@tin> Synchronized wmf-config/CirrusSearch-production.php: Drop codfw from elasticsearch config T133784 (duration: 00m 25s)

Mentioned in SAL [2016-04-27T18:55:55Z] <ebernhardson@tin> Synchronized wmf-config/InitialiseSettings.php: Drop codfw from elasticsearch config T133784 (duration: 00m 36s)

Mentioned in SAL [2016-04-27T21:00:15Z] <ebernhardson@tin> Synchronized wmf-config/CirrusSearch-production.php: Restore codfw to elasticsearch config T133784 (duration: 00m 37s)

Mentioned in SAL [2016-04-27T21:01:04Z] <ebernhardson@tin> Synchronized wmf-config/InitialiseSettings.php: Restore codfw to elasticsearch config T133784 (duration: 00m 31s)

The cluster master appears to have gotten into a bad state. We ended up stopping all reads and writes to the cluster. This on its own didn't fix the problem, so we restarted the master node. The cluster came back into a good state after that, and we re-enabled writes.

The log spam is gone:

Capture d’écran 2016-04-28 à 12.23.44.png (379×814 px, 34 KB)

Thank you!

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:10 PM