
Investigate eqiad cluster quorum failure issues
Closed, ResolvedPublic

Description

We have seen 3 cluster quorum failures hit the primary eqiad CirrusSearch cluster (chi), two of which were user-impacting.

The times were:

  • 2025-07-07 17:10-17:40 UTC (ref T398856)
  • 2025-07-21 23:52 - 2025-07-22 00:17:45 UTC
  • 2025-07-23 22:39:01 - 22:43:55 <- appeared to clear on its own

Creating this ticket to document our investigations.

Some observations:

  • It appears that restarting the active master is enough to trigger the cluster quorum loss.
  • When the cluster is down, it triggers alerts in #wikimedia-traffic such as FermMSS: Unexpected MSS value on 10.2.2.30:9200 @ cirrussearch1122. These only ever seem to trigger on the master hosts, which is interesting.
  • Other clusters and environments do not seem to be affected.
  • There were network issues related to row E and F cirrussearch hosts in T393911 - but, we also have master hosts for the smaller clusters in rows E and F, and they don't have these quorum failures.
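Since restarting the active master is the trigger, the first step in any reproduction is confirming which node is currently elected. A rough sketch, assuming you are on a cluster node with the HTTP API on localhost:9200 (on newer OpenSearch versions `_cat/master` is aliased to `_cat/cluster_manager`):

```shell
# Which node is currently the elected master?
curl -s 'localhost:9200/_cat/master?v'

# Ask this specific node for its local view of the master
# (?local=true avoids forwarding the request to the elected master):
curl -s 'localhost:9200/_cluster/state/master_node?local=true'
```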

Event Timeline

bking renamed this task from Write incident report for cirrussearch outage 2025-07-21 23:52 2025-07-22 00:17:45 to Investigate eqiad cluster quorum failure issues.Jul 23 2025, 11:14 PM
bking updated the task description.

Mentioned in SAL (#wikimedia-operations) [2025-07-23T23:15:03Z] <inflatador> pool cirrussearch eqiad, will resume investigations tomorrow T400160

I've been using this Elastic doc as my primary source for troubleshooting info.

Digging through the logs on the masters we see a massive amount of master changes during the cluster instability:

 $ ansible eqiad_chi_masters -m shell -a 'cat  /var/log/opensearch/production-search-eqiad.log.1 | grep -c "elected-as-master"'
cirrussearch1100.eqiad.wmnet | CHANGED | rc=0 >>
2087
cirrussearch1081.eqiad.wmnet | CHANGED | rc=0 >>
533
cirrussearch1122.eqiad.wmnet | FAILED | rc=1 >>
0
non-zero return code
cirrussearch1094.eqiad.wmnet | CHANGED | rc=0 >>
3162
cirrussearch1074.eqiad.wmnet | CHANGED | rc=0 >>
3808

Compare to normal operating conditions:

$ ansible eqiad_chi_masters -m shell -a 'zcat /var/log/opensearch/production-search-eqiad.log.3 | grep -c "elected-as-master"'
cirrussearch1094.eqiad.wmnet | CHANGED | rc=0 >>
6
cirrussearch1081.eqiad.wmnet | CHANGED | rc=0 >>
18
cirrussearch1100.eqiad.wmnet | CHANGED | rc=0 >>
8
cirrussearch1074.eqiad.wmnet | CHANGED | rc=0 >>
6
cirrussearch1122.eqiad.wmnet | CHANGED | rc=0 >>
10
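To make the contrast concrete, here is a minimal Python sketch comparing the counts above against a sanity threshold (the threshold of 100 is an arbitrary illustrative cutoff, not a tuned value):

```python
def flag_unstable(counts, threshold=100):
    """Return node names whose 'elected-as-master' log count is far above normal."""
    return sorted(node for node, n in counts.items() if n > threshold)

# Counts taken from the ansible output above.
incident = {"cirrussearch1100": 2087, "cirrussearch1081": 533,
            "cirrussearch1122": 0, "cirrussearch1094": 3162,
            "cirrussearch1074": 3808}
normal = {"cirrussearch1094": 6, "cirrussearch1081": 18,
          "cirrussearch1100": 8, "cirrussearch1074": 6,
          "cirrussearch1122": 10}

print(flag_unstable(incident))  # four of the five masters
print(flag_unstable(normal))    # []
```

During the incident, four of the five master-eligible nodes logged hundreds to thousands of elections per rotated log; under normal conditions none exceeds ~20.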

There are some expert-level cluster settings around master elections that we could try to change, but the questions remain:

  • Why has this never been a problem until we switched to OpenSearch?
  • Why doesn't it affect other clusters/environments?

For the next step, I'd recommend adding a new master from one of the older rows and removing one of the row E/F masters to see if we can reproduce this. We could also experiment with setting voting exclusions before we restart a particular node (although that's more of a workaround than a solution).
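The voting-exclusion workaround would look roughly like this (a sketch; the node name is taken from the join-failure logs below, and localhost:9200 is an assumption):

```shell
# Before restarting a master-eligible node, exclude it from the
# voting configuration so the remaining masters keep quorum:
curl -s -X POST \
  'localhost:9200/_cluster/voting_config_exclusions?node_names=cirrussearch1074-production-search-eqiad'

# ... restart the node ...

# Afterwards, clear the exclusions so the node can vote again:
curl -s -X DELETE 'localhost:9200/_cluster/voting_config_exclusions'
```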

Per an IRC conversation in Wikimedia-Search, @EBernhardson saw a 503 "master not discovered" exception in relforge-alpha when trying to load some data.

We saw the following errors:
A NullPointerException due to org.wikimedia.search.extra.analysis.ukrainian.UkrainianStopFilterFactory.getStopwords(UkrainianStopFilterFactory.java:31)

^^ We also saw this error in the logs during the production cluster quorum failures.

ERROR OpenSearchJsonLayout contains invalid attributes "compact", "complete"
ERROR Could not create plugin of type class org.opensearch.common.logging.OpenSearchJsonLayout for element OpenSearchJsonLayout: java.lang.IllegalArgumentException: layout parameter 'type_name' cannot be empty java.lang.IllegalArgumentException: layout parameter 'type_name' cannot be empty

That one's new to me, although it suggests that we may need to look closer at T395571. I doubt it explains any of the quorum problems, but it's worth documenting just in case.

We should also look at the performance governor settings for the master-eligible hosts (for all clusters/environments).
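For that governor check, something like the following one-liner (mirroring the ansible style above; the sysfs path is the standard Linux cpufreq location) should show whether any master-eligible host is running a non-performance governor:

```shell
# Show the distinct CPU scaling governors in use on each master host
ansible eqiad_chi_masters -m shell \
  -a 'sort -u /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor'
```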

Icinga downtime and Alertmanager silence (ID=3bf2f233-6bed-47bf-b6d7-0aa2dd2951e8) set by bking@cumin2002 for 1:00:00 on 55 host(s) and their services with reason: investigate cluster quorum failure

cirrussearch[1068-1103,1107-1125].eqiad.wmnet

When we restart the currently elected master node, the production-search-eqiad cluster seems to fall over 100% of the time. (This issue has not manifested in the psi/omega clusters or in codfw.)

Logs show stuff like the following:

[2025-08-12T21:25:03,614][INFO ][o.o.c.c.JoinHelper       ] [cirrussearch1074-production-search-eqiad] failed to join {cirrussearch1081-production-search-eqiad}{86WNBGofQgG4pse5vSF7tw}{BtPY_wIDT_-QQtogpVwqqw}{10.64.32.166}{10.64.32.166:9300}{dimr}{hostname=cirrussearch1081, rack=C4, row=eqiad-row-c, shard_indexing_pressure_enabled=true, fqdn=cirrussearch1081.eqiad.wmnet} with JoinRequest{sourceNode={cirrussearch1074-production-search-eqiad}{__rTSNj8T92iTZa9k8DFgw}{deixoTWxRlGwu7FNR8h6Ew}{10.64.16.42}{10.64.16.42:9300}{dimr}{hostname=cirrussearch1074, rack=B2, fqdn=cirrussearch1074.eqiad.wmnet, row=eqiad-row-b, shard_indexing_pressure_enabled=true}, minimumTerm=221014, optionalJoin=Optional[Join{term=221014, lastAcceptedTerm=188677, lastAcceptedVersion=4008456, sourceNode={cirrussearch1074-production-search-eqiad}{__rTSNj8T92iTZa9k8DFgw}{deixoTWxRlGwu7FNR8h6Ew}{10.64.16.42}{10.64.16.42:9300}{dimr}{hostname=cirrussearch1074, rack=B2, fqdn=cirrussearch1074.eqiad.wmnet, row=eqiad-row-b, shard_indexing_pressure_enabled=true}, targetNode={cirrussearch1081-production-search-eqiad}{86WNBGofQgG4pse5vSF7tw}{BtPY_wIDT_-QQtogpVwqqw}{10.64.32.166}{10.64.32.166:9300}{dimr}{hostname=cirrussearch1081, rack=C4, row=eqiad-row-c, shard_indexing_pressure_enabled=true, fqdn=cirrussearch1081.eqiad.wmnet}}]}
org.opensearch.transport.RemoteTransportException: [cirrussearch1081-production-search-eqiad][10.64.32.166:9300][internal:cluster/coordination/join]

OpenSearch rapidly and repeatedly emits these types of log messages, as described in https://phabricator.wikimedia.org/T400160#11032897. Currently we've been restoring the cluster by restarting different master nodes until the cluster returns to being happy again (it feels like there's an element of randomness here).

We're a bit lost on next steps at the moment. We need a deeper understanding of the leader-election process and of what the term number means.
Open question: does the term number represent a successful election having occurred, or is it a counter of election attempts? In other words, is the node failing to join because it thinks it should be master but a newer master has since been elected, or has there been no successful election at all because elections are cycling so rapidly that not enough hosts can participate in any one of them?
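For reference, Raft-style protocols (OpenSearch's coordination layer uses a similar term concept) conventionally bump the term on every election *attempt*, not only on wins, so a large term by itself would not imply many successful elections. A toy Python model of that convention (not OpenSearch code, and only valid if OpenSearch follows the usual Raft-style behavior):

```python
class Candidate:
    """Toy model of a Raft-style candidate node."""

    def __init__(self):
        self.term = 0

    def start_election(self, votes, quorum):
        # The term is incremented on every attempt, win or lose;
        # the election only succeeds with a quorum of votes.
        self.term += 1
        return votes >= quorum

node = Candidate()
# Three failed attempts and one success still advance the term 4 times.
won = [node.start_election(votes=v, quorum=3) for v in (1, 2, 1, 3)]
print(node.term, won)  # 4 [False, False, False, True]
```

Under that reading, the large gap between lastAcceptedTerm=188677 and term=221014 in the join failure above could reflect tens of thousands of failed attempts rather than tens of thousands of masters.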

bking changed the task status from Open to In Progress.Aug 19 2025, 6:02 PM

Yesterday, we restarted all clusters to re-enable our logstash pipeline (ref T395571). Predictably, this triggered the same quorum failure. Interestingly, it fixed itself after about 25 minutes.

Approx time of outage based on logs: Aug 18th, 21:23:05-21:48:17 UTC


I collected the start timestamp of the nodes here: https://docs.google.com/spreadsheets/d/1Qwrf6_8ZxCptIGFXE26deFSWNu_AgDVbA4XrgoxDHUU/edit?usp=sharing

During the failure I noted (from the hot threads):

  • on 1074 (the previous and new master):
app//org.opensearch.cluster.coordination.PublicationTransportHandler.serializeFullClusterState(PublicationTransportHandler.java:293)
app//org.opensearch.cluster.coordination.PublicationTransportHandler.access$100(PublicationTransportHandler.java:79)
app//org.opensearch.cluster.coordination.PublicationTransportHandler$PublicationContext.buildDiffAndSerializeStates(PublicationTransportHandler.java:342)
app//org.opensearch.cluster.coordination.PublicationTransportHandler.newPublicationContext(PublicationTransportHandler.java:284)
app//org.opensearch.cluster.coordination.Coordinator.publish(Coordinator.java:1285)
app//org.opensearch.node.Node$$Lambda$2859/0x00000008409e8040.publish(Unknown Source)
app//org.opensearch.cluster.service.MasterService.publish(MasterService.java:303)
app//org.opensearch.cluster.service.MasterService.runTasks(MasterService.java:285)
app//org.opensearch.cluster.service.MasterService.access$000(MasterService.java:86)
app//org.opensearch.cluster.service.MasterService$Batcher.run(MasterService.java:173)
app//org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:174)
app//org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:212)
app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:733)
app//org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:275)
app//org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:238)
java.base@11.0.28/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
java.base@11.0.28/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
java.base@11.0.28/java.lang.Thread.run(Thread.java:829)

(Reading the code, this serializes the full cluster state in memory in preparation for publishing it.)

  • on other master eligible nodes:
app//org.opensearch.cluster.node.DiscoveryNodeFilters.match(DiscoveryNodeFilters.java:228)
app//org.opensearch.cluster.routing.allocation.decider.FilterAllocationDecider.shouldClusterFilter(FilterAllocationDecider.java:244)
app//org.opensearch.cluster.routing.allocation.decider.FilterAllocationDecider.shouldAutoExpandToNode(FilterAllocationDecider.java:144)
app//org.opensearch.cluster.routing.allocation.decider.AllocationDeciders.shouldAutoExpandToNode(AllocationDeciders.java:166)
app//org.opensearch.cluster.metadata.AutoExpandReplicas.getDesiredNumberOfReplicas(AutoExpandReplicas.java:141)
app//org.opensearch.cluster.metadata.AutoExpandReplicas.getAutoExpandReplicaChanges(AutoExpandReplicas.java:183)
app//org.opensearch.cluster.routing.allocation.AllocationService.adaptAutoExpandReplicas(AllocationService.java:335)
app//org.opensearch.cluster.coordination.JoinTaskExecutor.execute(JoinTaskExecutor.java:225)
app//org.opensearch.cluster.coordination.JoinHelper$1.execute(JoinHelper.java:161)
app//org.opensearch.cluster.service.MasterService.executeTasks(MasterService.java:804)
app//org.opensearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:378)
app//org.opensearch.cluster.service.MasterService.runTasks(MasterService.java:249)

(Which is, I think, applying the node bans while determining the number of replicas for indices with auto_expand_replicas set, i.e. most of our indices.)

  • other nodes: I checked a couple and they were mostly idle (nothing in hot threads)
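Since serializeFullClusterState dominates the master's hot threads, it may be worth measuring how large the cluster state actually is; a big state makes the full (non-diff) publication after each re-election expensive. A rough sketch (localhost:9200 assumed):

```shell
# Size of the full cluster state in bytes:
curl -s 'localhost:9200/_cluster/state' | wc -c

# The coordination metadata also shows the current term and the
# voting configuration:
curl -s 'localhost:9200/_cluster/state/metadata?filter_path=metadata.cluster_coordination'
```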

The cluster settings in eqiad contain 39 banned node names. It'd be concerning if this were the cause, but it's one of the differences I spotted between eqiad and codfw.
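Assuming the bans live in the allocation filter settings, something like this would show and clear them (a sketch; the exact key could be `_name`, `_ip`, or `_host`, and the setting could be persistent or transient):

```shell
# Show any allocation-exclusion filters present in the cluster settings:
curl -s 'localhost:9200/_cluster/settings?flat_settings=true' \
  | grep -o '"cluster.routing.allocation.exclude[^,}]*'

# Clear a stale node-name exclusion list by nulling the setting:
curl -s -X PUT 'localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"cluster.routing.allocation.exclude._name": null}}'
```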

Mentioned in SAL (#wikimedia-operations) [2025-08-21T21:26:51Z] <bking@cumin1002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 55 hosts with reason: T400160

After clearing the banned node state mentioned by @dcausse above, we successfully failed over the active master 3 times without any quorum issues. Many thanks to him for going above and beyond with his troubleshooting!

This issue also highlights problems with config drift in the cirrussearch clusters. I've added an item to T399900 to address this.