We have seen three cluster quorum failures on the primary eqiad CirrusSearch cluster (chi), two of which were user-impacting.
The times were:
- 2025-07-07 17:10-17:40 UTC (ref T398856)
- 2025-07-21 23:52 - 2025-07-22 00:17:45
- 2025-07-23 22:39:01 - 22:43:55 <- appeared to clear on its own
Creating this ticket to document our investigations.
Some observations:
- It appears that restarting the active master is enough to trigger the cluster quorum loss.
- While the cluster is down, alerts fire in #wikimedia-traffic, such as "FermMSS: Unexpected MSS value on 10.2.2.30:9200 @ cirrussearch1122". These only ever seem to trigger on the master hosts, which is interesting.
- Other clusters and environments do not seem to be affected.
- There were network issues affecting the row E and F cirrussearch hosts (T393911). However, the smaller clusters also have master hosts in rows E and F, and those clusters have not had these quorum failures.
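
For anyone trying to reproduce the first observation, here is a rough sketch of how one could watch for a missing elected master while restarting the active master. It polls the `_cat/master` API and reports the elected master's node name, or `None` when no master is elected. The base URL, the function names, and the handling of a `-` placeholder are assumptions for illustration, not something taken from our tooling; the exact response during quorum loss may also differ between versions (a 503 error is assumed here).

```python
import json
import urllib.error
import urllib.request


def classify_master_response(status, body):
    """Return the elected master's node name, or None if no master is elected.

    Assumes the JSON shape of GET /_cat/master?format=json: a list with one
    row containing a "node" field. Any non-200 status (for example a 503
    during quorum loss) is treated as "no elected master".
    """
    if status != 200:
        return None
    rows = json.loads(body)
    if not rows:
        return None
    node = rows[0].get("node")
    # Defensive assumption: treat an empty or "-" placeholder as no master.
    if not node or node == "-":
        return None
    return node


def current_master(base_url, timeout=5.0):
    """Query one node for the elected master; None means no master is visible."""
    url = base_url.rstrip("/") + "/_cat/master?format=json"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8")
            return classify_master_response(resp.status, body)
    except urllib.error.HTTPError as err:
        # The server answered with an error status, e.g. 503.
        return classify_master_response(err.code, "")
    except OSError:
        # Connection refused, timeout, DNS failure, and similar.
        return None
```

Calling `current_master("http://localhost:9200")` (a placeholder endpoint) in a loop with timestamps, while restarting the active master, should show how long the cluster goes without an elected master.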