
figure out a deterministic way to tell if a rabbitmq cluster is partitioned
Open, Low, Public

Description

As of today, if the rabbitmq cluster is in split-brain mode (partitioned), we mostly notice only because its clients (i.e., the openstack services) misbehave.

It would be good to define a set of checks and patterns that help us determine, deterministically, whether the cluster is healthy (beyond what rabbitmqctl can tell us, which is not enough).

Having such a thing would help us build monitoring and automation.

Event Timeline

I don't know if this is easily transferable to our systems, but https://opendev.org/openstack/charm-rabbitmq-server/commit/0653c186cecf720c522353da6169a2ecf05d3284 is titled "Rabbitmq metrics and splitbrain detection" and includes this config snippet:

files/prom_rule_rmq_splitbrain.yaml
- alert: RabbitMQ_split_brain
# detect if rabbitmq_queues is different between rabbitmq nodes
  expr: count(count(rabbitmq_queues) by (job)) > 1
  for: 5m
  labels:
    severity: page
    application: rabbitmq-server
  annotations:
    description: RabbitMQ split brain detected
    summary: RabbitMQ split brain detected
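
For reference, the idea stated in the rule's comment (the nodes disagree about how many queues exist) can be sketched outside Prometheus as below. This is only an illustration, not what the charm runs: the function names, the node names, and the input shape (a mapping from node name to that node's `rabbitmq_queues` value) are all assumptions made for the example. Note that, as written, the inner `count(...) by (job)` in the rule yields one series per `job` label, so the outer `count` is the number of distinct `job` values; whether that maps to "one value per node" depends on how the scrape jobs are set up in that charm, which this task does not show.

```python
def distinct_queue_counts(samples):
    """Return how many different queue totals the nodes report.

    `samples` is a hypothetical mapping of node name -> value of that
    node's `rabbitmq_queues` gauge at one point in time.
    """
    return len(set(samples.values()))


def looks_partitioned(samples):
    """True when at least two nodes disagree on the number of queues."""
    return distinct_queue_counts(samples) > 1


if __name__ == "__main__":
    # Hypothetical node names and values, for illustration only.
    healthy = {"node-a": 120, "node-b": 120, "node-c": 120}
    suspect = {"node-a": 120, "node-b": 120, "node-c": 87}
    print(looks_partitioned(healthy))
    print(looks_partitioned(suspect))
```

A check of this kind can only see a partition when the two sides end up with different queue totals. If both sides happen to report the same number, it stays silent, and a brief mismatch while queues are being created or deleted would need something like the rule's `for: 5m` to avoid a false alarm.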

Thanks @bd808

I've been playing with that metric

image.png (1×2 px, 109 KB)

I don't think it would have reliably detected the situation we had today. Or perhaps we didn't have a split brain after all?