Detect and alert on rabbitmq splitbrain/partition
Open, Needs Triage, Public

Description

On 2023-04-18 I discovered that rabbitmq was partitioned and cloudrabbit1003 was not talking to the other two hosts.

I discovered this through poor OpenStack behavior; there was no email or other alert about the issue. The Grafana dashboard for this service has a 'split brain' chart, but it doesn't show the outage (or rather, it shows the cluster as having been split-brained the whole time).

https://grafana.wikimedia.org/d/tn5yHr44k/wmcs-rabbitmq-health?orgId=1&from=now-7d&to=now&viewPanel=6

Event Timeline

Just FYI: the stats exported by rabbit itself (the ones we use) don't support this yet: https://github.com/rabbitmq/rabbitmq-server/issues/2508

Change 911927 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] rabbitmq: add a single-purpose metric to detect network partition

https://gerrit.wikimedia.org/r/911927

Change 911927 merged by Andrew Bogott:

[operations/puppet@production] rabbitmq: add a single-purpose metric to detect network partition

https://gerrit.wikimedia.org/r/911927
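
For readers unfamiliar with the approach, here is a minimal sketch of what a single-purpose partition metric could look like. It is not the script from the patch above. It assumes the check shells out to `rabbitmqctl cluster_status --formatter json`, reads the `partitions` field, and prints a Prometheus-style gauge. The metric name `rabbitmq_partition_detected` and all function names are hypothetical, and the exact shape of `partitions` (map or list of pairs) may vary by RabbitMQ version, so both are handled.

```python
import json
import subprocess


def count_partitioned_peers(status: dict) -> int:
    """Count the peers that the local node reports as partitioned away."""
    partitions = status.get("partitions") or {}
    if isinstance(partitions, dict):
        # Assumed map form: {"rabbit@a": ["rabbit@b", ...]}
        groups = list(partitions.values())
    else:
        # Assumed list form: [["rabbit@a", ["rabbit@b", ...]], ...]
        groups = [peers for _node, peers in partitions]
    return sum(len(peers) for peers in groups)


def render_metric(status: dict, name: str = "rabbitmq_partition_detected") -> str:
    """Render a 0/1 gauge in Prometheus text format (metric name is hypothetical)."""
    value = 1 if count_partitioned_peers(status) > 0 else 0
    return f"# TYPE {name} gauge\n{name} {value}\n"


def read_cluster_status() -> dict:
    """Ask the local broker for its cluster status as JSON (needs rabbitmqctl on PATH)."""
    result = subprocess.run(
        ["rabbitmqctl", "cluster_status", "--formatter", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

On a broker host, `render_metric(read_cluster_status())` would produce the text to hand to whatever collector scrapes it. Keeping the parsing separate from the `rabbitmqctl` call makes the logic testable without a running cluster.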

Change 912291 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] detect_rabbit_partition: make executable

https://gerrit.wikimedia.org/r/912291

Change 912291 merged by Andrew Bogott:

[operations/puppet@production] detect_rabbit_partition: make executable

https://gerrit.wikimedia.org/r/912291

Change 912331 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] detect_rabbit_partition: fix metric name and tag

https://gerrit.wikimedia.org/r/912331

Change 912331 merged by Andrew Bogott:

[operations/puppet@production] detect_rabbit_partition: fix metric name and tag

https://gerrit.wikimedia.org/r/912331

Change 912865 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/alerts@master] Add rabbitmq_network_partition alert

https://gerrit.wikimedia.org/r/912865

Change 912865 merged by jenkins-bot:

[operations/alerts@master] Add rabbitmq_network_partition alert

https://gerrit.wikimedia.org/r/912865

Change 913957 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/alerts@master] rabbitmq_network_partition: move the rabbitmq alert from 'cloud' to 'eqiad'

https://gerrit.wikimedia.org/r/913957

Change 913957 merged by jenkins-bot:

[operations/alerts@master] rabbitmq_network_partition: move the rabbitmq alert from 'cloud' to 'eqiad'

https://gerrit.wikimedia.org/r/913957