Maniphest T133179

Redis monitoring needs to be improved
Open, Medium; Public
Assigned To

None

Authored By

	Joe
	Apr 20 2016, 5:17 PM

Tags

Referenced Files

None

Subscribers

Description

Redis monitoring and alarming could be better:

We collect data with Diamond, but apart from an occasional Grafana dashboard we do nothing with it. We should examine it for trends and hotspots, and start alarming on those.
Our current replication monitoring is pretty lame, as it falls victim to the Great Puppet Monitoring Race Condition: for Redis instances in a multi-DC setup, when we switch the replication flow from one site to the other, the flow is inverted by the Puppet run on the hosts. Until those hosts have completed their runs AND Puppet has completed running on the monitoring host, there is a discrepancy between what we're testing for and what is actually configured. This results in a ton of false positives that we want to avoid.
We probably want to alarm not just on trends recorded in Grafana, but also on other conditions such as I/O starvation, swarms of connections, etc.
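
One possible way around the race described above is to have the check judge each instance by the role it reports for itself, rather than by a role the monitoring host was told to expect. The sketch below is not the existing check and not what the merged patch does; it is only an illustration in Python. It parses the text returned by the Redis INFO command and relies on the fields role, master_host, master_link_status, connected_slaves and connected_clients. The function names and the warn_at threshold are hypothetical.

```python
from typing import Dict, Tuple


def parse_info(raw: str) -> Dict[str, str]:
    """Parse the text returned by the Redis INFO command into a dict."""
    fields: Dict[str, str] = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Replication".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def check_replication(info: Dict[str, str]) -> Tuple[int, str]:
    """Return a Nagios-style (exit_code, message) pair.

    The expected role is never passed in: a master is accepted as a
    master, and a replica is only required to have a working link to
    whatever master it is currently configured with.
    """
    role = info.get("role")
    if role == "master":
        replicas = info.get("connected_slaves", "0")
        return 0, "OK: master with %s connected replicas" % replicas
    if role == "slave":
        master = info.get("master_host", "unknown")
        link = info.get("master_link_status", "unknown")
        if link == "up":
            return 0, "OK: replica of %s, link up" % master
        return 2, "CRITICAL: replica of %s, link %s" % (master, link)
    return 3, "UNKNOWN: no role found in INFO output"


def check_clients(info: Dict[str, str], warn_at: int) -> Tuple[int, str]:
    """Warn when connected_clients reaches warn_at (hypothetical threshold)."""
    clients = int(info.get("connected_clients", "0"))
    if clients >= warn_at:
        return 1, "WARNING: %d connected clients (threshold %d)" % (clients, warn_at)
    return 0, "OK: %d connected clients" % clients
```

The trade-off of this approach is that it no longer detects an instance replicating from the wrong master; it only detects a broken link. Whether that is acceptable during a site switch is a judgment call for this task.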

See also:

Details

	Subject	Repo	Branch	Lines +/-
	redis::monitoring::instance: partially disable replication checks	operations/puppet	production	+3 -2

Related Objects

Mentioned Here: T79668: Need to monitor JobQueue especially to check if it is stuck in a futex deadlock
T79687: Establish monitoring thresholds for job queue

Event Timeline

Joe created this task. Apr 20 2016, 5:17 PM

Restricted Application added subscribers: TerraCodes, Aklapper. Apr 20 2016, 5:17 PM

Joe claimed this task. Apr 20 2016, 5:17 PM

Joe triaged this task as Medium priority.

Change 284489 had a related patch set uploaded (by Giuseppe Lavagetto):
redis::monitoring::instance: partially disable replication checks

https://gerrit.wikimedia.org/r/284489

gerritbot added a project: Patch-For-Review. Apr 20 2016, 5:18 PM

Change 284489 merged by Giuseppe Lavagetto:
redis::monitoring::instance: partially disable replication checks

https://gerrit.wikimedia.org/r/284489

Joe removed Joe as the assignee of this task. Apr 27 2016, 10:58 AM

Krinkle merged a task: T79687: Establish monitoring thresholds for job queue. Jul 7 2017, 9:11 PM

Krinkle removed a project: Patch-For-Review.

Krinkle updated the task description.

Krinkle added a subscriber: • RobLa-WMF.

• Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board. Jan 26 2019, 8:37 PM

fgiunchedi moved this task from Inbox to Radar on the observability board. Jul 20 2020, 1:15 PM