Monitor rdb hosts for memory/disk usage (redis_lock, aka redis_misc)
Closed, Resolved (Public)

Description

We need to monitor redis disk and memory usage for anomalies, and alert on them.

https://wikitech.wikimedia.org/wiki/Incident_documentation/20150825-Redis

Event Timeline

Joe raised the priority of this task from to Needs Triage.
Joe updated the task description.
Joe added projects: acl*sre-team, observability.
Joe subscribed.
akosiaris subscribed.

This follow-up task from an incident report has not been updated recently. If it is no longer valid, please add a comment explaining why. If it is still valid, please prioritize it appropriately relative to your other work. If you have any questions, feel free to ask me (Greg Grossmeier).

Joe removed Joe as the assignee of this task. Oct 3 2016, 2:08 PM
Joe added a project: User-Joe.

To clarify, this task and the linked incident are about the rdb* hosts. These are known to MW as redis_lock and in monitoring as redis_misc. I'm mentioning this because it means the task remains relevant after T267581: Phase out "redis_sessions" cluster and away from memcached cluster.

Krinkle renamed this task from Monitor redis memory/disk usage to Monitor rdb hosts for memory/disk usage (redis_lock, aka redis_misc). Aug 23 2022, 2:16 PM

Removing SRE, triaging to serviceops. redis_misc is in our care as a team and we should decide what we want to do regarding better monitoring of it.

We're already alerting on disk space for all servers, so I'm not sure why these hosts would be different.

Adding an alert on full memory for the redis datastore can work, but it needs silencing for the instances used by ORES.

I'd keep it simple instead of making the whole thing overly complex, thus just alerting on

redis_memory_used_bytes / redis_memory_max_bytes > 0.98

and manually silencing the three ores-related instances.
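
For illustration only, here is a minimal Python sketch of the check proposed above: compare used memory to the configured maximum per instance, and skip a manually maintained set of silenced instances. It is not the content of any merged patch. The `RedisSample` type, the `memory_full` function, and the instance names are hypothetical. The sketch also skips instances that report a maximum of 0 (assumed here to mean no limit is configured), which is a choice of this example rather than something stated in this thread.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Ratio taken from the expression in the comment above.
THRESHOLD = 0.98


@dataclass(frozen=True)
class RedisSample:
    """One scrape of a redis instance (hypothetical container type)."""
    instance: str      # hypothetical label, e.g. "rdb-example:6379"
    used_bytes: float  # value of redis_memory_used_bytes
    max_bytes: float   # value of redis_memory_max_bytes


def memory_full(
    samples: Iterable[RedisSample],
    silenced: Iterable[str] = (),
    threshold: float = THRESHOLD,
) -> List[str]:
    """Return the instances whose used/max ratio exceeds the threshold."""
    silenced_set = set(silenced)
    firing = []
    for sample in samples:
        if sample.instance in silenced_set:
            # Manually silenced, like the ORES-related instances.
            continue
        if sample.max_bytes <= 0:
            # Assumption of this sketch: 0 means no limit, so no ratio.
            continue
        if sample.used_bytes / sample.max_bytes > threshold:
            firing.append(sample.instance)
    return firing
```

Calling `memory_full(samples, silenced={"ores-example:6380"})` would then list only the non-silenced instances above 98% of their limit.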

Change 901141 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/alerts@master] sre: add redis memory full alert

https://gerrit.wikimedia.org/r/901141

Change 901141 merged by jenkins-bot:

[operations/alerts@master] sre: add redis memory full alert

https://gerrit.wikimedia.org/r/901141