Monitor rdb hosts for memory/disk usage (redis_lock, aka redis_misc)
Closed, Resolved, Public

Assigned To

Authored By

	Joe
	Aug 25 2015, 9:12 AM

Description

We need to monitor Redis disk and memory usage for anomalies and alert on them.

https://wikitech.wikimedia.org/wiki/Incident_documentation/20150825-Redis

Details

	Subject	Repo	Branch	Lines +/-
	sre: add redis memory full alert	operations/alerts	master	+53 -0

Related Objects

Mentioned Here: T267581: Phase out "redis_sessions" cluster and away from memcached cluster
T148637: Port redis statistics to Prometheus

Event Timeline

Joe created this task. Aug 25 2015, 9:12 AM

Joe raised the priority of this task to Needs Triage.

Joe updated the task description.

Joe added projects: acl*sre-team, observability.

Joe subscribed.

Restricted Application added subscribers: Matanya, Aklapper. Aug 25 2015, 9:12 AM

akosiaris triaged this task as High priority. Aug 25 2015, 12:44 PM

akosiaris subscribed.

• chasemp added a project: Incident-20150825-Redis. Aug 25 2015, 12:56 PM

• chasemp set Security to None.

Joe claimed this task. May 17 2016, 7:15 AM

greg added a project: Wikimedia-Incident. Jul 28 2016, 10:17 PM

greg moved this task from Active investigation to Follow-up prevention on the Wikimedia-Incident board. Jul 28 2016, 10:18 PM

This follow-up task from an incident report has not been updated recently. If it is no longer valid, please add a comment explaining why. If it is still valid, please prioritize it appropriately relative to your other work. If you have any questions, feel free to ask me (Greg Grossmeier).

Joe removed Joe as the assignee of this task. Oct 3 2016, 2:08 PM

Joe added a project: User-Joe.

Adding a reference to the outage mentioned:

https://wikitech.wikimedia.org/wiki/Incident_documentation/20150825-Redis

elukey unsubscribed. Oct 19 2016, 12:55 PM

elukey subscribed.

I created https://phabricator.wikimedia.org/T148637 to add Redis metrics to Prometheus.

Should we merge this with T148637?

• Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board. Jan 26 2019, 8:16 PM

jijiki added a project: User-jijiki. Feb 27 2019, 7:44 PM

jijiki moved this task from Incoming🐅 to Radar 📻 on the User-jijiki board. Apr 4 2019, 9:25 PM

greg unsubscribed. Apr 4 2019, 9:38 PM

Krinkle edited projects, added Sustainability (Incident Followup); removed Wikimedia-Incident. Apr 28 2020, 9:50 PM

fgiunchedi moved this task from Inbox to Radar on the observability board. Jul 20 2020, 1:15 PM

Krinkle updated the task description. Sep 28 2021, 8:47 PM

To clarify, this task and the linked incident are about the rdb* hosts. These are known to MW as redis_lock and in monitoring as redis_misc. I'm mentioning this because it means the task remains relevant after T267581: Phase out "redis_sessions" cluster and away from memcached cluster.

Krinkle renamed this task from Monitor redis memory/disk usage to Monitor rdb hosts for memory/disk usage (redis_lock, aka redis_misc). Aug 23 2022, 2:16 PM

Removing SRE, triaging to serviceops. redis_misc is in our care as a team and we should decide what we want to do regarding better monitoring of it.

Clement_Goubert moved this task from Incoming 🐫 to 💾 Datastores on the serviceops board. Mar 15 2023, 11:53 AM

We're already alerting on disk space for all servers, so I'm not sure why this would be different.

Adding an alert on full memory for the redis datastore can work, but it needs silencing for the instances used by ORES.

I'd keep it simple rather than making the whole thing overly complex, and just alert on

redis_memory_used_bytes / redis_memory_max_bytes > 0.98

and manually silencing the three ores-related instances.
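
For illustration, the expression above could be wrapped in a Prometheus alerting rule roughly like the one below. This is only a sketch under stated assumptions, not the contents of any actual patch: the group name, alert name, `for` duration, labels, and annotation text are all hypothetical. The `> 0` filter on the denominator is also an assumption, added on the expectation that an instance with no memory limit configured reports a maximum of 0 and would otherwise divide by zero.

```yaml
groups:
  - name: redis_memory            # hypothetical group name
    rules:
      - alert: RedisMemoryFull    # hypothetical alert name
        # Fire when an instance uses more than 98% of its configured maximum.
        # The "> 0" filter (an assumption) drops instances that report no limit,
        # so they never enter the division.
        expr: redis_memory_used_bytes / (redis_memory_max_bytes > 0) > 0.98
        for: 5m                   # assumed duration, to avoid flapping
        labels:
          severity: critical      # assumed label values
          team: serviceops
        annotations:
          summary: "Redis memory almost full on {{ $labels.instance }}"
```

The ORES-related instances are not excluded in the expression itself; as suggested above, they would still be silenced manually, which keeps the rule simple.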

Joe claimed this task. Mar 20 2023, 10:44 AM

Joe added a project: SRE-Sprint-Week-Sustainability-March2023.

Joe moved this task from Backlog to Doing on the SRE-Sprint-Week-Sustainability-March2023 board.

Change 901141 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/alerts@master] sre: add redis memory full alert

https://gerrit.wikimedia.org/r/901141

gerritbot added a project: Patch-For-Review. Mar 20 2023, 11:01 AM

Change 901141 merged by jenkins-bot:

[operations/alerts@master] sre: add redis memory full alert

https://gerrit.wikimedia.org/r/901141

Maintenance_bot removed a project: Patch-For-Review. Mar 21 2023, 1:11 PM

Joe closed this task as Resolved. Mar 21 2023, 2:30 PM

Joe moved this task from Doing to Done on the SRE-Sprint-Week-Sustainability-March2023 board. Mar 24 2023, 9:59 AM
