Add alerting for Memcached timeout errors
Closed, Resolved (Public)

Description

During a recent incident, it was noticed that memcached timeout errors do not seem to trigger an alarm. Can we investigate and add metrics and alerts?

Event Timeline

jbond triaged this task as Medium priority. Mar 31 2021, 12:51 PM
jbond created this task.
akosiaris subscribed.

Removing SRE; this task has already been triaged to a more specific SRE subteam.

Joe moved this task from Backlog to Doing on the SRE-Sprint-Week-Sustainability-March2023 board.
Joe subscribed.

This task is so sparse, and so much time has passed, that I'm not sure what the point is here.

Do we want to alert on the number of memcached errors from mediawiki? I will assume that's the case.

We already added such an alert (ported from check_prometheus), which is also broken down by cluster. It was introduced in https://gerrit.wikimedia.org/r/c/operations/alerts/+/883950

I'll close this task for now.