
Log unactionable errors to statslib/prometheus and set alert instead of using logstash
Open, Needs Triage (Public)

Description

We actively log some errors to the GrowthExperiments channel in logstash that are inherently not actionable for us. This is suboptimal for two reasons: it creates a lot of noise in our logstash dashboard, and it gives very little visibility into changes in the rate of those errors.

I think that this affects the following errors:

  • Search error: We could not complete your search due to a temporary problem. Please try again later.
  • Link suggestion not found for "{parameter1}
  • No recommendation found for page: {parameter1} (T366010)
  • Probably also Failed to load site edits per day stat: {status}, when the status is "connection timeout"

Acceptance criteria:
For all three errors:

  • the error is no longer logged to logstash
  • the error is recorded as a metric in statslib/prometheus
  • there is a Grafana dashboard that shows a panel with the number of those errors in some sensible interval
  • there is an alert on that dashboard that sends out an email if a given threshold is exceeded
  • glancing at that dashboard is part of the Growth Team chores
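
The first two criteria amount to routing known-unactionable errors to a counter instead of the log channel. A minimal sketch of that routing follows, written in Python purely for illustration. The error keys, `report_error`, and the in-memory `error_counts` are all hypothetical stand-ins, not the GrowthExperiments code or the real stats library API.

```python
import logging
from collections import Counter

logger = logging.getLogger("GrowthExperiments")

# Hypothetical keys for the error messages listed in this task.
UNACTIONABLE_ERRORS = {
    "search_temporary_problem",
    "link_suggestion_not_found",
    "no_recommendation_found",
}

# Stand-in for a metrics client (assumption); a real deployment would
# increment a counter in the stats library instead of this Counter.
error_counts = Counter()


def report_error(error_key: str, message: str) -> None:
    """Count unactionable errors as metrics; log everything else as before."""
    if error_key in UNACTIONABLE_ERRORS:
        error_counts[error_key] += 1
        return
    logger.error(message)
```

With this shape, the dashboard panel and the alert would both read from the per-key counter, and the log channel would only receive errors someone can act on.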

Open questions:

  • what should the threshold be? (needs to be defined for each metric separately)
  • who should receive that alert if the threshold is exceeded?
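
Since the threshold needs to be defined per metric, the alert check could be expressed as a lookup against a per-metric table. The sketch below is only an illustration in Python: the metric names and `metrics_over_threshold` are hypothetical, and the numbers are placeholders, because the real thresholds are still an open question.

```python
from typing import Dict, List

# Placeholder thresholds in errors per interval (assumption): the real
# values have not been decided and must be set per metric.
THRESHOLDS: Dict[str, int] = {
    "search_temporary_problem": 100,
    "link_suggestion_not_found": 50,
    "no_recommendation_found": 50,
}


def metrics_over_threshold(
    counts_per_interval: Dict[str, int],
    thresholds: Dict[str, int],
) -> List[str]:
    """Return the names of metrics whose count exceeds their own threshold."""
    return sorted(
        name
        for name, count in counts_per_interval.items()
        if name in thresholds and count > thresholds[name]
    )
```

In practice this comparison would more likely live in an alert rule on the dashboard than in application code. The sketch only shows that each metric is compared against its own limit.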

Notes:

Event Timeline

Michael renamed this task from Log unactionable errors to statsd/graphite and set alert instead of using logstash to Log unactionable errors to statslib/prometheus and set alert instead of using logstash. Sep 25 2024, 9:18 AM
Michael updated the task description.
lmata moved this task from Inbox to Radar on the observability board.
lmata moved this task from Inbox to Radar on the SRE Observability board.