We actively log some errors to the GrowthExperiments channel in logstash that are inherently not actionable for us. This is suboptimal for two reasons: it creates a lot of useless noise in our logstash dashboard, and it gives us very little visibility into changes in the error rate of those errors.
I think that this affects the following errors:
- Search error: We could not complete your search due to a temporary problem. Please try again later.
- Link suggestion not found for "{parameter1}"
- No recommendation found for page: {parameter1} (T366010)
- Probably also "Failed to load site edits per day stat: {status}", for the "connection timeout" status
Acceptance criteria:
For each of these errors:
- the error is no longer logged to logstash
- the error is logged to statsd/graphite instead (see the sketch after this list)
- there is a Grafana dashboard with a panel showing the number of those errors over some sensible interval
- there is an alert on that dashboard that sends an email if a given threshold is exceeded
- glancing at that dashboard is part of the Growth Team chores
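A minimal sketch of what the code change could look like, assuming we keep using MediaWiki's statsd interface (`IBufferingStatsdDataFactory`). The metric keys and the `$this->logger` property are hypothetical examples for illustration, not existing names in the codebase:

```php
use MediaWiki\MediaWikiServices;

// Before: unconditionally logged to the GrowthExperiments logstash channel.
// $this->logger->error( 'Link suggestion not found for "{title}"', [ 'title' => $title ] );

// After: count the occurrence in statsd/graphite instead. The metric key is a
// made-up example; the real key needs to be agreed on per error type.
$stats = MediaWikiServices::getInstance()->getStatsdDataFactory();
$stats->increment( 'growthexperiments.addlink.suggestion_not_found' );

// For "Search error: {message}" we would only downgrade the known-unactionable
// message; everything else keeps going to logstash (see the Notes below).
if ( $message === 'We could not complete your search due to a temporary problem.'
	. ' Please try again later.'
) {
	$stats->increment( 'growthexperiments.search.temporary_search_error' );
} else {
	$this->logger->error( 'Search error: {message}', [ 'message' => $message ] );
}
```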
Open questions:
- what should the threshold be? (needs to be defined for each metric separately)
- who should receive that alert if the threshold is exceeded?
Notes:
- For the "Search error: {message}" case, this for now applies only to the specific error message quoted above. There are other tasks with other error messages that might be actionable for us, so we must match on the specific message or some other specific identifier, not just on the "Search error:" prefix.
- See also T328128: Reduce noise in Growth team's Logstash dashboard, which describes the general problem, and T328129: Implement alerting when spikes occur in Growth team dashboard in Logstash, which asks for alerting on the logstash errors that should actually stay there.
- There already seems to be some logging of the overall volume per channel from logstash to a resource accessible from Grafana: https://wikimedia.slack.com/archives/C05H0JYT85V/p1726067552071909 - maybe we can make use of that somehow?
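For the Grafana panel, a target along these lines might work, assuming MediaWiki's default `MediaWiki.` statsd prefix and the hypothetical metric key from the sketch above; the 15-minute interval is a placeholder pending the open questions:

```
alias(summarize(MediaWiki.growthexperiments.addlink.suggestion_not_found.count, '15min', 'sum', false), 'link suggestion not found / 15min')
```

The alert would then be a plain threshold condition on that query (e.g. "above N over the last 15 minutes"), with the email notification channel pointed at whoever the open questions settle on.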