Per IRC conversation with @fgiunchedi, this graphite-based alert has been in UNKNOWN state for the last 22 days as of this writing.
Creating this ticket to:
- Investigate the root cause of missing metrics
- Fix or remove this alert.
Per 1:1 with @dcausse, the healthcheck's metric "MediaWiki.CirrusSearch.eqiad.backend_failure.memory_issue.count" has aged out of Graphite because the alert conditions haven't been met for more than 6 months.
However, we still need to alert on these conditions. David suggested switching to a logstash alert, which seems reasonable. We will discuss this further with @EBernhardson and the rest of the team before taking action. In the meantime, I've silenced this alert for the next 21 days.
Thank you for the quick update. From a very quick look, I concur that switching to logstash-based metrics/alerts is the right thing to do here. I believe the switch will actually cover two alerts out of modules/role/manifests/elasticsearch/alerts.pp:
monitoring::graphite_threshold { 'search_backend_failure_count':
monitoring::graphite_threshold { 'search_backend_memory_issue_count':
Whereas the other alerts in the file are based on MediaWiki metrics themselves rather than "events" we can infer from logs, is that correct?
Per IRC conversation with cwhite: "We usually generate metrics from logs then alert on those metrics. An example can be found in alerts/team-sre/mediawiki.yaml log_mediawiki_servergroup_level_channel_doc_count"
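To make the "metrics from logs, then alert on those metrics" pattern concrete, here is a rough sketch of what such a rule might look like in Prometheus alerting-rule syntax. Only the metric name comes from the example cwhite pointed at. The group name, alert name, label values (`channel="CirrusSearch"`, `level="ERROR"`), threshold, and durations are all assumptions for illustration and have not been checked against alerts/team-sre/mediawiki.yaml. It also assumes the metric behaves like a counter.

```yaml
# Hypothetical sketch only: names, label values, and thresholds are assumptions.
groups:
  - name: cirrussearch_backend_failures   # assumed group name
    rules:
      - alert: CirrusSearchBackendFailures   # assumed alert name
        # Metric name taken from the example above; the label selectors are
        # guesses based on the "servergroup_level_channel" part of the name.
        expr: >
          sum(rate(log_mediawiki_servergroup_level_channel_doc_count{channel="CirrusSearch", level="ERROR"}[5m])) > 0
        for: 10m             # assumed; tune to avoid flapping
        labels:
          severity: warning  # assumed
        annotations:
          summary: "CirrusSearch backend failures seen in logs"
```

Unlike the Graphite check, a rule of this shape should not go UNKNOWN just because the condition has not occurred for months: an absent series makes the expression return nothing, so the alert simply stays inactive.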
For reference, the full list of search-related graphite alerts: T350597: Audit and prioritize metrics for conversion to statslib that are used for graphite-based alerting