Audit and prioritize metrics for conversion to statslib that are used for graphite-based alerting
Closed, Resolved, Public

Description

This task tracks the conversion of mw metrics used for graphite-based alerting in Icinga.

modules/icinga/manifests/monitor/elasticsearch/cirrus_cluster_checks.pp

AM patch: https://gerrit.wikimedia.org/r/c/operations/alerts/+/1054317/2
Patch to remove from puppet: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1054647

  • mediawiki_cirrus_update_rate_${site}
  • mediawiki_cirrus_pool_counter_rejections_rate
    • CirrusSearch Extension
    • MediaWiki.CirrusSearch.poolCounter.*.failureMs.sample_rate
  • mediawiki_cirrussearch_indices_high_fix_rate
    • CirrusSearch Extension
    • MediaWiki.CirrusSearch.{eqiad,codfw,cloudelastic}.sanitization.fixed.sum
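
The `{eqiad,codfw,cloudelastic}` segment above is a Graphite value list, so that one entry stands for three concrete metric paths. As a minimal sketch of which paths need converting, the expansion can be spelled out as below; `expand_braces` is a hypothetical helper written for this illustration and is not part of Graphite, puppet, or the alerts repository.

```python
import itertools
import re


def expand_braces(pattern: str) -> list[str]:
    """Expand Graphite-style {a,b,c} value lists into concrete metric paths."""
    # Split into literal text and brace groups, keeping the groups:
    # "a.{x,y}.b" -> ["a.", "{x,y}", ".b"]
    parts = re.split(r"(\{[^{}]*\})", pattern)
    choices = [
        part[1:-1].split(",") if part.startswith("{") and part.endswith("}") else [part]
        for part in parts
    ]
    # One output path per combination of alternatives.
    return ["".join(combo) for combo in itertools.product(*choices)]


paths = expand_braces(
    "MediaWiki.CirrusSearch.{eqiad,codfw,cloudelastic}.sanitization.fixed.sum"
)
for path in paths:
    print(path)
```

The sketch only handles flat, non-nested value lists, which is all the metric names in this task use; wildcards such as the `*` in the poolCounter metric are left untouched.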

modules/profile/manifests/graphite/alerts.pp

  • mediawiki_session_loss
    • Core: EditPage->incrementEditFailureStats()
    • MediaWiki.edit.failures.session_loss.rate
  • mediawiki_bad_token
    • Core: EditPage->incrementEditFailureStats()
    • MediaWiki.edit.failures.bad_token.rate
  • mediawiki_centralauth_errors
  • mediawiki_accountcreation_errors

modules/role/manifests/elasticsearch/alerts.pp

AM patch: https://gerrit.wikimedia.org/r/c/operations/alerts/+/1054374/
Puppet cleanup patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1054647

  • cirrussearch_eqiad_fulltext_95th_percentile
    • CirrusSearch Extension
    • MediaWiki.CirrusSearch.eqiad.requestTimeMs.full_text.p95
  • cirrussearch_eqiad_compsuggest_95th_percentile
    • CirrusSearch Extension
    • MediaWiki.CirrusSearch.eqiad.requestTimeMs.comp_suggest.p95
    • MediaWiki.CirrusSearch.eqiad.requestTimeMs.comp_suggest.sample_rate
  • cirrussearch_eqiad_morelike_95th_percentile
    • CirrusSearch Extension
    • MediaWiki.CirrusSearch.eqiad.requestTimeMs.more_like.p95
  • cirrussearch_codfw_fulltext_95th_percentile
    • CirrusSearch Extension
    • MediaWiki.CirrusSearch.codfw.requestTimeMs.full_text.p95
  • cirrussearch_codfw_compsuggest_95th_percentile
    • CirrusSearch Extension
    • MediaWiki.CirrusSearch.codfw.requestTimeMs.comp_suggest.p95
    • MediaWiki.CirrusSearch.codfw.requestTimeMs.comp_suggest.sample_rate
  • cirrussearch_codfw_morelike_95th_percentile
    • CirrusSearch Extension
    • MediaWiki.CirrusSearch.codfw.requestTimeMs.more_like.p95
  • search_backend_failure_count (now using envoy telemetry; related: T355795: Fix "requests triggering circuit breakers" Elastic alert)
    • CirrusSearch Extension
    • MediaWiki.CirrusSearch.eqiad.backend_failure.failed.count

Event Timeline

Change 972356 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[mediawiki/core@master] EditPage.php: convert edit failures count to new Stats library

https://gerrit.wikimedia.org/r/972356

Change 972356 merged by jenkins-bot:

[mediawiki/core@master] EditPage.php: convert edit failures count to new Stats library

https://gerrit.wikimedia.org/r/972356

Change 991007 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: add mw edit failures alert

https://gerrit.wikimedia.org/r/991007

Change 991008 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: remove mw edit failures graphite alerts

https://gerrit.wikimedia.org/r/991008

Change 991007 merged by Filippo Giunchedi:

[operations/alerts@master] sre: add mw edit failures alert

https://gerrit.wikimedia.org/r/991007

Change 991008 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: remove mw edit failures graphite alerts

https://gerrit.wikimedia.org/r/991008

Change 993661 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: move MediaWikiEditFailures alert to global

https://gerrit.wikimedia.org/r/993661

Change 993661 merged by Filippo Giunchedi:

[operations/alerts@master] sre: move MediaWikiEditFailures alert to global

https://gerrit.wikimedia.org/r/993661

Change 994185 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[mediawiki/extensions/WikimediaEvents@master] AuthManager: increment Stats counters too

https://gerrit.wikimedia.org/r/994185

Change 994185 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] AuthManager: increment Stats counters too

https://gerrit.wikimedia.org/r/994185

EBernhardson updated the task description.
EBernhardson subscribed.

Not sure where to assign this, but from the search side our alerts are all moved over. The centralauth and account creation alerts appear to still exist in puppet.

Thank you @EBernhardson and team for this! I'll untag non-o11y teams now, since only the centralauth alerts remain to be migrated at this point.

Change #1071161 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] mediawiki: port login failures alert from icinga/statsd

https://gerrit.wikimedia.org/r/1071161

Change #1071165 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] mediawiki: port account creation failures alert from icinga/statsd

https://gerrit.wikimedia.org/r/1071165

Change #1071193 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: remove mw graphite-based alerts

https://gerrit.wikimedia.org/r/1071193

Change #1071161 merged by Filippo Giunchedi:

[operations/alerts@master] mediawiki: port login failures alert from icinga/statsd

https://gerrit.wikimedia.org/r/1071161

Change #1071165 merged by Filippo Giunchedi:

[operations/alerts@master] mediawiki: port account creation failures alert from icinga/statsd

https://gerrit.wikimedia.org/r/1071165

I have merged the ported mw alerts (thank you @Clement_Goubert!) and changed the referenced dashboard at https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts to show the Prometheus metrics.

Once https://gerrit.wikimedia.org/r/c/operations/puppet/+/1071193 ships we can call this task done!

Change #1071193 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: remove mw graphite-based alerts

https://gerrit.wikimedia.org/r/1071193

fgiunchedi updated the task description.

This is done! The only other use of graphite_threshold is 'zuul_gearman_wait_queue', which will be addressed as part of T233089: Export zuul metrics to Prometheus. There are of course still alerts on graphite itself, which will be removed together with graphite.

Change #1072657 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] team-sre: tweak MediaWikiLoginFailures threshold

https://gerrit.wikimedia.org/r/1072657

Change #1072657 merged by Filippo Giunchedi:

[operations/alerts@master] team-sre: tweak MediaWikiLoginFailures threshold

https://gerrit.wikimedia.org/r/1072657