Audit and prioritize metrics for conversion to statslib that are used for graphite-based alerting
Open, High, Public

Description

This task tracks the conversion to statslib of the MediaWiki metrics used for Graphite-based alerting in Icinga.

modules/icinga/manifests/monitor/elasticsearch/cirrus_cluster_checks.pp
  • mediawiki_cirrus_update_rate_${site}
    • CirrusSearch Extension
    • MediaWiki.CirrusSearch.${site}.updates.all.sent.rate
  • mediawiki_cirrus_pool_counter_rejections_rate
    • CirrusSearch Extension
    • MediaWiki.CirrusSearch.poolCounter.*.failureMs.sample_rate
  • mediawiki_cirrussearch_indices_high_fix_rate
    • CirrusSearch Extension
    • MediaWiki.CirrusSearch.{eqiad,codfw,cloudelastic}.sanitization.fixed.sum
modules/profile/manifests/graphite/alerts.pp
  • mediawiki_session_loss
    • Core: EditPage->incrementEditFailureStats()
    • MediaWiki.edit.failures.session_loss.rate
  • mediawiki_bad_token
    • Core: EditPage->incrementEditFailureStats()
    • MediaWiki.edit.failures.bad_token.rate
  • mediawiki_centralauth_errors
  • mediawiki_accountcreation_errors
modules/role/manifests/elasticsearch/alerts.pp
  • cirrussearch_eqiad_fulltext_95th_percentile
    • CirrusSearch Extension
    • MediaWiki.CirrusSearch.eqiad.requestTimeMs.full_text.p95
  • cirrussearch_eqiad_compsuggest_95th_percentile
    • CirrusSearch Extension
    • MediaWiki.CirrusSearch.eqiad.requestTimeMs.comp_suggest.p95
    • MediaWiki.CirrusSearch.eqiad.requestTimeMs.comp_suggest.sample_rate
  • cirrussearch_eqiad_morelike_95th_percentile
    • CirrusSearch Extension
    • MediaWiki.CirrusSearch.eqiad.requestTimeMs.more_like.p95
  • cirrussearch_codfw_fulltext_95th_percentile
    • CirrusSearch Extension
    • MediaWiki.CirrusSearch.codfw.requestTimeMs.full_text.p95
  • cirrussearch_codfw_compsuggest_95th_percentile
    • CirrusSearch Extension
    • MediaWiki.CirrusSearch.codfw.requestTimeMs.comp_suggest.p95
    • MediaWiki.CirrusSearch.codfw.requestTimeMs.comp_suggest.sample_rate
  • cirrussearch_codfw_morelike_95th_percentile
    • CirrusSearch Extension
    • MediaWiki.CirrusSearch.codfw.requestTimeMs.more_like.p95
  • search_backend_failure_count (related: T355795: Fix "requests triggering circuit breakers" Elastic alert); now uses envoy telemetry
    • CirrusSearch Extension
    • MediaWiki.CirrusSearch.eqiad.backend_failure.failed.count
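
Some of the metric paths above use Graphite-style brace alternatives (for example `{eqiad,codfw,cloudelastic}`), so one line can stand for several concrete metrics. As a rough aid for the audit, the sketch below expands such groups into individual metric names. The helper name `expand_braces` is hypothetical and not part of any Wikimedia tooling. It assumes braces are not nested, leaves `*` wildcards alone, and treats Puppet-style `${site}` placeholders as literal text.

```python
import itertools
import re

# Matches a {a,b,c} group, but not a Puppet-style ${var} placeholder.
_BRACE_GROUP = re.compile(r"((?<!\$)\{[^{}]*\})")


def expand_braces(pattern: str) -> list[str]:
    """Expand every {a,b,c} group in a Graphite-style metric pattern.

    Hypothetical helper for illustration only: nested braces are not
    supported, and wildcards such as * are returned unchanged.
    """
    # re.split with one capturing group puts the matched groups at odd indices.
    parts = _BRACE_GROUP.split(pattern)
    choices = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            # Strip the surrounding braces and split the alternatives.
            choices.append(part[1:-1].split(","))
        else:
            choices.append([part])
    # Every combination of alternatives yields one concrete metric name.
    return ["".join(combo) for combo in itertools.product(*choices)]


if __name__ == "__main__":
    for name in expand_braces(
        "MediaWiki.CirrusSearch.{eqiad,codfw,cloudelastic}.sanitization.fixed.sum"
    ):
        print(name)
```

Run against the `sanitization.fixed.sum` entry, this yields one name per cluster, which makes it easier to check that each concrete metric has a statslib replacement before the Graphite alert is removed.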

Event Timeline

Change 972356 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[mediawiki/core@master] EditPage.php: convert edit failures count to new Stats library

https://gerrit.wikimedia.org/r/972356

Change 972356 merged by jenkins-bot:

[mediawiki/core@master] EditPage.php: convert edit failures count to new Stats library

https://gerrit.wikimedia.org/r/972356

Change 991007 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: add mw edit failures alert

https://gerrit.wikimedia.org/r/991007

Change 991008 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: remove mw edit failures graphite alerts

https://gerrit.wikimedia.org/r/991008

Change 991007 merged by Filippo Giunchedi:

[operations/alerts@master] sre: add mw edit failures alert

https://gerrit.wikimedia.org/r/991007

Change 991008 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: remove mw edit failures graphite alerts

https://gerrit.wikimedia.org/r/991008

Change 993661 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: move MediaWikiEditFailures alert to global

https://gerrit.wikimedia.org/r/993661

Change 993661 merged by Filippo Giunchedi:

[operations/alerts@master] sre: move MediaWikiEditFailures alert to global

https://gerrit.wikimedia.org/r/993661

Change 994185 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[mediawiki/extensions/WikimediaEvents@master] AuthManager: increment Stats counters too

https://gerrit.wikimedia.org/r/994185

Change 994185 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] AuthManager: increment Stats counters too

https://gerrit.wikimedia.org/r/994185