
check icinga for any alerts custom to sre-collab related services (convert monitoring::service checks to prometheus)
Closed, Resolved (Public)

Description

  • modules/profile/manifests/etherpad.pp: monitoring::service { 'etherpad-lite-http':
  • modules/profile/manifests/etherpad.pp: nrpe::monitor_service { 'etherpad-lite-proc':
  • modules/profile/manifests/releases/common.pp: monitoring::service { 'https_releases':
  • modules/profile/manifests/releases/mediawiki.pp: monitoring::service { 'http_releases_jenkins':
  • modules/profile/manifests/gerrit/proxy.pp: monitoring::service { 'https':
  • modules/profile/manifests/microsites/peopleweb.pp: monitoring::service { 'https-peopleweb':
  • modules/profile/manifests/microsites/peopleweb.pp: monitoring::service { 'https-peopleweb-expiry':
  • modules/profile/manifests/microsites/static_codereview.pp: monitoring::service { 'static-codereview-http':
  • modules/profile/manifests/microsites/static_rt.pp: monitoring::service { 'static-rt-https':
  • modules/profile/manifests/vrts.pp: monitoring::service { 'smtp':
  • modules/profile/manifests/phabricator/main.pp: monitoring::service { 'smtp':
  • modules/profile/manifests/gerrit.pp: monitoring::service { 'gerrit_ssh':

Details

Repo               Branch      Lines +/-
operations/puppet  production  +4 -5
operations/puppet  production  +1 -0
operations/puppet  production  +1 -1
operations/puppet  production  +13 -11
operations/puppet  production  +1 -1
operations/puppet  production  +11 -0
operations/puppet  production  +0 -7
operations/puppet  production  +1 -1
operations/puppet  production  +4 -4
operations/puppet  production  +6 -10
operations/puppet  production  +4 -4
operations/puppet  production  +2 -1
operations/puppet  production  +7 -4
operations/puppet  production  +13 -7
operations/puppet  production  +8 -12
operations/puppet  production  +0 -6
operations/puppet  production  +3 -2
operations/puppet  production  +8 -4

Event Timeline

The point is to go through Icinga, look at all hosts and services owned by sre-collab, and identify any custom checks: checks that are NOT base checks every host has (disk space, CPU, etc.) and that have NOT already been replaced by the recently added blackbox::http checks.

If there are none, this is done.

If some have already been replaced by blackbox checks, remove them.

If there is anything else special, ask how it can be converted to Alertmanager.

(sprint week related)
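
The first step, finding the custom Icinga checks, can be approximated by scanning the Puppet manifests for the two resource types listed in the description. The following is a minimal sketch in Python, not the tooling actually used for this task; the function name `find_icinga_checks` and the sample manifest text are hypothetical and exist only for illustration.

```python
import re

# Matches Puppet resource declarations such as
#   monitoring::service { 'etherpad-lite-http':
# Only the two resource types named in this task are considered.
DECL_RE = re.compile(
    r"^\s*(monitoring::service|nrpe::monitor_service)\s*\{\s*'([^']+)'\s*:"
)


def find_icinga_checks(manifests):
    """Return (path, resource_type, title) for each Icinga check declaration.

    `manifests` maps a file path to the text of that file.
    """
    found = []
    for path, text in sorted(manifests.items()):
        for line in text.splitlines():
            match = DECL_RE.match(line)
            if match:
                found.append((path, match.group(1), match.group(2)))
    return found


# Hypothetical manifest content, for illustration only.
sample = {
    "modules/profile/manifests/etherpad.pp": (
        "class profile::etherpad {\n"
        "    monitoring::service { 'etherpad-lite-http':\n"
        "        description => 'etherpad HTTP',\n"
        "    }\n"
        "    nrpe::monitor_service { 'etherpad-lite-proc':\n"
        "    }\n"
        "}\n"
    ),
}

# Print in the same "path: declaration" shape used in the task description.
for path, rtype, title in find_icinga_checks(sample):
    print(f"{path}: {rtype} {{ '{title}':")
```

Each hit still needs a manual decision: drop it if a blackbox check already covers it, convert it, or ask how to convert it.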

Dzahn triaged this task as Medium priority. Mar 23 2023, 5:43 PM
Dzahn changed the task status from Open to In Progress. Mar 23 2023, 5:46 PM

Icinga alerts to convert:

convert to blackbox::http check for collab team:

modules/profile/manifests/etherpad.pp: monitoring::service { 'etherpad-lite-http':
modules/profile/manifests/releases/common.pp: monitoring::service { 'https_releases':
modules/profile/manifests/releases/mediawiki.pp: monitoring::service { 'http_releases_jenkins':
modules/profile/manifests/gerrit/proxy.pp: monitoring::service { 'https':
modules/profile/manifests/microsites/peopleweb.pp: monitoring::service { 'https-peopleweb':
modules/profile/manifests/microsites/peopleweb.pp: monitoring::service { 'https-peopleweb-expiry':
modules/profile/manifests/microsites/static_codereview.pp: monitoring::service { 'static-codereview-http':
modules/profile/manifests/microsites/static_rt.pp: monitoring::service { 'static-rt-https':

serviceops:

modules/noc/manifests/init.pp: monitoring::service { 'https-noc':
modules/noc/manifests/init.pp: monitoring::service { 'https-noc-ssl-expiry':

not sure yet how to replace:

modules/profile/manifests/vrts.pp: monitoring::service { 'smtp':
modules/profile/manifests/phabricator/main.pp: monitoring::service { 'smtp':
modules/profile/manifests/gerrit.pp: monitoring::service { 'gerrit_ssh':
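
For the SMTP and SSH checks, one plausible approach is a plain TCP probe that connects and compares the server greeting against an expected prefix. Both protocols have the server speak first: an SMTP greeting starts with `220`, and an SSH identification string starts with `SSH-2.0-`. The sketch below illustrates that idea in Python; it is not the Puppet change that was eventually merged, and `tcp_banner_probe` and the fake server are hypothetical names used only for this example.

```python
import socket
import socketserver
import threading


def tcp_banner_probe(host, port, expected_prefix, timeout=3.0):
    """Connect to host:port and report whether the greeting starts with expected_prefix."""
    banner = b""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            # Read until enough bytes have arrived to compare, or the peer closes.
            while len(banner) < len(expected_prefix):
                chunk = sock.recv(256)
                if not chunk:
                    break
                banner += chunk
    except OSError:
        # Connection refused, timeout, and similar failures count as a failed probe.
        return False
    return banner.startswith(expected_prefix)


class FakeSmtpHandler(socketserver.BaseRequestHandler):
    """Hypothetical stand-in for a mail server: sends a greeting and closes."""

    def handle(self):
        self.request.sendall(b"220 mx.example.org ESMTP\r\n")


# Start the fake server on a free local port for the demonstration.
server = socketserver.TCPServer(("127.0.0.1", 0), FakeSmtpHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address

smtp_ok = tcp_banner_probe(host, port, b"220 ")     # greeting matches
ssh_ok = tcp_banner_probe(host, port, b"SSH-2.0-")  # wrong protocol on this port
server.shutdown()
server.server_close()
print(smtp_ok, ssh_ok)
```

The same function could be pointed at an SSH listener by passing `b"SSH-2.0-"` and the relevant port.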

Change 902783 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] etherpad: replace Icinga with Prometheus monitoring

https://gerrit.wikimedia.org/r/902783

Change 902785 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] releases: remove Icinga monitoring

https://gerrit.wikimedia.org/r/902785

Change 902788 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] releases-jenkins: replace Icinga with Prometheus monitoring

https://gerrit.wikimedia.org/r/902788

Change 902799 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: replace Icinga with Prometheus monitoring

https://gerrit.wikimedia.org/r/902799

Change 902801 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] peopleweb: replace Icinga with Prometheus monitoring

https://gerrit.wikimedia.org/r/902801

Change 902802 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] miscweb/static_rt: replace Icinga with Prometheus monitoring

https://gerrit.wikimedia.org/r/902802

Change 902802 merged by Dzahn:

[operations/puppet@production] miscweb/static_rt: replace Icinga with Prometheus monitoring

https://gerrit.wikimedia.org/r/902802

Change 903318 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] alertmanager: send sre-collab alerts to -operations and -sre-collab

https://gerrit.wikimedia.org/r/903318

Change 903318 merged by Dzahn:

[operations/puppet@production] alertmanager: send sre-collab alerts to -operations and -sre-collab

https://gerrit.wikimedia.org/r/903318

Change 902801 merged by Dzahn:

[operations/puppet@production] peopleweb: replace Icinga with Prometheus monitoring

https://gerrit.wikimedia.org/r/902801

Change 902785 merged by Dzahn:

[operations/puppet@production] releases: remove Icinga monitoring

https://gerrit.wikimedia.org/r/902785

Change 902783 merged by Dzahn:

[operations/puppet@production] etherpad: replace Icinga with Prometheus monitoring

https://gerrit.wikimedia.org/r/902783

Change 902788 merged by Dzahn:

[operations/puppet@production] releases-jenkins: replace Icinga with Prometheus monitoring

https://gerrit.wikimedia.org/r/902788

Change 903801 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] noc: replace Icinga with Prometheus monitoring

https://gerrit.wikimedia.org/r/903801

Change 903805 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] vrts: replace Icinga with Prometheus for SMTP monitoring

https://gerrit.wikimedia.org/r/903805

Change 903826 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: replace Icinga with Prometheus for SMTP monitoring

https://gerrit.wikimedia.org/r/903826

Change 904173 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] releases: rename new blackbox check for jenkins login page

https://gerrit.wikimedia.org/r/904173

Change 904173 merged by Dzahn:

[operations/puppet@production] releases: rename new blackbox check for jenkins login page

https://gerrit.wikimedia.org/r/904173

Change 903826 merged by Dzahn:

[operations/puppet@production] phabricator: replace Icinga with Prometheus for SMTP monitoring

https://gerrit.wikimedia.org/r/903826

Change 903801 merged by Dzahn:

[operations/puppet@production] noc: replace Icinga with Prometheus monitoring

https://gerrit.wikimedia.org/r/903801

Change 903805 merged by Dzahn:

[operations/puppet@production] vrts: replace Icinga with Prometheus for SMTP monitoring

https://gerrit.wikimedia.org/r/903805

Change 904856 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] etherpad: remove process monitoring

https://gerrit.wikimedia.org/r/904856

Change 904857 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: replace Icinga monitoring with Prometheus, ssh port 29418

https://gerrit.wikimedia.org/r/904857

Change 905178 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] noc: Fix alertmanager severity

https://gerrit.wikimedia.org/r/905178

Change 905178 merged by Clément Goubert:

[operations/puppet@production] noc: Fix alertmanager severity

https://gerrit.wikimedia.org/r/905178

Change 904856 merged by Dzahn:

[operations/puppet@production] etherpad: remove process monitoring

https://gerrit.wikimedia.org/r/904856

The Gerrit monitoring switch is still in discussion/review, but it will be done as part of T329587.

Dzahn renamed this task from "check icinga for any alerts custom to sre-collab related services" to "check icinga for any alerts custom to sre-collab related services (convert monitoring::service checks to prometheus)". Apr 6 2023, 6:20 PM

This is done, but there is a continuation of it for a different class of monitoring checks: T334250.

Change 904857 merged by Dzahn:

[operations/puppet@production] gerrit: replace Icinga monitoring with Prometheus, ssh port 29418

https://gerrit.wikimedia.org/r/904857

Change 902799 merged by Dzahn:

[operations/puppet@production] gerrit: add Prometheus blackbox https monitoring

https://gerrit.wikimedia.org/r/902799

Change 913262 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: follow_redirects in blackbox::http monitoring

https://gerrit.wikimedia.org/r/913262

Change 913262 merged by Dzahn:

[operations/puppet@production] gerrit: follow_redirects in blackbox::http monitoring

https://gerrit.wikimedia.org/r/913262

Change 913272 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: accept http status 404 in blackbox http monitor, for now

https://gerrit.wikimedia.org/r/913272

Change 913272 merged by Dzahn:

[operations/puppet@production] gerrit: accept http status 404 in blackbox http monitor, for now

https://gerrit.wikimedia.org/r/913272

Change 913273 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: accept 200 in addition to 302 and 404 in monitoring

https://gerrit.wikimedia.org/r/913273

Change 913273 merged by Dzahn:

[operations/puppet@production] gerrit: accept 200 in addition to 302 and 404 in monitoring

https://gerrit.wikimedia.org/r/913273
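
The last three Gerrit changes adjust two knobs of an HTTP probe: whether redirects are followed, and which status codes count as healthy. The Python sketch below shows how those two settings interact. It is an illustration only, not the blackbox exporter or the Puppet code; `http_probe`, `valid_status`, `max_hops`, and the demo server (which answers `/` with a 302 to a path that returns 404) are hypothetical and are not meant to describe what Gerrit itself returns.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def http_probe(url, valid_status=(200,), follow_redirects=False, max_hops=5):
    """GET url and return (ok, final_status); ok means final_status is in valid_status."""
    status = None
    for _ in range(max_hops + 1):
        parts = urlsplit(url)
        conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                    else http.client.HTTPConnection)
        conn = conn_cls(parts.netloc, timeout=5)
        try:
            target = parts.path or "/"
            if parts.query:
                target += "?" + parts.query
            conn.request("GET", target)
            resp = conn.getresponse()
            status = resp.status
            location = resp.getheader("Location")
        except (OSError, http.client.HTTPException):
            return False, None
        finally:
            conn.close()
        if follow_redirects and status in REDIRECT_CODES and location:
            url = urljoin(url, location)  # resolve relative Location headers
            continue
        break
    return status in valid_status, status


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical server: '/' redirects to a path that does not exist."""

    def do_GET(self):
        if self.path == "/":
            self.send_response(302)
            self.send_header("Location", "/missing")
            self.end_headers()
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the demonstration quiet


server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}/"

strict = http_probe(base)  # only 200 accepted, redirect not followed
no_follow = http_probe(base, valid_status=(200, 302, 404))
followed = http_probe(base, valid_status=(200, 302, 404), follow_redirects=True)
server.shutdown()
server.server_close()
print(strict, no_follow, followed)
```

Accepting 404 makes the probe pass even when the target page is missing, which is presumably why the change that introduced it is marked "for now".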

Change 913275 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: do not monitor the replica

https://gerrit.wikimedia.org/r/913275

Change 913275 merged by Dzahn:

[operations/puppet@production] gerrit: do not monitor the replica

https://gerrit.wikimedia.org/r/913275