
create a wmcs alerting group in icinga and review alerting
Closed, Resolved (Public)

Description

There are a number of things that it would be beneficial for us to be notified about but that are not useful to the greater team. I believe analytics has a similar group for very context-specific alerts.

Candidates:

  • nova-fullstack errors
  • issues with labtest* or deployments other than main
  • Toolforge proxy checks that do not make sense to go to everyone
  • Toolforge general issues
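
To make the idea concrete, here is a minimal sketch of routing checks to a team-specific contact group instead of the default one. It is not how the Icinga or puppet configuration is actually written: the check-name patterns, the `ROUTES` table, and the `contact_groups_for` helper are all hypothetical, and only the group names `wmcs-team` and `admins` appear in this task.

```python
from fnmatch import fnmatchcase
from typing import List

# Hypothetical routing table: check-name pattern -> contact groups.
# The patterns are illustrative assumptions, not real check names.
ROUTES = [
    ("nova-fullstack*", ["wmcs-team"]),
    ("labtest*", ["wmcs-team"]),
    ("toolforge-proxy*", ["wmcs-team"]),
]

# Assumed fallback group for checks that match no pattern.
DEFAULT_GROUPS = ["admins"]


def contact_groups_for(check_name: str) -> List[str]:
    """Return the contact groups for a check; the first matching pattern wins."""
    for pattern, groups in ROUTES:
        if fnmatchcase(check_name, pattern):
            return list(groups)
    return list(DEFAULT_GROUPS)


if __name__ == "__main__":
    for name in ("nova-fullstack", "labtest-puppetmaster", "mysql-replication"):
        print(name, "->", contact_groups_for(name))
```

The point of the sketch is only that context-specific checks resolve to the narrower group, while everything else keeps going to the wider one.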

Details

Repo               Branch      Lines +/-
operations/puppet  production  +1 -1
operations/puppet  production  +1 -1
operations/puppet  production  +9 -5
operations/puppet  production  +1 -1
operations/puppet  production  +57 -26
operations/puppet  production  +168 -100
operations/puppet  production  +14 -2
operations/puppet  production  +6 -6
operations/puppet  production  +29 -16
operations/puppet  production  +1 -1
operations/puppet  production  +1 -1
operations/puppet  production  +11 -4
operations/puppet  production  +16 -7
operations/puppet  production  +21 -8
operations/puppet  production  +1 -0
operations/puppet  production  +1 -1
operations/puppet  production  +30 -24
operations/puppet  production  +7 -5
operations/puppet  production  +4 -1
operations/puppet  production  +4 -2
operations/puppet  production  +7 -5

Event Timeline

chasemp renamed this task from "create a wmcs alerting group in icinga" to "create a wmcs alerting group in icinga and review alerting". Feb 22 2018, 9:12 PM

Tested a manual failure of this on labvirt1018, using nova to stop the instance naturally. It did alert both IRC -operations and the wmcs-team group.

RECOVERY - ensure kvm processes are running on labvirt1018 is OK: PROCS OK: 1 process with regex args /usr/bin/kvm
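
For readers unfamiliar with this kind of check, the sketch below shows roughly what a process-count check does: count the processes whose command line matches a regex and report OK when at least a minimum number are found. This is an illustration only, not the real monitoring plugin. The `check_procs` function, its `minimum` parameter, and the sample command lines are assumptions, and the status string merely imitates the message above.

```python
import re
from typing import Iterable


def check_procs(command_lines: Iterable[str], args_regex: str, minimum: int = 1) -> str:
    """Count command lines matching args_regex and build a status line.

    `minimum` is an assumed threshold: fewer matches than this is CRITICAL.
    """
    pattern = re.compile(args_regex)
    count = sum(1 for cmd in command_lines if pattern.search(cmd))
    state = "OK" if count >= minimum else "CRITICAL"
    noun = "process" if count == 1 else "processes"
    return f"PROCS {state}: {count} {noun} with regex args {args_regex}"


if __name__ == "__main__":
    # Made-up process lists standing in for a hypervisor before and after
    # its only instance is stopped.
    running = ["/usr/bin/kvm -name i-0001", "/usr/sbin/sshd -D"]
    stopped = ["/usr/sbin/sshd -D"]
    print(check_procs(running, "/usr/bin/kvm"))
    print(check_procs(stopped, "/usr/bin/kvm"))
```

Stopping the last instance on a host drops the match count to zero, which is the failure that was triggered manually in the test above.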

Change 413481 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] openstack: monitoring change for nova-network and conntrack

https://gerrit.wikimedia.org/r/413481

Change 413481 merged by Rush:
[operations/puppet@production] openstack: monitoring change for nova-network and conntrack

https://gerrit.wikimedia.org/r/413481

Change 413486 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] openstack: nova-fullstack alert wmcs-team

https://gerrit.wikimedia.org/r/413486

Change 413486 merged by Rush:
[operations/puppet@production] openstack: nova-fullstack alert wmcs-team

https://gerrit.wikimedia.org/r/413486

Change 413491 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] openstack: glance monitoring main should alert wmcs-team

https://gerrit.wikimedia.org/r/413491

Change 413491 merged by Rush:
[operations/puppet@production] openstack: glance monitoring main should alert wmcs-team

https://gerrit.wikimedia.org/r/413491

Change 413626 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] openstack: pass critical from deployment

https://gerrit.wikimedia.org/r/413626

Change 413626 merged by Rush:
[operations/puppet@production] openstack: pass critical from deployment

https://gerrit.wikimedia.org/r/413626

Change 413634 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] openstack: designate pass monitoring values through deployment

https://gerrit.wikimedia.org/r/413634

Change 413634 merged by Rush:
[operations/puppet@production] openstack: designate pass monitoring values through deployment

https://gerrit.wikimedia.org/r/413634

Change 413636 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] openstack: designate pass critical from deployment typo

https://gerrit.wikimedia.org/r/413636

Change 413636 merged by Rush:
[operations/puppet@production] openstack: designate pass critical from deployment typo

https://gerrit.wikimedia.org/r/413636

Change 413766 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] openstack: nova-fullstack alert after 1 retry

https://gerrit.wikimedia.org/r/413766

Change 413766 merged by Rush:
[operations/puppet@production] openstack: nova-fullstack alert after 1 retry

https://gerrit.wikimedia.org/r/413766

Change 413770 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] openstack: nova-api set to critical based on deployment

https://gerrit.wikimedia.org/r/413770

Change 413770 merged by Rush:
[operations/puppet@production] openstack: nova-api set to critical based on deployment

https://gerrit.wikimedia.org/r/413770

Change 413772 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] openstack: nova-conductor critical by deployment

https://gerrit.wikimedia.org/r/413772

Change 413772 merged by Rush:
[operations/puppet@production] openstack: nova-conductor critical by deployment

https://gerrit.wikimedia.org/r/413772

Change 413778 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] openstack: monitor nova-scheduler as critical

https://gerrit.wikimedia.org/r/413778

Change 413778 merged by Rush:
[operations/puppet@production] openstack: monitor nova-scheduler as critical

https://gerrit.wikimedia.org/r/413778

Change 413788 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] icinga: wmcs-team set rush contact

https://gerrit.wikimedia.org/r/413788

Change 413788 merged by Rush:
[operations/puppet@production] icinga: wmcs-team set rush contact

https://gerrit.wikimedia.org/r/413788

Change 413793 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] openstack: nova-fullstack test alert on 2 tries

https://gerrit.wikimedia.org/r/413793

Change 413793 merged by Rush:
[operations/puppet@production] openstack: nova-fullstack test alert on 2 tries

https://gerrit.wikimedia.org/r/413793

Change 413800 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] toolforge: set alerting for tools.checker things

https://gerrit.wikimedia.org/r/413800

Change 413800 merged by Rush:
[operations/puppet@production] toolforge: set alerting for tools.checker things

https://gerrit.wikimedia.org/r/413800

Change 413804 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] toolforge: tools checker contact_groups add admins

https://gerrit.wikimedia.org/r/413804

Change 413804 merged by Rush:
[operations/puppet@production] toolforge: tools checker contact_groups add admins

https://gerrit.wikimedia.org/r/413804

Change 415083 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] icinga: create irc-cloud-feed channel for ircbot

https://gerrit.wikimedia.org/r/415083

Change 415083 merged by Rush:
[operations/puppet@production] icinga: create irc-cloud-feed channel for ircbot

https://gerrit.wikimedia.org/r/415083

Change 415167 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] icinga: change alerting for openstack things

https://gerrit.wikimedia.org/r/415167

Change 415167 merged by Rush:
[operations/puppet@production] icinga: change alerting for openstack things

https://gerrit.wikimedia.org/r/415167

Change 415283 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] labstore: monitoring changes for critical and contacts

https://gerrit.wikimedia.org/r/415283

Change 415283 merged by Rush:
[operations/puppet@production] labstore: monitoring changes for critical and contacts

https://gerrit.wikimedia.org/r/415283

Change 415300 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] openstack: labstore monitoring typo fix

https://gerrit.wikimedia.org/r/415300

Change 415300 merged by Rush:
[operations/puppet@production] openstack: labstore monitoring typo fix

https://gerrit.wikimedia.org/r/415300

Change 415874 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] toolforge: change test tolerance for paws and trusty grid

https://gerrit.wikimedia.org/r/415874

Change 415874 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: change test tolerance for paws and trusty grid

https://gerrit.wikimedia.org/r/415874

Change 416432 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] openstack: kvm monitoring threshold 75=>90

https://gerrit.wikimedia.org/r/416432

Change 416432 merged by Rush:
[operations/puppet@production] openstack: kvm monitoring threshold 75=>90

https://gerrit.wikimedia.org/r/416432

Change 416844 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] wmcs: update wmcs-team contacts for wmcs respective

https://gerrit.wikimedia.org/r/416844

Change 416844 merged by Rush:
[operations/puppet@production] wmcs: update wmcs-team contacts for wmcs respective

https://gerrit.wikimedia.org/r/416844