
Migrate role::alerting_host to Buster
Closed, Resolved (Public)

Description

As per the title, all hosts running Icinga should be on Buster, ideally by FQ2 FY20-21.

root@cumin1001:~# cumin 'P{O:alerting_host} and not P{F:lsbdistcodename = buster}'
2 hosts will be targeted:
icinga[1001,2001].wikimedia.org
DRY-RUN mode enabled, aborting
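
The bracket notation in the output above packs several hostnames into one expression. As a minimal sketch of how such an expression unfolds, the hypothetical helper `expand_hosts` below handles a single bracket group with comma-separated values and `N-M` ranges. It is an illustration only and is not how Cumin itself expands host expressions.

```python
import re
from typing import List


def expand_hosts(expr: str) -> List[str]:
    """Expand one bracket group, e.g. 'icinga[1001,2001].wikimedia.org'.

    Hypothetical helper for illustration; supports a single [...] group
    containing comma-separated values and numeric N-M ranges.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if match is None:
        # No bracket group: treat the expression as a single hostname.
        return [expr]
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        if "-" in part:
            start, end = part.split("-", 1)
            width = len(start)  # preserve zero padding such as 001-003
            for number in range(int(start), int(end) + 1):
                hosts.append(f"{prefix}{str(number).zfill(width)}{suffix}")
        else:
            hosts.append(f"{prefix}{part}{suffix}")
    return hosts


if __name__ == "__main__":
    for host in expand_hosts("icinga[1001,2001].wikimedia.org"):
        print(host)
```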

Action plan:

  • Apply the role to alert2001 (new icinga host, T252032)
  • Verify that everything works as expected, i.e. the state matches icinga1001
  • Check that alert latency is acceptable, CPU load is lower (the hardware is more powerful), etc.
  • Make sure metamonitoring checks for alert[12]001 are passing
  • Switch over icinga1001 -> alert1001 (enable notifications for alert1001, switch DNS)
  • Enable metamonitoring notifications for alert1001 and alert2001 (via root crontab on wikitech-static)
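
One way to approach the "state matches icinga1001" check in the plan above is to diff the check states seen by the old and the new host. The sketch below assumes the states have already been exported into dictionaries keyed by (host, service). That export step, the function name `diff_states`, and the sample data are all hypothetical and not part of this task.

```python
from typing import Dict, Optional, Tuple

CheckKey = Tuple[str, str]  # (monitored host, service description)


def diff_states(
    reference: Dict[CheckKey, str],
    candidate: Dict[CheckKey, str],
) -> Dict[CheckKey, Tuple[Optional[str], Optional[str]]]:
    """Return the checks whose state differs between two Icinga servers.

    A value of None means the check is missing on that side.
    """
    differences = {}
    for key in sorted(set(reference) | set(candidate)):
        ref_state = reference.get(key)
        cand_state = candidate.get(key)
        if ref_state != cand_state:
            differences[key] = (ref_state, cand_state)
    return differences


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    old_view = {("etcd1003", "puppet last run"): "OK", ("db1", "disk"): "OK"}
    new_view = {("etcd1003", "puppet last run"): "CRITICAL"}
    for key, (old, new) in diff_states(old_view, new_view).items():
        print(key, "reference:", old, "candidate:", new)
```

An empty result would indicate that both servers agree on every check they know about.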

Event Timeline

I tested the role on Buster in WMCS and it appears to be working as expected!

Change 618345 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] alerting_host: assign alert[12]001 role::alerting_host

https://gerrit.wikimedia.org/r/618345

Change 618345 merged by Herron:
[operations/puppet@production] alerting_host: assign alert[12]001 role::alerting_host

https://gerrit.wikimedia.org/r/618345

Change 618545 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] acme_cheif: permit alert[12]001 to fetch icinga cert

https://gerrit.wikimedia.org/r/618545

Change 618545 merged by Herron:
[operations/puppet@production] acme_cheif: add alert[12]001 SNI and permit to fetch icinga cert

https://gerrit.wikimedia.org/r/618545

Change 618719 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: add alert[12]001 to monitoring hosts

https://gerrit.wikimedia.org/r/618719

Change 618719 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: add alert[12]001 to monitoring hosts

https://gerrit.wikimedia.org/r/618719

Change 620701 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] profile: refactor logmsgbot to follow Icinga failover

https://gerrit.wikimedia.org/r/620701

Something else I noticed while looking at icinga logs: check_nrpe for jessie hosts fails:

etcd1003;puppet last run;CRITICAL;SOFT;2;CHECK_NRPE: (ssl_err != 5) Error - Could not complete SSL handshake with 10.64.0.42: 1
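
The log line above is semicolon-delimited. As a rough sketch, it can be split into named fields as below. The field names follow the usual Nagios-style service alert layout (host, service, state, state type, attempt, plugin output) and are an assumption, as are the names `ServiceAlert` and `parse_service_alert`.

```python
from typing import NamedTuple


class ServiceAlert(NamedTuple):
    host: str
    service: str
    state: str
    state_type: str
    attempt: int
    output: str


def parse_service_alert(line: str) -> ServiceAlert:
    """Split a semicolon-delimited alert line into named fields.

    The plugin output may itself contain semicolons, so only the first
    five separators are used for splitting.
    """
    host, service, state, state_type, attempt, output = line.split(";", 5)
    return ServiceAlert(host, service, state, state_type, int(attempt), output)


if __name__ == "__main__":
    sample = (
        "etcd1003;puppet last run;CRITICAL;SOFT;2;"
        "CHECK_NRPE: (ssl_err != 5) Error - Could not complete "
        "SSL handshake with 10.64.0.42: 1"
    )
    alert = parse_service_alert(sample)
    print(alert.host, alert.state, alert.state_type, alert.attempt)
```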

Digging further, this turns out to be caused by DH parameters that are too short on the NRPE server side (from the check_nrpe logs):

[1597742562] SSL Certificate File: None
[1597742562] SSL Private Key File: None
[1597742562] SSL CA Certificate File: None
[1597742562] SSL Cipher List: ALL:!MD5:@STRENGTH:@SECLEVEL=0
[1597742562] SSL Allow ADH: 1
[1597742562] SSL Log Options: 0xff
[1597742562] SSL Version: TLSv1_plus And Above
[1597742562] Connected to 10.64.0.42
[1597742562] Error: (ERR_get_error_line_data = 337260938), Could not complete SSL handshake with 10.64.0.42: dh key too small
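
To find which hosts are affected, the failure reason can be pulled out of debug output like the excerpt above. The sketch below assumes the `[epoch] message` layout seen in that excerpt. The function name `find_handshake_failures` and the regular expression are illustrative and not part of check_nrpe.

```python
import re
from datetime import datetime, timezone
from typing import List, Tuple

# Matches lines such as:
# [1597742562] Error: (...), Could not complete SSL handshake with 10.64.0.42: dh key too small
HANDSHAKE_ERROR = re.compile(
    r"^\[(\d+)\] Error: .*Could not complete SSL handshake with (\S+): (.+)$"
)


def find_handshake_failures(lines: List[str]) -> List[Tuple[datetime, str, str]]:
    """Return (UTC timestamp, peer address, reason) for each failed handshake."""
    failures = []
    for line in lines:
        match = HANDSHAKE_ERROR.match(line.strip())
        if match:
            epoch, peer, reason = match.groups()
            when = datetime.fromtimestamp(int(epoch), tz=timezone.utc)
            failures.append((when, peer, reason))
    return failures


if __name__ == "__main__":
    log = [
        "[1597742562] Connected to 10.64.0.42",
        "[1597742562] Error: (ERR_get_error_line_data = 337260938), "
        "Could not complete SSL handshake with 10.64.0.42: dh key too small",
    ]
    for when, peer, reason in find_handshake_failures(log):
        print(when.isoformat(), peer, reason)
```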

Change 620929 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] icinga: make sure update-etcd-mw-config-lastindex is enabled

https://gerrit.wikimedia.org/r/620929

Change 620701 merged by Filippo Giunchedi:
[operations/puppet@production] profile: refactor logmsgbot to follow Icinga failover

https://gerrit.wikimedia.org/r/620701

Change 620929 merged by Filippo Giunchedi:
[operations/puppet@production] icinga: make sure update-etcd-mw-config-lastindex is enabled

https://gerrit.wikimedia.org/r/620929

herron updated the task description.

Change 629408 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] icinga: switch active server from icinga1001 to alert1001

https://gerrit.wikimedia.org/r/629408

Change 629415 had a related patch set uploaded (by Herron; owner: Herron):
[operations/dns@master] dns: point icinga CNAMEs to alert1001

https://gerrit.wikimedia.org/r/629415

Mentioned in SAL (#wikimedia-operations) [2020-09-23T16:00:44Z] <herron> switching icinga over from icinga1001 to alert1001 T247966

Change 629415 merged by Herron:
[operations/dns@master] dns: point icinga CNAMEs to alert1001

https://gerrit.wikimedia.org/r/629415

Change 629408 merged by Herron:
[operations/puppet@production] icinga: switch active server from icinga1001 to alert1001

https://gerrit.wikimedia.org/r/629408

alert1001 is now the active Icinga server. Meta monitoring for alert[12]001 has been enabled as well.

A few alerts may need TLC because of the transition, either because service checks moved from host icinga1001 to alert1001 or because of other migration-related issues. Please join me in keeping an eye out for these.

What's up with icinga1001/icinga2001? They are still up and running.

Change 677656 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] remove alerting_host role from icinga[12]001

https://gerrit.wikimedia.org/r/677656

Change 677656 abandoned by Cwhite:
[operations/puppet@production] remove alerting_host role from icinga[12]001

Reason: superseded

https://gerrit.wikimedia.org/r/677656

lmata claimed this task.

Mentioned in SAL (#wikimedia-operations) [2021-07-16T15:14:41Z] <godog> set alert2001 as active in netbox (was staged) - T247966