
Migrate role::alerting_host to Buster
Open, Needs Triage, Public

Description

As per the title, all hosts running Icinga should be on Buster, ideally by FQ2 FY20-21.

root@cumin1001:~# cumin 'P{O:alerting_host} and not P{F:lsbdistcodename = buster}'
2 hosts will be targeted:
icinga[1001,2001].wikimedia.org
DRY-RUN mode enabled, aborting
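
For readers unfamiliar with the folded host notation in the output above, the sketch below expands a pattern such as icinga[1001,2001].wikimedia.org into individual FQDNs. It is a simplified illustration only: the function name expand_hosts is hypothetical, it handles a single bracket group with comma lists and numeric ranges, and it is not how cumin itself resolves hosts.

```python
import re
from typing import List


def expand_hosts(pattern: str) -> List[str]:
    """Expand one bracket group, e.g. 'host[1001,2001].example' (simplified)."""
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if match is None:
        return [pattern]  # no bracket group: treat as a single host
    prefix, body, suffix = match.groups()
    hosts = []
    for item in body.split(","):
        if "-" in item:
            # Numeric range such as 1001-1003, keeping the zero padding.
            start, end = item.split("-", 1)
            width = len(start)
            for n in range(int(start), int(end) + 1):
                hosts.append(f"{prefix}{n:0{width}d}{suffix}")
        else:
            hosts.append(f"{prefix}{item}{suffix}")
    return hosts


print(expand_hosts("icinga[1001,2001].wikimedia.org"))
# ['icinga1001.wikimedia.org', 'icinga2001.wikimedia.org']
```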

Action plan:

  • Apply the role to alert2001 (new icinga host, T252032)
  • Verify everything is working as expected, i.e. state matches icinga1001, alert latency is acceptable, CPU load is lower (the hardware is more powerful), etc.
  • Make sure metamonitoring checks for alert[12]001 are passing
  • Switch over icinga1001 -> alert1001 (enable notifications for alert1001, switch DNS)
  • Enable meta monitoring notifications for alert1001 and alert2001 (via root crontab on wikitech-static)
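
One way to approach the "state matches icinga1001" step in the plan above is to diff the service states reported by the two servers. The following is a minimal sketch under stated assumptions: the states have already been exported into dictionaries keyed by (host, service), the function name diff_states is hypothetical, and the sample data is invented for illustration. How the states are exported from Icinga is not covered here.

```python
def diff_states(old, new):
    """Return {(host, service): (old_state, new_state)} for every mismatch.

    Checks present on only one server are reported with "MISSING" on the
    other side.
    """
    diffs = {}
    for key in sorted(set(old) | set(new)):
        old_state = old.get(key, "MISSING")
        new_state = new.get(key, "MISSING")
        if old_state != new_state:
            diffs[key] = (old_state, new_state)
    return diffs


# Invented sample data, not taken from the real servers.
icinga1001_states = {
    ("etcd1003", "puppet last run"): "OK",
    ("etcd1003", "ssh"): "OK",
}
alert1001_states = {
    ("etcd1003", "puppet last run"): "CRITICAL",
    ("etcd1003", "ssh"): "OK",
}

for (host, service), (before, after) in diff_states(
    icinga1001_states, alert1001_states
).items():
    print(f"{host} / {service}: {before} -> {after}")
```

An empty result would suggest the new server sees the same state as the old one for the checks that were exported.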

Event Timeline

fgiunchedi edited projects, added observability; removed SRE.
herron added a subscriber: herron. Mar 23 2020, 3:34 PM
fgiunchedi moved this task from Inbox to Backlog on the observability board. Apr 6 2020, 12:33 PM

I tested the role on Buster in WMCS and it appears to be working as expected!

fgiunchedi updated the task description. Jul 10 2020, 9:08 AM

Change 618345 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] alerting_host: assign alert[12]001 role::alerting_host

https://gerrit.wikimedia.org/r/618345

herron moved this task from Backlog to In progress on the observability board. Aug 4 2020, 4:45 PM

Change 618345 merged by Herron:
[operations/puppet@production] alerting_host: assign alert[12]001 role::alerting_host

https://gerrit.wikimedia.org/r/618345

Change 618545 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] acme_cheif: permit alert[12]001 to fetch icinga cert

https://gerrit.wikimedia.org/r/618545

Change 618545 merged by Herron:
[operations/puppet@production] acme_cheif: add alert[12]001 SNI and permit to fetch icinga cert

https://gerrit.wikimedia.org/r/618545

Change 618719 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: add alert[12]001 to monitoring hosts

https://gerrit.wikimedia.org/r/618719

Change 618719 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: add alert[12]001 to monitoring hosts

https://gerrit.wikimedia.org/r/618719

Change 620701 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] profile: refactor logmsgbot to follow Icinga failover

https://gerrit.wikimedia.org/r/620701

Something else I noticed while looking at icinga logs: check_nrpe for jessie hosts fails:

etcd1003;puppet last run;CRITICAL;SOFT;2;CHECK_NRPE: (ssl_err != 5) Error - Could not complete SSL handshake with 10.64.0.42: 1
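
The alert line above follows the semicolon-separated layout of Nagios-style service alert log entries: host, service, state, state type, attempt number, plugin output. The sketch below splits such a line into named fields, for example to count handshake failures per host. The names ServiceAlert and parse_service_alert are hypothetical, and it assumes the log prefix has already been stripped, as in the line quoted here.

```python
from typing import NamedTuple


class ServiceAlert(NamedTuple):
    host: str
    service: str
    state: str
    state_type: str
    attempt: int
    output: str


def parse_service_alert(line: str) -> ServiceAlert:
    """Split a service alert payload into its six fields."""
    # maxsplit=5 keeps any semicolons inside the plugin output intact.
    host, service, state, state_type, attempt, output = line.split(";", 5)
    return ServiceAlert(host, service, state, state_type, int(attempt), output)


line = (
    "etcd1003;puppet last run;CRITICAL;SOFT;2;CHECK_NRPE: (ssl_err != 5) "
    "Error - Could not complete SSL handshake with 10.64.0.42: 1"
)
alert = parse_service_alert(line)
print(alert.host, alert.state, alert.state_type, alert.attempt)
```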

Digging further, this turns out to be caused by DH parameters that are too short on the NRPE server side (from the check_nrpe logs):

[1597742562] SSL Certificate File: None
[1597742562] SSL Private Key File: None
[1597742562] SSL CA Certificate File: None
[1597742562] SSL Cipher List: ALL:!MD5:@STRENGTH:@SECLEVEL=0
[1597742562] SSL Allow ADH: 1
[1597742562] SSL Log Options: 0xff
[1597742562] SSL Version: TLSv1_plus And Above
[1597742562] Connected to 10.64.0.42
[1597742562] Error: (ERR_get_error_line_data = 337260938), Could not complete SSL handshake with 10.64.0.42: dh key too small
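
The numeric value in the last log line can be unpacked to cross-check the textual reason. This is a sketch that assumes the OpenSSL 1.1.x error packing (8 bits of library, 12 bits of function, 12 bits of reason); newer OpenSSL releases pack error codes differently, and the function name unpack_openssl_error is hypothetical.

```python
def unpack_openssl_error(code: int):
    """Split a packed OpenSSL 1.1.x error code into (lib, func, reason)."""
    lib = (code >> 24) & 0xFF
    func = (code >> 12) & 0xFFF
    reason = code & 0xFFF
    return lib, func, reason


code = 337260938  # value reported by check_nrpe in the log above
print(f"{code:08X}", unpack_openssl_error(code))
# 141A318A (20, 419, 394)
```

Library 20 is the SSL library, so the failure is raised in the TLS handshake code on the client side, which is consistent with the "dh key too small" text in the same log line.
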

fgiunchedi updated the task description. Aug 18 2020, 9:46 AM

Change 620929 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] icinga: make sure update-etcd-mw-config-lastindex is enabled

https://gerrit.wikimedia.org/r/620929

Change 620701 merged by Filippo Giunchedi:
[operations/puppet@production] profile: refactor logmsgbot to follow Icinga failover

https://gerrit.wikimedia.org/r/620701

Change 620929 merged by Filippo Giunchedi:
[operations/puppet@production] icinga: make sure update-etcd-mw-config-lastindex is enabled

https://gerrit.wikimedia.org/r/620929

herron updated the task description. Aug 26 2020, 6:02 PM
herron updated the task description. Sep 2 2020, 7:10 PM
herron updated the task description. Sep 3 2020, 4:10 PM
herron updated the task description.
herron updated the task description. Sep 15 2020, 7:43 PM

Change 629408 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] icinga: switch active server from icinga1001 to alert1001

https://gerrit.wikimedia.org/r/629408

Change 629415 had a related patch set uploaded (by Herron; owner: Herron):
[operations/dns@master] dns: point icinga CNAMEs to alert1001

https://gerrit.wikimedia.org/r/629415

Mentioned in SAL (#wikimedia-operations) [2020-09-23T16:00:44Z] <herron> switching icinga over from icinga1001 to alert1001 T247966

Change 629415 merged by Herron:
[operations/dns@master] dns: point icinga CNAMEs to alert1001

https://gerrit.wikimedia.org/r/629415

Change 629408 merged by Herron:
[operations/puppet@production] icinga: switch active server from icinga1001 to alert1001

https://gerrit.wikimedia.org/r/629408

herron updated the task description. Sep 23 2020, 4:21 PM
herron added a comment. Edited Sep 23 2020, 4:25 PM

Alert1001 is now the active Icinga server. Meta monitoring for alert[12]001 has been enabled as well.

There may be a few alerts needing TLC due to the transition, either from service checks changing from host icinga1001 to alert1001 or due to other migration-related issues. Please join me in keeping an eye out for these.