upgrade icinga server to stretch and replace einsteinium
Closed, Resolved, Public

Description

The production Icinga server (alerting_host) is einsteinium. The hardware needs to be replaced and it is running jessie.

This task consists of:

  • provide replacement hardware, called icinga1001 [T201344]
  • ensure the current alerting_host puppet role works on stretch, fix it if it doesn't (cloud VPS)
  • make changes to the role so that it can be applied on the new host in eqiad without being the "active" host to avoid duplicate alerts and notification spam
  • optional but very nice to have (otherwise we have to override jenkins style checks): refactor the icinga role to a profile, move the Hiera calls to parameters
  • apply role on icinga1001, confirm puppet is fine, if not fix things
  • confirm web UI works, test login, user privileges, sending email out as notification
  • confirm all checks are green (or like in prod), without having notifications enabled
  • schedule switchover from einsteinium to icinga1001
  • switch over and disable notifications permanently on einsteinium but leave it running for a grace period (?)
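The active/passive split in the list above can be sketched as follows. This is a minimal illustration in Python, not the actual Puppet role: the function names and the `active_host` parameter are hypothetical, and the only real setting used is `enable_notifications`, the main-config directive in Icinga 1.x / Nagios that turns notifications on or off globally.

```python
from typing import Dict


def notification_settings(hostname: str, active_host: str) -> Dict[str, int]:
    """Return main-config overrides for one alerting host (hypothetical helper).

    Only the active host sends notifications. A passive host still runs
    every check, so its results can be compared with production.
    """
    is_active = hostname == active_host
    return {"enable_notifications": 1 if is_active else 0}


def render(settings: Dict[str, int]) -> str:
    """Render the overrides as key=value lines, the format icinga.cfg uses."""
    return "\n".join(f"{key}={value}" for key, value in sorted(settings.items())) + "\n"


if __name__ == "__main__":
    # Before the switchover einsteinium is active; afterwards, swap the argument.
    for host in ("einsteinium", "icinga1001"):
        print(host, render(notification_settings(host, active_host="einsteinium")), end="")
```

Keeping the choice of active host in a single parameter means the switchover is a one-value change, and both hosts can carry the same role in the meantime without duplicate alerts.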

Change 469320 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: logging optimizations

https://gerrit.wikimedia.org/r/469320

Mentioned in SAL (#wikimedia-operations) [2018-10-23T21:30:02Z] <mutante> icinga1001 - changing check_result_reaper_frequency from 10 to 3, trying to lower average check latency. "allow faster check result processing -> requires more CPU" (T202782)

Mentioned in SAL (#wikimedia-operations) [2018-10-23T21:47:28Z] <mutante> icinga1001 - replacing check_ping with check_fping as the standard host check command, for faster host checks (another tip from Nagios Tuning guide, still manual testing) (T202782)

Change 469333 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: use fping instead of ping for faster host checks

https://gerrit.wikimedia.org/r/469333

Change 469253 merged by Dzahn:
[operations/puppet@production] icinga: allow configuring max_concurrent_checks via Hiera

https://gerrit.wikimedia.org/r/469253

Change 469317 merged by Dzahn:
[operations/puppet@production] icinga: don't log service/host check retries

https://gerrit.wikimedia.org/r/469317

Change 467017 abandoned by Dzahn:
icinga/etcd: /var/run/icinga/ -> /var/run/nagios/

Reason:
done in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/468414/ instead

https://gerrit.wikimedia.org/r/467017

Change 469780 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] icinga: install nsca_frack.cfg in objects on stretch

https://gerrit.wikimedia.org/r/469780

Change 469780 merged by Dzahn:
[operations/puppet@production] icinga: install nsca_frack.cfg in objects on stretch

https://gerrit.wikimedia.org/r/469780

Change 469808 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga/nsca: allow configuring nsca chroot in Hiera, change on stretch

https://gerrit.wikimedia.org/r/469808

Change 469808 merged by Dzahn:
[operations/puppet@production] icinga/nsca: fix nsca_chroot path on stretch

https://gerrit.wikimedia.org/r/469808

Change 469988 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga/nsca: fix command_file path on stretch

https://gerrit.wikimedia.org/r/469988

Change 469988 merged by Dzahn:
[operations/puppet@production] icinga/nsca: fix command_file path on stretch

https://gerrit.wikimedia.org/r/469988

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

icinga1001.wikimedia.org

The log can be found in /var/log/wmf-auto-reimage/201810262131_dzahn_122415_icinga1001_wikimedia_org.log.

Completed auto-reimage of hosts:

['icinga1001.wikimedia.org']

Of which those FAILED:

['icinga1001.wikimedia.org']

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

icinga1001.wikimedia.org

The log can be found in /var/log/wmf-auto-reimage/201810262133_dzahn_122885_icinga1001_wikimedia_org.log.

Completed auto-reimage of hosts:

['icinga1001.wikimedia.org']

Of which those FAILED:

['icinga1001.wikimedia.org']

Mentioned in SAL (#wikimedia-operations) [2018-10-27T00:00:06Z] <mutante> icinga1001 - using wmf-auto-reimage to reinstall gets stuck at initial puppet run after reboot - Still waiting for Puppet after 105.0 minutes - aborting on cumin, logging in directly and manually running puppet (T202782 T208100)

Change 470106 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga/alerting_host: ensure mpm_prefork is selected for httpd and php7.0

https://gerrit.wikimedia.org/r/470106

Change 470106 merged by Dzahn:
[operations/puppet@production] icinga/alerting_host: ensure mpm_prefork is selected for httpd and php7.0

https://gerrit.wikimedia.org/r/470106

Mentioned in SAL (#wikimedia-operations) [2018-10-27T00:00:06Z] <mutante> icinga1001 - using wmf-auto-reimage to reinstall gets stuck at initial puppet run after reboot - Still waiting for Puppet after 105.0 minutes - aborting on cumin, logging in directly and manually running puppet (T202782 T208100)

What got stuck?
The first puppet run can easily take up to 2 hours, depending on the host role and server capabilities. You can check its progress by tailing the cumin log, whose path is printed to stdout at the start of the reimage process (the 2nd of the two logs listed).
To check directly on the host what is going on, you can also log in with either the install_console script or your personal key (depending on where the host is in the puppetization process) or, in the worst case where neither works, via the mgmt console as root.
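As a rough illustration of checking progress from a log, here is a small Python sketch that picks the newest reimage log for a host and returns its last lines. It is not part of wmf-auto-reimage. It assumes, based only on the paths quoted in this task, that log names start with a sortable timestamp and end with the hostname with dots replaced by underscores. `latest_log` and `tail_lines` are hypothetical helper names.

```python
from collections import deque
from pathlib import Path
from typing import List, Optional


def latest_log(log_dir: Path, host: str) -> Optional[Path]:
    """Return the newest log for `host`, or None if there is none.

    Assumes names like 201811030258_dzahn_163438_icinga1001_wikimedia_org.log,
    so sorting by name also sorts by the leading timestamp.
    """
    suffix = "_" + host.replace(".", "_") + ".log"
    candidates = sorted(p for p in log_dir.iterdir() if p.name.endswith(suffix))
    return candidates[-1] if candidates else None


def tail_lines(path: Path, count: int = 20) -> List[str]:
    """Return the last `count` lines of a file without loading all of it."""
    with path.open(errors="replace") as handle:
        # A bounded deque keeps only the most recent `count` lines.
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]
```

Calling `tail_lines(latest_log(Path("/var/log/wmf-auto-reimage"), "icinga1001.wikimedia.org"))` would then show the most recent output of the latest run, provided such a log exists.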

Change 469320 merged by Dzahn:
[operations/puppet@production] icinga: logging optimizations

https://gerrit.wikimedia.org/r/469320

Change 467015 merged by Dzahn:
[operations/puppet@production] etcd::monitoring: replace Nagios::Plugin with Monitoring::Plugin on stretch

https://gerrit.wikimedia.org/r/467015

Change 470915 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: don't install PHP httpd module on stretch

https://gerrit.wikimedia.org/r/470915

Change 470915 merged by Dzahn:
[operations/puppet@production] icinga: don't install PHP httpd module on stretch

https://gerrit.wikimedia.org/r/470915

Change 470955 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: on stretch, don't use custom logrotate config

https://gerrit.wikimedia.org/r/470955

Change 470955 merged by Dzahn:
[operations/puppet@production] icinga: on stretch, don't use custom logrotate config

https://gerrit.wikimedia.org/r/470955

Dzahn added a comment. Edited Nov 1 2018, 12:32 AM

What got stuck?

It never detected a successful run because it had this error: T208108

which I fixed with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/470106/

I will reinstall one more time to double check there was nothing else.

Logging in wasn't the problem; the run was already past creating all the users.

Change 462600 merged by Dzahn:
[operations/puppet@production] icinga: on stretch, use systemd::service, unit file by systemd-sysv-generator

https://gerrit.wikimedia.org/r/462600

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

icinga1001.wikimedia.org

The log can be found in /var/log/wmf-auto-reimage/201811030258_dzahn_163438_icinga1001_wikimedia_org.log.

Completed auto-reimage of hosts:

['icinga1001.wikimedia.org']

and were ALL successful.

Change 471866 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] nagios_common: ensure libtimedate-perl is installed for check_ssl

https://gerrit.wikimedia.org/r/471866

Change 471866 merged by Dzahn:
[operations/puppet@production] nagios_common: ensure libtimedate-perl is installed for check_ssl

https://gerrit.wikimedia.org/r/471866

Change 471873 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: fix ownership of command file dir on stretch

https://gerrit.wikimedia.org/r/471873

Change 471873 merged by Dzahn:
[operations/puppet@production] icinga: fix ownership of command file dir on stretch

https://gerrit.wikimedia.org/r/471873

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

icinga1001.wikimedia.org

The log can be found in /var/log/wmf-auto-reimage/201811060134_dzahn_77047_icinga1001_wikimedia_org.log.

Change 471900 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] mailman: whitelist icinga1001 IP alongside einsteinium and tegmen

https://gerrit.wikimedia.org/r/471900

Completed auto-reimage of hosts:

['icinga1001.wikimedia.org']

Of which those FAILED:

['icinga1001.wikimedia.org']

Change 465519 merged by Dzahn:
[operations/puppet@production] base/icinga: use monitoring_hosts constant as NRPE allowed_hosts

https://gerrit.wikimedia.org/r/465519

Change 472006 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] base/NRPE: make nrpe_allowed_hosts flexible on $realm

https://gerrit.wikimedia.org/r/472006

Change 472006 merged by Dzahn:
[operations/puppet@production] base/NRPE: make nrpe_allowed_hosts flexible on $realm

https://gerrit.wikimedia.org/r/472006

Change 471900 merged by Dzahn:
[operations/puppet@production] mailman: whitelist icinga1001 IP alongside einsteinium and tegmen

https://gerrit.wikimedia.org/r/471900

Mentioned in SAL (#wikimedia-operations) [2018-11-06T21:50:34Z] <mutante> icinga1001-"MediaWiki EtcdConfig up-to-date" checks were all UNKNOWN because systemd unit update-etcd-mw-config-lastindex was present but service not running. it was turned off in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/427328/ on purpose. manually ran "systemctl start update-etcd-mw-config-lastindex" and the checks all work (T202782)

Change 472352 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: fix path to retention.dat file on stretch

https://gerrit.wikimedia.org/r/472352

Change 472352 merged by Dzahn:
[operations/puppet@production] icinga: fix path to retention.dat file on stretch

https://gerrit.wikimedia.org/r/472352

Change 472519 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] icinga: add tmpfs options to stretch

https://gerrit.wikimedia.org/r/472519

Change 472519 merged by Dzahn:
[operations/puppet@production] icinga: add tmpfs options to stretch

https://gerrit.wikimedia.org/r/472519

jijiki added a subscriber: jijiki. Nov 12 2018, 4:35 PM

Change 473235 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] switch icinga-stretch to icinga and icinga to icinga-old

https://gerrit.wikimedia.org/r/473235

Dzahn updated the task description. Nov 13 2018, 3:54 PM

Change 473244 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: make icinga1001 active and einsteinium passive

https://gerrit.wikimedia.org/r/473244

Change 473244 merged by Dzahn:
[operations/puppet@production] icinga: make icinga1001 active and einsteinium passive

https://gerrit.wikimedia.org/r/473244

Change 473235 merged by Dzahn:
[operations/dns@master] switch icinga-stretch to icinga and icinga to icinga-old

https://gerrit.wikimedia.org/r/473235

Change 473250 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] icinga: disable service notifications on einsteinium and enable on icinga1001

https://gerrit.wikimedia.org/r/473250

Change 473251 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] icinga: transition to new active host

https://gerrit.wikimedia.org/r/473251

Change 473250 merged by Dzahn:
[operations/puppet@production] icinga: disable service notifications on einsteinium and enable on icinga1001

https://gerrit.wikimedia.org/r/473250

Change 473251 merged by Cwhite:
[operations/puppet@production] icinga: transition to new active host

https://gerrit.wikimedia.org/r/473251

Mentioned in SAL (#wikimedia-operations) [2018-11-13T18:03:04Z] <mutante> icinga migration has concluded, we are now on stretch and icinga1001, einsteinium is passive (T202782)

Change 473261 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: fix path to puppet_hosts in icinga-downtime

https://gerrit.wikimedia.org/r/473261

Change 473261 merged by Dzahn:
[operations/puppet@production] icinga: fix path to puppet_hosts in icinga-downtime

https://gerrit.wikimedia.org/r/473261

Dzahn updated the task description. Nov 13 2018, 7:40 PM

The switch has happened. icinga1001 is now the active server and einsteinium is passive. This ticket stays open for a grace period until we decom einsteinium.

Change 473275 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: enable daemon log, in addition to syslog, again

https://gerrit.wikimedia.org/r/473275

Change 473276 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: remove jessie support

https://gerrit.wikimedia.org/r/473276

Change 473275 merged by Dzahn:
[operations/puppet@production] icinga: enable daemon log, in addition to syslog, again

https://gerrit.wikimedia.org/r/473275

Change 473282 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] smokeping: replace target einsteinium with dns1002

https://gerrit.wikimedia.org/r/473282

Change 473282 merged by Dzahn:
[operations/puppet@production] smokeping: replace target einsteinium with authdns1001

https://gerrit.wikimedia.org/r/473282

Dzahn changed the status of subtask T209738: decom einsteinium from Open to Stalled. Nov 16 2018, 11:58 PM
Dzahn updated the task description. Nov 17 2018, 12:00 AM
Dzahn updated the task description.
Dzahn closed this task as Resolved.

This ticket is resolved: einsteinium has been replaced by icinga1001 on stretch.

The rest of the steps will be part of the decom ticket now created at T209738, with the official decom checkboxes (and I added the one to remove from AQL that was here previously).

Change 473278 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: remove einsteinium as an alerting_host

https://gerrit.wikimedia.org/r/473278

Change 469333 abandoned by Dzahn:
icinga: on stretch, use fping instead of ping for faster host checks

https://gerrit.wikimedia.org/r/469333

Change 474463 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: fix path to puppet_hosts/services in default_icinga.sh

https://gerrit.wikimedia.org/r/474463

Change 474463 merged by Dzahn:
[operations/puppet@production] icinga: fix path to puppet_hosts/services in default_icinga.sh

https://gerrit.wikimedia.org/r/474463

Change 474464 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: do not manage retention.dat in puppet

https://gerrit.wikimedia.org/r/474464

Change 474464 merged by Dzahn:
[operations/puppet@production] icinga: do not manage retention.dat in puppet

https://gerrit.wikimedia.org/r/474464

Change 473278 merged by Dzahn:
[operations/puppet@production] icinga: remove einsteinium as an alerting_host

https://gerrit.wikimedia.org/r/473278

Change 475881 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga/NSCA (passive checks): remove jessie support hacks

https://gerrit.wikimedia.org/r/475881

Change 475881 merged by Dzahn:
[operations/puppet@production] icinga/NSCA (passive checks): remove jessie support hacks

https://gerrit.wikimedia.org/r/475881

Change 475901 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga::web: do not use PHP anymore

https://gerrit.wikimedia.org/r/475901

Change 475901 merged by Dzahn:
[operations/puppet@production] icinga::web: do not use PHP anymore

https://gerrit.wikimedia.org/r/475901

Change 475927 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: stop having separate configs for jessie/stretch

https://gerrit.wikimedia.org/r/475927

Change 475927 merged by Dzahn:
[operations/puppet@production] icinga: stop having separate configs for jessie/stretch

https://gerrit.wikimedia.org/r/475927

Change 473276 merged by Dzahn:
[operations/puppet@production] icinga: remove jessie support

https://gerrit.wikimedia.org/r/473276

Dzahn changed the status of subtask T209738: decom einsteinium from Stalled to Open. Tue, Nov 27, 12:59 AM