
Upgrade Druid nodes (1001->1006) to Debian Stretch
Closed, Resolved | Public | 8 Estimated Story Points

Description

Two things would be needed to allow this:

  1. zookeeper upgraded to 3.4.9 (not strictly necessary, but safer, since the jessie version is 3.4.5).
  2. druid upgraded to 0.10+ (so that only openjdk-8 is needed) - T164008

The procedure then should be really simple:

  1. depool (if needed) one host at a time
  2. reimage it
  3. check that everything is fine and proceed with another one

Cached Druid segments will need to be restored after each reimage, so it is probably better to leave enough time between one host and the next.

  • druid1001
  • druid1002
  • druid1003
  • druid1004
  • druid1005
  • druid1006
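
The procedure above can be sketched as a small rolling loop. This is only an illustration under stated assumptions: `depool`, `reimage` and `is_healthy` are hypothetical callables supplied by the operator (they are not part of wmf-auto-reimage or any other real tool), and `settle_seconds` stands in for the pause that lets a host restore its segment cache before the next one is touched.

```python
import time
from typing import Callable, Iterable, List


def rolling_reimage(
    hosts: Iterable[str],
    depool: Callable[[str], None],
    reimage: Callable[[str], bool],
    is_healthy: Callable[[str], bool],
    settle_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Reimage hosts one at a time and stop at the first failure.

    All callables are hypothetical hooks: `depool` takes the host out of
    service, `reimage` returns True on success, and `is_healthy` returns
    True once the host looks fine again.
    Returns the hosts that completed successfully, in order.
    """
    done: List[str] = []
    for host in hosts:
        depool(host)
        # Do not move on to another host if this one failed or is unhealthy.
        if not reimage(host) or not is_healthy(host):
            break
        done.append(host)
        # Give the host time to restore its cached segments.
        sleep(settle_seconds)
    return done
```

Stopping at the first failure keeps at most one host in an unknown state, so a failed reimage can be investigated before the next host is taken down.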

Event Timeline

fdans triaged this task as Medium priority. Apr 23 2018, 3:54 PM
fdans raised the priority of this task from Medium to High.
fdans lowered the priority of this task from High to Medium.
fdans moved this task from Incoming to Operational Excellence on the Analytics board.

All druid nodes are running Druid 0.10 and zookeeper 3.4.9, so we can do the work anytime.

Change 434634 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::druid::*: move prometheus jmx exp's config to /etc/prometheus

https://gerrit.wikimedia.org/r/434634

Change 434634 merged by Elukey:
[operations/puppet@production] profile::druid::*: move prometheus jmx exp's config to /etc/prometheus

https://gerrit.wikimedia.org/r/434634

I am upgrading the Druid labs cluster to Stretch as a test, and two things came up:

  1. the druid debs were missing from stretch-wikimedia, so I uploaded them.
  2. as happened with Hadoop, the prometheus agent saved its config files under /etc/druid, but that directory is created by the druid packages, and this caused a race condition during the first puppet run. I moved the config files to /etc/prometheus and explicitly required the prometheus jmx agent to be installed before druid, which avoids the race condition completely. Prod will pick up the change during the next restart/reimage/upgrade/etc.
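
The effect of an explicit ordering requirement can be illustrated with a small dependency graph. This is a sketch in Python, not the actual puppet change: the resource names below are hypothetical labels, and the standard-library `graphlib` module (Python 3.9+) is used only to show that declaring the dependency makes the order deterministic.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical resource labels; each key maps to the resources it requires.
REQUIRES: Dict[str, Set[str]] = {
    "package:prometheus-jmx-agent": set(),
    "file:jmx-exporter-config": {"package:prometheus-jmx-agent"},
    "package:druid": {"package:prometheus-jmx-agent"},
}


def apply_order(graph: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which every resource comes after its requirements."""
    return list(TopologicalSorter(graph).static_order())


order = apply_order(REQUIRES)
# The agent has no requirements and everything else requires it,
# so it is always applied first.
print(order[0])
```

Without the explicit edges, nothing constrains which resource is applied first, which is the kind of first-run race described above.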

Change 434870 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set druid* PXE boot to Debian Stretch

https://gerrit.wikimedia.org/r/434870

Change 434870 merged by Elukey:
[operations/puppet@production] Set druid* PXE boot to Debian Stretch

https://gerrit.wikimedia.org/r/434870

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['druid1003.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201805280657_elukey_19222.log.

Completed auto-reimage of hosts:

['druid1003.eqiad.wmnet']

Of which those FAILED:

['druid1003.eqiad.wmnet']

Change 435727 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Override druid1003's zookeeper version after reimage

https://gerrit.wikimedia.org/r/435727

Change 435727 merged by Elukey:
[operations/puppet@production] Override druid1003's zookeeper version after reimage

https://gerrit.wikimedia.org/r/435727

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['druid1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201805290749_elukey_1188.log.

Change 435970 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set zookeeper verion for druid1002

https://gerrit.wikimedia.org/r/435970

Change 435970 merged by Elukey:
[operations/puppet@production] Set zookeeper verion for druid1002

https://gerrit.wikimedia.org/r/435970

Completed auto-reimage of hosts:

['druid1002.eqiad.wmnet']

and were ALL successful.

Change 435983 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::druid::analytics|public::worker: remove hadoop-cdh client dep

https://gerrit.wikimedia.org/r/435983

Change 435983 merged by Elukey:
[operations/puppet@production] role::druid::analytics|public::worker: remove hadoop-cdh client dep

https://gerrit.wikimedia.org/r/435983

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['druid1001.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201805300617_elukey_16317.log.

Change 436220 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::druid::analytics::worker: set zookeeper version to 3.4.9-3

https://gerrit.wikimedia.org/r/436220

Change 436220 merged by Elukey:
[operations/puppet@production] role::druid::analytics::worker: set zookeeper version to 3.4.9-3

https://gerrit.wikimedia.org/r/436220

Completed auto-reimage of hosts:

['druid1001.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['druid1004.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201805300846_elukey_15355.log.

Change 436236 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Override zookeeper version for druid1004

https://gerrit.wikimedia.org/r/436236

Change 436236 merged by Elukey:
[operations/puppet@production] Override zookeeper version for druid1004

https://gerrit.wikimedia.org/r/436236

Completed auto-reimage of hosts:

['druid1004.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['druid1005.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201805310559_elukey_21502.log.

Completed auto-reimage of hosts:

['druid1005.eqiad.wmnet']

Of which those FAILED:

['druid1005.eqiad.wmnet']

Change 436450 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Override druid1005's zookeeper settings

https://gerrit.wikimedia.org/r/436450

Change 436450 merged by Elukey:
[operations/puppet@production] Override druid1005's zookeeper settings

https://gerrit.wikimedia.org/r/436450

Change 436487 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::druid::public: set zookeeper version to 3.4.9-3

https://gerrit.wikimedia.org/r/436487

Change 436487 merged by Elukey:
[operations/puppet@production] role::druid::public: set zookeeper version to 3.4.9-3

https://gerrit.wikimedia.org/r/436487

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['druid1006.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201805310934_elukey_9600.log.

Completed auto-reimage of hosts:

['druid1006.eqiad.wmnet']

and were ALL successful.

elukey set the point value for this task to 8. May 31 2018, 10:04 AM
elukey moved this task from Ready to Deploy to Done on the Analytics-Kanban board.