Upgrade Druid to Debian Buster
Closed, Resolved | Public | 8 Estimated Story Points

Description

Today I tested an upgrade of Zookeeper in labs, and it seems doable to upgrade to Buster one node at a time. This should unblock the Druid reimages, provided we accept losing the Druid cache on each reimaged node (that shouldn't be a big deal, since we now have clusters of 5 nodes).

Event Timeline

elukey triaged this task as Medium priority.

Change 599770 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set Debian Buster for druid1001

https://gerrit.wikimedia.org/r/599770

Change 599770 merged by Elukey:
[operations/puppet@production] Set Debian Buster for druid1001

https://gerrit.wikimedia.org/r/599770

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

druid1001.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202005291219_elukey_47022_druid1001_eqiad_wmnet.log.

12:39:38 | druid1001.eqiad.wmnet | WARNING: unable to verify that BIOS boot parameters are back to normal, got:
Boot parameter version: 1
Boot parameter 5 is valid/unlocked
Boot parameter data: 0004000000
 Boot Flags :
   - Boot Flag Invalid
   - Options apply to only next boot
   - BIOS PC Compatible (legacy) boot
   - Boot Device Selector : Force PXE
   - Console Redirection control : System Default
   - BIOS verbosity : Console redirection occurs per BIOS configuration setting (default)
   - BIOS Mux Control Override : BIOS uses recommended setting of the mux at the end of POST

Change 599829 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add specific overrides for the zookeeper version on druid1001

https://gerrit.wikimedia.org/r/599829

Change 599829 merged by Elukey:
[operations/puppet@production] Add specific overrides for the zookeeper version on druid1001

https://gerrit.wikimedia.org/r/599829

Completed auto-reimage of hosts:

['druid1001.eqiad.wmnet']

and were ALL successful.

Change 599846 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Force Java 11 JRE for zookeeper on druid1001

https://gerrit.wikimedia.org/r/599846

Change 599846 merged by Elukey:
[operations/puppet@production] Force Java 11 JRE for zookeeper on druid1001

https://gerrit.wikimedia.org/r/599846

Change 602016 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set Debian Buster for druid100[2,3]

https://gerrit.wikimedia.org/r/602016

Change 602016 merged by Elukey:
[operations/puppet@production] Set Debian Buster for druid100[2,3]

https://gerrit.wikimedia.org/r/602016

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

druid1002.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202006030849_elukey_57135_druid1002_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['druid1002.eqiad.wmnet']

Of which those FAILED:

['druid1002.eqiad.wmnet']

Change 602035 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add zookeeper overrides for druid1002

https://gerrit.wikimedia.org/r/602035

Change 602035 merged by Elukey:
[operations/puppet@production] Add zookeeper overrides for druid1002

https://gerrit.wikimedia.org/r/602035

Change 602095 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::druid::analytics::worker: upgrade druid1003's settings for Buster

https://gerrit.wikimedia.org/r/602095

Change 602095 merged by Elukey:
[operations/puppet@production] role::druid::analytics::worker: upgrade druid1003's settings for Buster

https://gerrit.wikimedia.org/r/602095

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

druid1003.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202006040513_elukey_217371_druid1003_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['druid1003.eqiad.wmnet']

Of which those FAILED:

['druid1003.eqiad.wmnet']
05:33:10 | druid1003.eqiad.wmnet | WARNING: unable to verify that BIOS boot parameters are back to normal, got:
Boot parameter version: 1
Boot parameter 5 is valid/unlocked
Boot parameter data: 0004000000
 Boot Flags :
   - Boot Flag Invalid
   - Options apply to only next boot
   - BIOS PC Compatible (legacy) boot
   - Boot Device Selector : Force PXE
   - Console Redirection control : System Default
   - BIOS verbosity : Console redirection occurs per BIOS configuration setting (default)
   - BIOS Mux Control Override : BIOS uses recommended setting of the mux at the end of POST

Interesting: the last step of the reimage failed with:

05:33:42 | cumin1001.eqiad.wmnet | Puppet run completed
05:33:42 | druid1003.eqiad.wmnet | Rebooted host
05:36:20 | druid1003.eqiad.wmnet | Uptime checked
05:36:20 | druid1003.eqiad.wmnet | Host up
05:36:20 | druid1003.eqiad.wmnet | Polling the completion of a Puppet run
05:37:23 | druid1003.eqiad.wmnet | Unable to run wmf-auto-reimage-host: could not convert string to float: "Warning: Permanently added the ECDSA host key for IP address '2620:0:861:108:10:64:53:103' to the list of known hosts.\n1591249036"
05:37:23 | druid1003.eqiad.wmnet | REIMAGE END | retcode=2

Cc @Volans
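
The error in the log above reads like `float()` being applied to the whole captured output, SSH warning included. A minimal sketch of a more tolerant parse, assuming the value of interest is always the last non-empty line of the output; `parse_epoch` is a hypothetical helper written for illustration, not a function of wmf-auto-reimage:

```python
def parse_epoch(raw_output: str) -> float:
    """Parse the last non-empty line of raw_output as a float timestamp.

    Hypothetical helper: assumes any warnings are printed before the value.
    """
    lines = [line.strip() for line in raw_output.splitlines() if line.strip()]
    if not lines:
        raise ValueError("no output to parse")
    return float(lines[-1])


# Output string copied from the failed run reported in the log.
raw = (
    "Warning: Permanently added the ECDSA host key for IP address "
    "'2620:0:861:108:10:64:53:103' to the list of known hosts.\n1591249036"
)

try:
    float(raw)  # fails the same way as the reimage script did
except ValueError as exc:
    print(f"naive parse failed: {exc}")

print(parse_epoch(raw))
```

This only hides the symptom; the unexpected warning is still worth investigating, since it means the host key was not known for that address.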

Change 602309 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::druid::analytics::worker: add java 11

https://gerrit.wikimedia.org/r/602309

Change 602310 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Prepare druid1004 for Debian Buster

https://gerrit.wikimedia.org/r/602310

Change 602309 merged by Elukey:
[operations/puppet@production] profile::druid::analytics::worker: add java 11

https://gerrit.wikimedia.org/r/602309

Change 602310 merged by Elukey:
[operations/puppet@production] Prepare druid1004 for Debian Buster

https://gerrit.wikimedia.org/r/602310

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

druid1004.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202006041056_elukey_33622_druid1004_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['druid1004.eqiad.wmnet']

Of which those FAILED:

['druid1004.eqiad.wmnet']

Change 602340 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set Debian Buster for druid100[4,5,6]

https://gerrit.wikimedia.org/r/602340

Change 602340 merged by Elukey:
[operations/puppet@production] Set Debian Buster for druid100[4,5,6]

https://gerrit.wikimedia.org/r/602340

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

druid1004.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202006041132_elukey_39926_druid1004_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['druid1004.eqiad.wmnet']

and were ALL successful.

Change 602552 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Prepare druid1005 for reimage

https://gerrit.wikimedia.org/r/602552

Change 602552 merged by Elukey:
[operations/puppet@production] Prepare druid1005 for reimage

https://gerrit.wikimedia.org/r/602552

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

druid1005.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202006050603_elukey_182384_druid1005_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['druid1005.eqiad.wmnet']

and were ALL successful.

@elukey thanks for bringing it to my attention, that's interesting.

It failed because the warning message was captured together with the real output of the command, as you can see in:

Warning: Permanently added the ECDSA host key for IP address '2620:0:861:108:10:64:53:103' to the list of known hosts.
1591249036

It is as if cumin1001 did not have the fingerprint of druid1003, even though we force a Puppet run on the cumin host before the reboot exactly for this reason, and the logs show that at 05:33:29 the known hosts file was updated with:

Jun  4 05:33:29 cumin1001 puppet-agent[219296]: (/Stage[main]/Ssh::Client/File[/etc/ssh/ssh_known_hosts]/content) -druid1003.eqiad.wmnet,druid1003,10.64.53.103,2620:0:861:108:10:64:53:103 ecdsa-sha2-nistp256 [...OMIT...]
Jun  4 05:33:29 cumin1001 puppet-agent[219296]: (/Stage[main]/Ssh::Client/File[/etc/ssh/ssh_known_hosts]/content) +druid1003.eqiad.wmnet,druid1003,10.64.53.103,2620:0:861:108:1e98:ecff:fe29:e278 ecdsa-sha2-nistp256 [...OMIT...]

I can't find from the logs a reasonable explanation right now of why it happened.

And as soon as I pressed submit I actually noticed it... the problem is the IPv6 address.
It seems that the address exported to PuppetDB was not the v4 address mapped to v6, so the known hosts file was updated with 2620:0:861:108:1e98:ecff:fe29:e278, and only later, at 06:05:06 (the next Puppet run), was it updated to match 2620:0:861:108:10:64:53:103.
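
To make the difference between the two addresses concrete, here is a rough sketch using Python's standard `ipaddress` module. It assumes the mapped form simply writes each decimal IPv4 octet as one IPv6 group (which is what the addresses in the logs look like) and that the prefix is a /64; `mapped_v6` and `looks_like_eui64` are hypothetical helper names, not part of Puppet or the reimage script. The `ff:fe` bytes in the first address are the usual marker of an interface ID derived from a MAC address (modified EUI-64).

```python
import ipaddress


def mapped_v6(prefix: str, ipv4: str) -> ipaddress.IPv6Address:
    """Build the v6 address whose last four groups spell out the IPv4 octets.

    Assumption: each decimal octet is reused verbatim as a hex group,
    e.g. 10.64.53.103 -> ...:10:64:53:103.
    """
    net = ipaddress.IPv6Network(prefix)
    groups = [int(octet, 16) for octet in ipv4.split(".")]
    interface_id = 0
    for group in groups:
        interface_id = (interface_id << 16) | group
    return net[interface_id]


def looks_like_eui64(addr: str) -> bool:
    """True if the interface ID carries the ff:fe marker of modified EUI-64."""
    packed = ipaddress.IPv6Address(addr).packed
    return packed[11:13] == b"\xff\xfe"


# The /64 prefix length is an assumption; the addresses come from the logs.
expected = mapped_v6("2620:0:861:108::/64", "10.64.53.103")
print(expected)
print(looks_like_eui64("2620:0:861:108:1e98:ecff:fe29:e278"))
print(looks_like_eui64(str(expected)))
```

A check along these lines could flag a freshly reimaged host that exports its autoconfigured address before the mapped one is in place.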

Change 602676 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Prepare druid1006 for Debian Buster

https://gerrit.wikimedia.org/r/602676

Change 602676 merged by Elukey:
[operations/puppet@production] Prepare druid1006 for Debian Buster

https://gerrit.wikimedia.org/r/602676

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

druid1006.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202006051300_elukey_150570_druid1006_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['druid1006.eqiad.wmnet']

and were ALL successful.

elukey set the point value for this task to 8. (Jun 5 2020, 1:36 PM)
Milimetric moved this task from In Progress to Done on the Analytics-Kanban board.
Milimetric subscribed.

we got a buster cluster!