Today I tested an upgrade of Zookeeper in labs, and it seems feasible to upgrade to Buster one node at a time. This should unblock the Druid reimages, provided we accept losing the Druid cache on each reimaged node (which shouldn't be a big deal, since we now have clusters of 5 nodes).
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | elukey | T234629 Move the Analytics infrastructure to Debian Buster
Resolved | | elukey | T253980 Upgrade Druid to Debian Buster
Event Timeline
Change 599770 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set Debian Buster for druid1001
Change 599770 merged by Elukey:
[operations/puppet@production] Set Debian Buster for druid1001
Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:
druid1001.eqiad.wmnet
The log can be found in /var/log/wmf-auto-reimage/202005291219_elukey_47022_druid1001_eqiad_wmnet.log.
12:39:38 | druid1001.eqiad.wmnet | WARNING: unable to verify that BIOS boot parameters are back to normal, got:
Boot parameter version: 1
Boot parameter 5 is valid/unlocked
Boot parameter data: 0004000000
Boot Flags :
- Boot Flag Invalid
- Options apply to only next boot
- BIOS PC Compatible (legacy) boot
- Boot Device Selector : Force PXE
- Console Redirection control : System Default
- BIOS verbosity : Console redirection occurs per BIOS configuration setting (default)
- BIOS Mux Control Override : BIOS uses recommended setting of the mux at the end of POST
Change 599829 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add specific overrides for the zookeeper version on druid1001
Change 599829 merged by Elukey:
[operations/puppet@production] Add specific overrides for the zookeeper version on druid1001
Change 599846 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Force Java 11 JRE for zookeeper on druid1001
Change 599846 merged by Elukey:
[operations/puppet@production] Force Java 11 JRE for zookeeper on druid1001
Change 602016 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set Debian Buster for druid100[2,3]
Change 602016 merged by Elukey:
[operations/puppet@production] Set Debian Buster for druid100[2,3]
Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:
druid1002.eqiad.wmnet
The log can be found in /var/log/wmf-auto-reimage/202006030849_elukey_57135_druid1002_eqiad_wmnet.log.
Completed auto-reimage of hosts:
['druid1002.eqiad.wmnet']
Of which those FAILED:
['druid1002.eqiad.wmnet']
Change 602035 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add zookeeper overrides for druid1002
Change 602035 merged by Elukey:
[operations/puppet@production] Add zookeeper overrides for druid1002
Change 602095 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::druid::analytics::worker: upgrade druid1003's settings for Buster
Change 602095 merged by Elukey:
[operations/puppet@production] role::druid::analytics::worker: upgrade druid1003's settings for Buster
Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:
druid1003.eqiad.wmnet
The log can be found in /var/log/wmf-auto-reimage/202006040513_elukey_217371_druid1003_eqiad_wmnet.log.
Completed auto-reimage of hosts:
['druid1003.eqiad.wmnet']
Of which those FAILED:
['druid1003.eqiad.wmnet']
05:33:10 | druid1003.eqiad.wmnet | WARNING: unable to verify that BIOS boot parameters are back to normal, got:
Boot parameter version: 1
Boot parameter 5 is valid/unlocked
Boot parameter data: 0004000000
Boot Flags :
- Boot Flag Invalid
- Options apply to only next boot
- BIOS PC Compatible (legacy) boot
- Boot Device Selector : Force PXE
- Console Redirection control : System Default
- BIOS verbosity : Console redirection occurs per BIOS configuration setting (default)
- BIOS Mux Control Override : BIOS uses recommended setting of the mux at the end of POST
Interesting: the last step of the reimage failed with:
05:33:42 | cumin1001.eqiad.wmnet | Puppet run completed
05:33:42 | druid1003.eqiad.wmnet | Rebooted host
05:36:20 | druid1003.eqiad.wmnet | Uptime checked
05:36:20 | druid1003.eqiad.wmnet | Host up
05:36:20 | druid1003.eqiad.wmnet | Polling the completion of a Puppet run
05:37:23 | druid1003.eqiad.wmnet | Unable to run wmf-auto-reimage-host: could not convert string to float: "Warning: Permanently added the ECDSA host key for IP address '2620:0:861:108:10:64:53:103' to the list of known hosts.\n1591249036"
05:37:23 | druid1003.eqiad.wmnet | REIMAGE END | retcode=2
Cc @Volans
Change 602309 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::druid::analytics::worker: add java 11
Change 602310 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Prepare druid1004 for Debian Buster
Change 602309 merged by Elukey:
[operations/puppet@production] profile::druid::analytics::worker: add java 11
Change 602310 merged by Elukey:
[operations/puppet@production] Prepare druid1004 for Debian Buster
Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:
druid1004.eqiad.wmnet
The log can be found in /var/log/wmf-auto-reimage/202006041056_elukey_33622_druid1004_eqiad_wmnet.log.
Completed auto-reimage of hosts:
['druid1004.eqiad.wmnet']
Of which those FAILED:
['druid1004.eqiad.wmnet']
Change 602340 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set Debian Buster for druid100[4,5,6]
Change 602340 merged by Elukey:
[operations/puppet@production] Set Debian Buster for druid100[4,5,6]
Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:
druid1004.eqiad.wmnet
The log can be found in /var/log/wmf-auto-reimage/202006041132_elukey_39926_druid1004_eqiad_wmnet.log.
Change 602552 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Prepare druid1005 for reimage
Change 602552 merged by Elukey:
[operations/puppet@production] Prepare druid1005 for reimage
Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:
druid1005.eqiad.wmnet
The log can be found in /var/log/wmf-auto-reimage/202006050603_elukey_182384_druid1005_eqiad_wmnet.log.
@elukey thanks for bringing it to my attention, that's interesting.
The reason it failed is that an SSH warning message was returned together with the real output of the command, as you can see in:
Warning: Permanently added the ECDSA host key for IP address '2620:0:861:108:10:64:53:103' to the list of known hosts.
1591249036
It is as if cumin1001 did not have the host key of druid1003, even though we force a Puppet run on the cumin host before the reboot for exactly this reason, and the logs show that at 05:33:29 the known hosts file was updated with:
Jun 4 05:33:29 cumin1001 puppet-agent[219296]: (/Stage[main]/Ssh::Client/File[/etc/ssh/ssh_known_hosts]/content) -druid1003.eqiad.wmnet,druid1003,10.64.53.103,2620:0:861:108:10:64:53:103 ecdsa-sha2-nistp256 [...OMIT...]
Jun 4 05:33:29 cumin1001 puppet-agent[219296]: (/Stage[main]/Ssh::Client/File[/etc/ssh/ssh_known_hosts]/content) +druid1003.eqiad.wmnet,druid1003,10.64.53.103,2620:0:861:108:1e98:ecff:fe29:e278 ecdsa-sha2-nistp256 [...OMIT...]
I can't find a reasonable explanation in the logs right now for why it happened.
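To illustrate the parsing failure described above, here is a minimal sketch in Python of why converting the whole captured output to a float breaks, and one way a caller could tolerate stray warning lines. This is not the actual wmf-auto-reimage code; the function name `parse_last_line_as_float` is hypothetical, and the sketch assumes the value of interest is always the last non-empty line of the output.

```python
def parse_last_line_as_float(raw_output: str) -> float:
    """Return the last non-empty line of raw_output as a float.

    Hypothetical helper: assumes the numeric value is the final line and
    that anything before it (e.g. an SSH warning) can be ignored.
    """
    lines = [line.strip() for line in raw_output.splitlines() if line.strip()]
    if not lines:
        raise ValueError("command produced no output")
    return float(lines[-1])


# The exact string reported in the reimage log above.
raw = (
    "Warning: Permanently added the ECDSA host key for IP address "
    "'2620:0:861:108:10:64:53:103' to the list of known hosts.\n"
    "1591249036"
)

# float(raw) raises ValueError because of the leading warning line,
# while taking only the last line yields the number.
print(parse_last_line_as_float(raw))
```

Ignoring the warning would only hide the symptom; the unexpected "Permanently added" message is itself a useful signal that the known hosts file did not match the host.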
And as soon as I pressed submit I actually noticed it... the problem is the IPv6 address.
It seems that the address exported to PuppetDB was not the IPv4 address mapped to IPv6, so the known hosts file was first updated with 2620:0:861:108:1e98:ecff:fe29:e278, and only later, at 06:05:06 (the next Puppet run), was it updated to match 2620:0:861:108:10:64:53:103.
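The mapping between the two addresses above can be sketched as follows. Judging from the known hosts entries, the mapped IPv6 address appears to be built by writing each decimal octet of the IPv4 address as one group of the interface identifier. The function name `mapped_v6` and the `/64` prefix length are assumptions made for illustration, not something taken from the Puppet code.

```python
import ipaddress


def mapped_v6(prefix: str, v4: str) -> ipaddress.IPv6Address:
    """Build the IPv6 address whose last four groups spell the IPv4 octets.

    Hypothetical helper: each decimal octet (e.g. "103") is reused verbatim
    as a hexadecimal group (0x103) in the low 64 bits of the address.
    """
    network = ipaddress.IPv6Network(prefix)
    ipaddress.IPv4Address(v4)  # raises ValueError if v4 is not a valid address
    interface_id = 0
    for octet in v4.split("."):
        interface_id = (interface_id << 16) | int(octet, 16)
    return ipaddress.IPv6Address(int(network.network_address) | interface_id)


# Prefix assumed to be a /64, based on the addresses in the log above.
expected = mapped_v6("2620:0:861:108::/64", "10.64.53.103")
exported = ipaddress.IPv6Address("2620:0:861:108:1e98:ecff:fe29:e278")

print(expected)
print(exported == expected)
```

A comparison like this shows that the address first written to the known hosts file is not the mapped one, which is consistent with SSH reporting a new host key for the mapped address.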
Change 602676 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Prepare druid1006 for Debian Buster
Change 602676 merged by Elukey:
[operations/puppet@production] Prepare druid1006 for Debian Buster
Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:
druid1006.eqiad.wmnet
The log can be found in /var/log/wmf-auto-reimage/202006051300_elukey_150570_druid1006_eqiad_wmnet.log.