Install Debian Buster on Hadoop
Closed, Resolved | Public | 0 Estimated Story Points

Description

The upgrade of the Hadoop cluster(s) to Debian Buster might be more complicated than we thought, because openjdk-8 is not available on Debian Buster. In T229347 Andrew was able to install it on stat1005 because openjdk-8 was still present in Buster before its final release; it is no longer there, so a reimaged host would not find the package.

The above becomes problematic due to the following constraints:

  1. Spark 2.3 (our current version) doesn't support Java 11 (see also T229347#5394326). IIUC this is due to the Scala version it uses (2.11), which doesn't support Java 11 (https://docs.scala-lang.org/overviews/jdk-compatibility/overview.html)
  2. Java 11 support in Scala 2.12+ is still incomplete - https://docs.scala-lang.org/overviews/jdk-compatibility/overview.html#jdk-11-compatibility-notes
  3. Spark 2.4 comes with Scala 2.12, which offers experimental support for Java 11

Also, openjdk-11 is available in stretch-backports: https://packages.debian.org/stretch-backports/openjdk-11-jdk
Last but not least, we'd also need to make sure that the HDFS/Yarn daemons work correctly on Buster and Java 11. CDH supports Java 11 only from 6.3 onward: https://www.cloudera.com/documentation/enterprise/upgrade/topics/ug_jdk8.html

But it is also true that CDH 6.3 ships with Spark 2.4, so either they support Java 11 as an experimental feature or there is a way to make Spark 2.4 work: https://www.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_63_packaging.html

Considerations:

  • I am not a Scala/Spark expert, so what I wrote above might not be true; please double check and correct me if needed :)
  • Backporting openjdk-8 to Buster is possible, but it would require a big effort from the SRE team. The last backport of openjdk-8, made for Cassandra on Debian Jessie, still needs to be maintained (applying patches for Debian Security Advisories, etc.), so it would be preferable not to go down that road again.

Details

Repo               Branch      Lines +/-
operations/puppet  production  +1 -0
operations/puppet  production  +9 -0
operations/puppet  production  +1 -11
operations/puppet  production  +2 -2
operations/puppet  production  +2 -2
operations/puppet  production  +2 -2
operations/puppet  production  +6 -1
operations/puppet  production  +3 -0
operations/puppet  production  +3 -3
operations/puppet  production  +14 -1
operations/puppet  production  +2 -2
operations/puppet  production  +78 -6
operations/puppet  production  +10 -0
operations/puppet  production  +49 -4
operations/puppet  production  +82 -0
operations/puppet  production  +16 -4
operations/puppet  production  +17 -6
operations/puppet  production  +10 -1
operations/puppet  production  +0 -81
operations/puppet  production  +1 -0
operations/puppet  production  +26 -4
Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 667668 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set an-worker1098 as Hadoop GPU Buster worker

https://gerrit.wikimedia.org/r/667668

Change 667668 merged by Elukey:
[operations/puppet@production] Set an-worker1098 as Hadoop GPU Buster worker

https://gerrit.wikimedia.org/r/667668

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1098.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103011740_elukey_17551.log.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1098.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103011755_elukey_19122.log.

Completed auto-reimage of hosts:

['an-worker1098.eqiad.wmnet']

and were ALL successful.

Pausing this for a few days to let the MW history jobs complete :)

Change 668085 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Update GPU settings for Hadoop workers to ROCm 3.8

https://gerrit.wikimedia.org/r/668085

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1099.eqiad.wmnet', 'an-worker1100.eqiad.wmnet', 'an-worker1101.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103031420_elukey_26330.log.

Change 668085 merged by Elukey:
[operations/puppet@production] Update GPU settings for Hadoop workers to ROCm 3.8

https://gerrit.wikimedia.org/r/668085

Completed auto-reimage of hosts:

['an-worker1099.eqiad.wmnet', 'an-worker1100.eqiad.wmnet', 'an-worker1101.eqiad.wmnet']

and were ALL successful.

Change 668106 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::hadoop::worker: set linux 5.10 on GPU workers

https://gerrit.wikimedia.org/r/668106

Change 668106 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::hadoop::worker: set linux 5.10 on GPU workers

https://gerrit.wikimedia.org/r/668106

Change 668337 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::hadoop::worker: add gpu-users

https://gerrit.wikimedia.org/r/668337

Change 668337 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::hadoop::worker: add gpu-users

https://gerrit.wikimedia.org/r/668337

elukey@cumin1001:~$ sudo cumin 'A:hadoop-worker' 'cat /etc/debian_version'
78 hosts will be targeted:
an-worker[1078-1128,1130-1132,1135-1138].eqiad.wmnet,analytics[1058-1077].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(26) an-worker[1096-1101,1117-1128,1130-1132,1135-1138].eqiad.wmnet,analytics1058.eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
10.8
===== NODE GROUP =====
(52) an-worker[1078-1095,1102-1116].eqiad.wmnet,analytics[1059-1077].eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
9.13
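
For tracking progress between reimage batches, the cumin output above can be tallied mechanically. The following is a minimal Python sketch, not part of the actual workflow: the names `GROUP_RE` and `count_by_output` are made up for illustration, and it assumes every node group prints exactly one line of output, as `cat /etc/debian_version` does here.

```python
import re
from typing import Dict

# Matches the "(N) host-expression" line printed for each node group.
GROUP_RE = re.compile(r"^\((\d+)\)\s+(\S+)")


def count_by_output(cumin_text: str) -> Dict[str, int]:
    """Map each distinct one-line command output to the number of hosts that returned it."""
    counts: Dict[str, int] = {}
    lines = [line.strip() for line in cumin_text.splitlines()]
    i = 0
    while i < len(lines):
        match = GROUP_RE.match(lines[i])
        # Expected layout: "(N) hosts", then "----- OUTPUT of ... -----", then the output line.
        if match and i + 2 < len(lines) and lines[i + 1].startswith("----- OUTPUT"):
            output = lines[i + 2]
            counts[output] = counts.get(output, 0) + int(match.group(1))
            i += 3
        else:
            i += 1
    return counts


sample = """\
===== NODE GROUP =====
(26) an-worker[1096-1101,1117-1128,1130-1132,1135-1138].eqiad.wmnet,analytics1058.eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
10.8
===== NODE GROUP =====
(52) an-worker[1078-1095,1102-1116].eqiad.wmnet,analytics[1059-1077].eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
9.13
"""

progress = count_by_output(sample)
print(progress)
```

With the sample above, the two groups add up to the 78 targeted hosts: 26 on Buster (10.8) and 52 still on Stretch (9.13).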

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['analytics1059.eqiad.wmnet', 'analytics1060.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103041047_elukey_17965.log.

Completed auto-reimage of hosts:

['analytics1060.eqiad.wmnet', 'analytics1059.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['analytics1061.eqiad.wmnet', 'analytics1062.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103041254_elukey_8281.log.

Completed auto-reimage of hosts:

['analytics1061.eqiad.wmnet', 'analytics1062.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['analytics1063.eqiad.wmnet', 'analytics1064.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103041339_elukey_21178.log.

Completed auto-reimage of hosts:

['analytics1063.eqiad.wmnet', 'analytics1064.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['analytics1065.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103041426_elukey_4245.log.

Completed auto-reimage of hosts:

['analytics1065.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['analytics1066.eqiad.wmnet', 'analytics1067.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103041511_elukey_15956.log.

Completed auto-reimage of hosts:

['analytics1066.eqiad.wmnet', 'analytics1067.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['analytics1068.eqiad.wmnet', 'analytics1069.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103041640_elukey_17903.log.

Completed auto-reimage of hosts:

['analytics1068.eqiad.wmnet', 'analytics1069.eqiad.wmnet']

and were ALL successful.

elukey@cumin1001:~$ sudo cumin 'A:hadoop-worker' 'cat /etc/debian_version'
78 hosts will be targeted:
an-worker[1078-1128,1130-1132,1135-1138].eqiad.wmnet,analytics[1058-1077].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(37) an-worker[1096-1101,1117-1128,1130-1132,1135-1138].eqiad.wmnet,analytics[1058-1069].eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
10.8
===== NODE GROUP =====
(41) an-worker[1078-1095,1102-1116].eqiad.wmnet,analytics[1070-1077].eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
9.13

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['analytics1070.eqiad.wmnet', 'analytics1071.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103050745_elukey_17225.log.

Completed auto-reimage of hosts:

['analytics1071.eqiad.wmnet', 'analytics1070.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1078.eqiad.wmnet', 'an-worker1079.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103050839_elukey_29542.log.

Completed auto-reimage of hosts:

['an-worker1079.eqiad.wmnet', 'an-worker1078.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['analytics1073.eqiad.wmnet', 'an-worker1086.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103051518_elukey_9071.log.

Completed auto-reimage of hosts:

['analytics1073.eqiad.wmnet', 'an-worker1086.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['analytics1074.eqiad.wmnet', 'analytics1075.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103080731_elukey_14418.log.

Remaining nodes to reimage and their racking:

an-worker1080.eqiad.wmnet:  /eqiad/A/4

an-worker1081.eqiad.wmnet:  /eqiad/A/7
an-worker1082.eqiad.wmnet:  /eqiad/A/7
an-worker1103.eqiad.wmnet:  /eqiad/A/7

an-worker1083.eqiad.wmnet:  /eqiad/B/2
an-worker1084.eqiad.wmnet:  /eqiad/B/2

analytics1072.eqiad.wmnet:  /eqiad/B/3

an-worker1085.eqiad.wmnet:  /eqiad/B/4

an-worker1087.eqiad.wmnet:  /eqiad/B/7

an-worker1088.eqiad.wmnet:  /eqiad/C/2
an-worker1104.eqiad.wmnet:  /eqiad/C/2

an-worker1089.eqiad.wmnet:  /eqiad/C/4
an-worker1090.eqiad.wmnet:  /eqiad/C/4
an-worker1105.eqiad.wmnet:  /eqiad/C/4
an-worker1106.eqiad.wmnet:  /eqiad/C/4
an-worker1107.eqiad.wmnet:  /eqiad/C/4
an-worker1108.eqiad.wmnet:  /eqiad/C/4

an-worker1091.eqiad.wmnet:  /eqiad/C/7
an-worker1109.eqiad.wmnet:  /eqiad/C/7
an-worker1110.eqiad.wmnet:  /eqiad/C/7

an-worker1111.eqiad.wmnet:  /eqiad/C/8

analytics1076.eqiad.wmnet:  /eqiad/D/2
an-worker1092.eqiad.wmnet:  /eqiad/D/2
an-worker1093.eqiad.wmnet:  /eqiad/D/2
an-worker1112.eqiad.wmnet:  /eqiad/D/2

an-worker1113.eqiad.wmnet:  /eqiad/D/5
an-worker1114.eqiad.wmnet:  /eqiad/D/5

analytics1077.eqiad.wmnet:  /eqiad/D/7
an-worker1094.eqiad.wmnet:  /eqiad/D/7
an-worker1095.eqiad.wmnet:  /eqiad/D/7
an-worker1115.eqiad.wmnet:  /eqiad/D/7
an-worker1116.eqiad.wmnet:  /eqiad/D/7
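
The batches launched after this list take at most two hosts at a time, and the pairs come from the same rack. A minimal Python sketch of that batching, assuming this is the intended policy (the task does not state it explicitly); `parse_racking` and `same_rack_batches` are hypothetical helper names, and the sample is a short excerpt of the list above.

```python
from collections import defaultdict
from typing import Dict, Iterator, List


def parse_racking(text: str) -> Dict[str, List[str]]:
    """Group hosts by rack from lines like 'host.eqiad.wmnet:  /eqiad/A/4'."""
    racks: Dict[str, List[str]] = defaultdict(list)
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip the blank separator lines
        host, rest = line.split(":", 1)
        fields = rest.split()
        if fields:
            # The first field is the rack path; any trailing annotation is ignored.
            racks[fields[0]].append(host.strip())
    return dict(racks)


def same_rack_batches(racks: Dict[str, List[str]], size: int = 2) -> Iterator[List[str]]:
    """Yield batches of at most `size` hosts, never mixing racks within a batch."""
    for rack in sorted(racks):
        hosts = racks[rack]
        for start in range(0, len(hosts), size):
            yield hosts[start:start + size]


sample = """\
an-worker1080.eqiad.wmnet:  /eqiad/A/4

an-worker1081.eqiad.wmnet:  /eqiad/A/7
an-worker1082.eqiad.wmnet:  /eqiad/A/7
an-worker1103.eqiad.wmnet:  /eqiad/A/7

an-worker1083.eqiad.wmnet:  /eqiad/B/2
an-worker1084.eqiad.wmnet:  /eqiad/B/2
"""

batches = list(same_rack_batches(parse_racking(sample)))
for batch in batches:
    print(batch)
```

Each yielded batch could then be passed to one reimage run; the order within a rack simply follows the order of the input list.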

Completed auto-reimage of hosts:

['analytics1074.eqiad.wmnet', 'analytics1075.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1081.eqiad.wmnet', 'an-worker1082.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103080827_elukey_28297.log.

Completed auto-reimage of hosts:

['an-worker1081.eqiad.wmnet', 'an-worker1082.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1083.eqiad.wmnet', 'an-worker1084.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103080928_elukey_9830.log.

Completed auto-reimage of hosts:

['an-worker1083.eqiad.wmnet', 'an-worker1084.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1088.eqiad.wmnet', 'an-worker1104.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103081101_elukey_17596.log.

Completed auto-reimage of hosts:

['an-worker1088.eqiad.wmnet']

Of which those FAILED:

['an-worker1104.eqiad.wmnet']
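
Since a failed host is easy to miss among many identical-looking completion messages, the outcome can be extracted from the message text. A minimal Python sketch, assuming the message layout seen in this task (a "Completed auto-reimage of hosts:" header, optionally followed by "Of which those FAILED:", each followed by a Python-style list); `reimage_outcome` is a hypothetical helper, not part of wmf-auto-reimage.

```python
import ast
from typing import List, Optional, Tuple


def reimage_outcome(message: str) -> Tuple[List[str], List[str]]:
    """Return (completed, failed) host lists parsed from one completion message."""
    completed: List[str] = []
    failed: List[str] = []
    target: Optional[List[str]] = None
    for raw in message.splitlines():
        line = raw.strip()
        if line.startswith("Completed auto-reimage of hosts"):
            target = completed
        elif line.startswith("Of which those FAILED"):
            target = failed
        elif line.startswith("[") and target is not None:
            # Host lists are printed as Python list literals.
            target.extend(ast.literal_eval(line))
    return completed, failed


message = """\
Completed auto-reimage of hosts:

['an-worker1088.eqiad.wmnet']

Of which those FAILED:

['an-worker1104.eqiad.wmnet']
"""

completed, failed = reimage_outcome(message)
print("completed:", completed)
print("failed:", failed)
```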

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1105.eqiad.wmnet', 'an-worker1106.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103081358_elukey_20294.log.

Completed auto-reimage of hosts:

['an-worker1105.eqiad.wmnet', 'an-worker1106.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1107.eqiad.wmnet', 'an-worker1108.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103081459_elukey_13319.log.

Completed auto-reimage of hosts:

['an-worker1107.eqiad.wmnet', 'an-worker1108.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1109.eqiad.wmnet', 'an-worker1110.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103081624_elukey_30551.log.

Completed auto-reimage of hosts:

['an-worker1109.eqiad.wmnet', 'an-worker1110.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1113.eqiad.wmnet', 'an-worker1114.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103081720_elukey_20752.log.

Completed auto-reimage of hosts:

['an-worker1113.eqiad.wmnet', 'an-worker1114.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1115.eqiad.wmnet', 'an-worker1116.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103081817_elukey_11088.log.

Completed auto-reimage of hosts:

['an-worker1115.eqiad.wmnet', 'an-worker1116.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1094.eqiad.wmnet', 'an-worker1095.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103090710_elukey_27457.log.

Completed auto-reimage of hosts:

['an-worker1094.eqiad.wmnet', 'an-worker1095.eqiad.wmnet']

and were ALL successful.

Remaining nodes:

an-worker[1080,1085,1087,1089-1093,1102-1103,1111-1112].eqiad.wmnet,analytics[1072,1076-1077].eqiad.wmnet

an-worker1080.eqiad.wmnet:  /eqiad/A/4  JN
an-worker1102.eqiad.wmnet:  /eqiad/A/4

an-worker1103.eqiad.wmnet:  /eqiad/A/7

analytics1072.eqiad.wmnet:  /eqiad/B/3  JN

an-worker1085.eqiad.wmnet:  /eqiad/B/4

an-worker1087.eqiad.wmnet:  /eqiad/B/7

an-worker1089.eqiad.wmnet:  /eqiad/C/4
an-worker1090.eqiad.wmnet:  /eqiad/C/4  JN

an-worker1091.eqiad.wmnet:  /eqiad/C/7

an-worker1111.eqiad.wmnet:  /eqiad/C/8

analytics1076.eqiad.wmnet:  /eqiad/D/2
an-worker1092.eqiad.wmnet:  /eqiad/D/2
an-worker1093.eqiad.wmnet:  /eqiad/D/2
an-worker1112.eqiad.wmnet:  /eqiad/D/2

analytics1077.eqiad.wmnet:  /eqiad/D/7
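
The compact host expression at the top of this comment can be expanded to cross-check it against the per-rack list. A minimal, self-contained Python sketch; `expand` is a hypothetical helper written for this note (cumin has its own host-selection parsing, which is not used here), and it assumes host numbers are not zero-padded.

```python
import re
from typing import List


def expand(expr: str) -> List[str]:
    """Expand 'prefix[1,3-5].suffix,other[7].suffix' into individual host names."""
    hosts: List[str] = []
    # Split only on commas that sit outside square brackets.
    for part in re.split(r",(?![^\[\]]*\])", expr.strip()):
        match = re.fullmatch(r"([^\[\]]*)\[([^\]]+)\]([^\[\]]*)", part)
        if not match:
            hosts.append(part)  # plain host name without a bracketed range
            continue
        prefix, ranges, suffix = match.groups()
        for item in ranges.split(","):
            low, _, high = item.partition("-")
            # A single number such as "1080" is treated as the range 1080-1080.
            for number in range(int(low), int(high or low) + 1):
                hosts.append(f"{prefix}{number}{suffix}")
    return hosts


remaining = expand(
    "an-worker[1080,1085,1087,1089-1093,1102-1103,1111-1112].eqiad.wmnet,"
    "analytics[1072,1076-1077].eqiad.wmnet"
)
print(len(remaining))
print(remaining[:3])
```

The expression expands to 15 hosts, matching the 15 entries in the racking list above.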

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1092.eqiad.wmnet', 'an-worker1093.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103090829_elukey_11602.log.

Completed auto-reimage of hosts:

['an-worker1092.eqiad.wmnet', 'an-worker1093.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['analytics1076.eqiad.wmnet', 'an-worker1112.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103090923_elukey_24862.log.

Completed auto-reimage of hosts:

['analytics1076.eqiad.wmnet', 'an-worker1112.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1103.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103091251_elukey_18026.log.

Completed auto-reimage of hosts:

['an-worker1103.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1102.eqiad.wmnet', 'an-worker1080.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103091334_elukey_29991.log.

Completed auto-reimage of hosts:

['an-worker1102.eqiad.wmnet', 'an-worker1080.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1090.eqiad.wmnet', 'an-worker1089.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103091428_elukey_18561.log.

Completed auto-reimage of hosts:

['an-worker1090.eqiad.wmnet', 'an-worker1089.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['analytics1072.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103091530_elukey_18926.log.

Completed auto-reimage of hosts:

['analytics1072.eqiad.wmnet']

and were ALL successful.

(5) an-worker[1085,1087,1091,1111].eqiad.wmnet,analytics1077.eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
9.13

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['analytics1077.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103091646_elukey_15244.log.

Completed auto-reimage of hosts:

['analytics1077.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1085.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103091739_elukey_26525.log.

Completed auto-reimage of hosts:

['an-worker1085.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1087.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103091833_elukey_3899.log.

Completed auto-reimage of hosts:

['an-worker1087.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1091.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103092008_elukey_23450.log.

Completed auto-reimage of hosts:

['an-worker1091.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1111.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103100627_elukey_2055.log.

Completed auto-reimage of hosts:

['an-worker1111.eqiad.wmnet']

and were ALL successful.

elukey@cumin1001:~$ sudo cumin 'A:hadoop-worker' 'cat /etc/debian_version'
78 hosts will be targeted:
an-worker[1078-1128,1130-1132,1135-1138].eqiad.wmnet,analytics[1058-1077].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(78) an-worker[1078-1128,1130-1132,1135-1138].eqiad.wmnet,analytics[1058-1077].eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
10.8

Next steps:

  • Complete the hadoop test upgrade (one worker remaining + masters)
  • Upgrade furud/flerovium
  • Upgrade hadoop masters
  • Upgrade hadoop coordinators (complicated, requires a failover)

I am going to open subtasks for all the above.

@razzi, FYI in ops sync today we decided that you could drive a few of these upgrade tasks in Q4, while Luca would drive the Hadoop coordinator node one. I've assigned them accordingly. :)