
Reimage the Hadoop Cluster to Debian Jessie
Closed, Resolved · Public · 8 Estimated Story Points

Description

Tracking task for the reimage of all the Hadoop worker nodes

Event Timeline

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1042.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201703131200_elukey_16132.log.

Change 342448 had a related patch set uploaded (by Elukey):
[operations/puppet] Set Debian Jessie as default image for all the Hadoop worker nodes

https://gerrit.wikimedia.org/r/342448

Change 342448 merged by Elukey:
[operations/puppet] Set Debian Jessie as default image for all the Hadoop worker nodes

https://gerrit.wikimedia.org/r/342448

Completed auto-reimage of hosts:

['analytics1042.eqiad.wmnet']

and were ALL successful.

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1042.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201703131226_elukey_8812.log.

Completed auto-reimage of hosts:

['analytics1042.eqiad.wmnet']

and were ALL successful.

Nuria raised the priority of this task from Medium to High. Mar 13 2017, 3:41 PM
Nuria moved this task from Incoming to Wikistats on the Analytics board.

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1043.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201703141240_elukey_32343.log.

Completed auto-reimage of hosts:

['analytics1043.eqiad.wmnet']

and were ALL successful.

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1044.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201703171152_elukey_26003.log.

Completed auto-reimage of hosts:

['analytics1044.eqiad.wmnet']

and were ALL successful.

We found a weird regression only on analytics1044, causing sporadic job failures:

2017-03-27 14:06:02,923 ERROR org.apache.hadoop.mapred.ShuffleHandler: Shuffle error in populating headers :
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find usercache/hdfs/appcache/application_1488294419903_92875/output/attempt_1488294419903_92875_m_000038_0/file.out.index in any of the configured local directories

Messages like the above one were logged frequently in the Yarn NodeManager logs. A daemon restart fixed the issue, but we didn't find any useful info/trace/etc. to identify the root cause of the problem.
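
For future reference, a minimal sketch of how one could verify this kind of error by hand: it looks for a relative cache path (like the usercache/... path from the ShuffleHandler error) under every directory listed in yarn.nodemanager.local-dirs. The config location /etc/hadoop/conf/yarn-site.xml is an assumption, and this is not part of our tooling.

#!/usr/bin/env python
# Minimal sketch: check whether a relative cache path exists under any of the
# NodeManager's configured local directories.
# Assumes yarn-site.xml lives at /etc/hadoop/conf/yarn-site.xml.
import os
import sys
import xml.etree.ElementTree as ET

YARN_SITE = '/etc/hadoop/conf/yarn-site.xml'

def local_dirs():
    # Return the comma-separated entries of yarn.nodemanager.local-dirs.
    root = ET.parse(YARN_SITE).getroot()
    for prop in root.iter('property'):
        if prop.findtext('name') == 'yarn.nodemanager.local-dirs':
            return [d.strip() for d in prop.findtext('value').split(',')]
    return []

if __name__ == '__main__':
    rel_path = sys.argv[1]  # e.g. usercache/hdfs/appcache/<app>/output/<attempt>/file.out.index
    for d in local_dirs():
        full = os.path.join(d, rel_path)
        print('%s %s' % ('FOUND  ' if os.path.exists(full) else 'missing', full))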

Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts:

['analytics1046.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201703301818_otto_7293.log.

Completed auto-reimage of hosts:

['analytics1046.eqiad.wmnet']

and were ALL successful.

Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts:

['analytics1047.eqiad.wmnet', 'analytics1048.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201703311426_otto_19376.log.

Completed auto-reimage of hosts:

['analytics1047.eqiad.wmnet', 'analytics1048.eqiad.wmnet']

and were ALL successful.

Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts:

['analytics1049.eqiad.wmnet', 'analytics1050.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201703311704_otto_32707.log.

Completed auto-reimage of hosts:

['analytics1049.eqiad.wmnet', 'analytics1050.eqiad.wmnet']

and were ALL successful.


Hm, this is probably unrelated, but I noticed the following on my first puppet run after ensuring that the JBOD mounts were present:

Notice: /Stage[main]/Nrpe/Service[nagios-nrpe-server]: Triggered 'refresh' from 2 events
Notice: /Stage[main]/Cdh::Hadoop::Nodemanager/Package[hadoop-yarn-nodemanager]/ensure: ensure changed 'purged' to 'present'
Notice: /Stage[main]/Cdh::Hadoop::Nodemanager/File[/etc/default/hadoop-yarn-nodemanager]/content:
--- /etc/default/hadoop-yarn-nodemanager        2017-01-20 20:37:11.000000000 +0000
+++ /tmp/puppet-file20170331-23715-ad45uk       2017-03-31 18:47:49.373710088 +0000
@@ -1,18 +1,9 @@
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
+# NOTE: This file is managed by Puppet.
+
 export YARN_IDENT_STRING=yarn
 export YARN_PID_DIR=/var/run/hadoop-yarn
 export YARN_LOG_DIR=/var/log/hadoop-yarn
 export YARN_CONF_DIR=/etc/hadoop/conf
+
+# Set nofiles ulimits for nodemanager processes
+ulimit -n 65536

I don't see a nodemanager refresh happening after this, so it's very possible that after a reinstall the nodemanager runs with a low nofiles ulimit until its next restart.

@elukey, I've restarted nodemanager on an47-an50 just in case.
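
For next time, a minimal sketch of how to check whether the running NodeManager JVMs actually picked up the higher nofiles limit (by reading /proc/<pid>/limits) before deciding on a restart. The pgrep pattern is an assumption about how the process shows up in the process list.

#!/usr/bin/env python
# Minimal sketch: print the soft 'Max open files' limit of each running
# NodeManager JVM, read from /proc/<pid>/limits.
# The pgrep pattern is an assumption; pgrep exits non-zero if nothing matches.
import subprocess

def nodemanager_pids():
    out = subprocess.check_output(['pgrep', '-f', 'NodeManager'])
    return [int(p) for p in out.decode().split()]

def max_open_files(pid):
    with open('/proc/%d/limits' % pid) as limits:
        for line in limits:
            if line.startswith('Max open files'):
                return line.split()[3]  # soft limit column
    return None

if __name__ == '__main__':
    for pid in nodemanager_pids():
        print('%d: nofile soft limit = %s' % (pid, max_open_files(pid)))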

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1029.eqiad.wmnet', 'analytics1030.eqiad.wmnet', 'analytics1031.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704031251_elukey_27203.log.

analytics1030 is refusing to boot; I've opened a phab task: https://phabricator.wikimedia.org/T162046

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1032.eqiad.wmnet', 'analytics1033.eqiad.wmnet', 'analytics1034.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704040817_elukey_22316.log.

Completed auto-reimage of hosts:

['analytics1032.eqiad.wmnet', 'analytics1033.eqiad.wmnet', 'analytics1034.eqiad.wmnet']

Of which those FAILED:

set(['analytics1034.eqiad.wmnet'])

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1036.eqiad.wmnet', 'analytics1037.eqiad.wmnet', 'analytics1038.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704041219_elukey_8140.log.

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1038.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704041309_elukey_21814.log.

Completed auto-reimage of hosts:

['analytics1038.eqiad.wmnet']

and were ALL successful.

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1039.eqiad.wmnet', 'analytics1051.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704041421_elukey_7091.log.

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1039.eqiad.wmnet', 'analytics1051.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704041421_elukey_7170.log.

Completed auto-reimage of hosts:

['analytics1039.eqiad.wmnet', 'analytics1051.eqiad.wmnet']

Of which those FAILED:

set(['analytics1039.eqiad.wmnet'])

Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts:

['analytics1052.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704041621_otto_601.log.

Completed auto-reimage of hosts:

['analytics1052.eqiad.wmnet']

and were ALL successful.

Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts:

['analytics1053.eqiad.wmnet', 'analytics1054.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704041825_otto_27284.log.

Completed auto-reimage of hosts:

['analytics1053.eqiad.wmnet', 'analytics1054.eqiad.wmnet']

and were ALL successful.

Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts:

['analytics1056.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704042025_otto_19810.log.

Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts:

['analytics1055.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704042032_otto_22444.log.

Completed auto-reimage of hosts:

['analytics1056.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['analytics1055.eqiad.wmnet']

and were ALL successful.

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1057.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704050912_elukey_18316.log.

Completed auto-reimage of hosts:

['analytics1057.eqiad.wmnet']

and were ALL successful.

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1035.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704051258_elukey_1539.log.

Completed auto-reimage of hosts:

['analytics1035.eqiad.wmnet']

and were ALL successful.

Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts:

['analytics1028.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704051627_otto_14448.log.

Completed auto-reimage of hosts:

['analytics1028.eqiad.wmnet']

and were ALL successful.

Ottomata renamed this task from Reimage all the Hadoop worker nodes to Debian Jessie to Reimage the Hadoop Cluster to Debian Jessie. Apr 6 2017, 3:14 PM

Status:

  • All worker nodes except analytics1030 (down for hardware failures) are on Debian Jessie
  • Some worker nodes need to be rebooted to pick up the Linux 4.9 kernel (already installed); a quick check for this is sketched below
  • analytics1002 has been reimaged to Debian and hyperthreading has been enabled

Nodes left:

  • analytics1001 and analytics1003

The latter has been scheduled for April 17th since services like Druid and Pivot will need to be stopped during the reimage.
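
As referenced in the status above, a rough sketch of the "reboot needed for the new kernel" check: compare the running kernel with the installed linux-image packages on a Debian host. The '4.9' substring match is just a heuristic for this particular upgrade, not what our tooling does.

#!/usr/bin/env python
# Minimal sketch: is the host running an older kernel than the newest
# installed linux-image package? Assumes a Debian host with dpkg-query.
import platform
import subprocess

def installed_kernels():
    out = subprocess.check_output(
        ['dpkg-query', '-W', '-f', '${Package}\n', 'linux-image-*'])
    return [p for p in out.decode().splitlines() if p]

if __name__ == '__main__':
    running = platform.release()  # e.g. 4.9.0-0.bpo.2-amd64
    kernels = installed_kernels()
    print('running  : %s' % running)
    print('installed: %s' % ', '.join(kernels))
    # Heuristic: a 4.9 package is installed but does not match the running kernel.
    pending = [p for p in kernels if '4.9' in p and running not in p]
    if pending:
        print('reboot needed to pick up: %s' % ', '.join(pending))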

Mentioned in SAL (#wikimedia-operations) [2017-04-10T08:39:36Z] <elukey> manual failover of Hadoop master daemons from analytics1001 to analytics1002 (T160333)

Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts:

['stat1004.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704111354_otto_10779.log.

analytics1003 was done, so to complete the work we'd need to reimage:

  • stat100[23]
  • analytics1030 (down for maintenance)

All three hosts have their own tasks (new hardware or dc-ops maintenance), so I'd consider this task completed.

We don't need to reimage stat100[23]. They should be decommed this quarter.

Nuria set the point value for this task to 8.