Tracking task for the reimage of all the Hadoop worker nodes
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
Set Debian Jessie as default image for all the Hadoop worker nodes | operations/puppet | production | +0 -56
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | None | | T157807 Reinstall Analytics Hadoop Cluster with Debian Jessie
Resolved | | elukey | T160333 Reimage the Hadoop Cluster to Debian Jessie
Event Timeline
Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:
['analytics1042.eqiad.wmnet']
The log can be found in /var/log/wmf-auto-reimage/201703131200_elukey_16132.log.
Change 342448 had a related patch set uploaded (by Elukey):
[operations/puppet] Set Debian Jessie as default image for all the Hadoop worker nodes
Change 342448 merged by Elukey:
[operations/puppet] Set Debian Jessie as default image for all the Hadoop worker nodes
Completed auto-reimage of hosts:
['analytics1042.eqiad.wmnet']
and were ALL successful.
Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:
['analytics1042.eqiad.wmnet']
The log can be found in /var/log/wmf-auto-reimage/201703131226_elukey_8812.log.
Completed auto-reimage of hosts:
['analytics1042.eqiad.wmnet']
and were ALL successful.
Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:
['analytics1043.eqiad.wmnet']
The log can be found in /var/log/wmf-auto-reimage/201703141240_elukey_32343.log.
Completed auto-reimage of hosts:
['analytics1043.eqiad.wmnet']
and were ALL successful.
Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:
['analytics1044.eqiad.wmnet']
The log can be found in /var/log/wmf-auto-reimage/201703171152_elukey_26003.log.
Completed auto-reimage of hosts:
['analytics1044.eqiad.wmnet']
and were ALL successful.
We found a weird regression, only on analytics1044, causing sporadic job failures:
2017-03-27 14:06:02,923 ERROR org.apache.hadoop.mapred.ShuffleHandler: Shuffle error in populating headers : org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find usercache/hdfs/appcache/application_1488294419903_92875/output/attempt_1488294419903_92875_m_000038_0/file.out.index in any of the configured local directories
Messages like the one above were logged frequently in the Yarn NodeManager logs. A daemon restart fixed the issue, but we couldn't find any useful information or stack trace to identify the root cause.
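The error means the ShuffleHandler could not find a map task's output index file under any of the directories configured in `yarn.nodemanager.local-dirs`. A quick way to verify by hand whether such a file really is missing on disk is a sketch like the following (the function name and the `/tmp/yarn-local` scratch path are illustrative assumptions; on a real node you would pass the actual local-dirs values from yarn-site.xml):

```shell
# Hedged sketch: scan a list of NodeManager local dirs for a given
# application's shuffle index files. The directories passed in are
# assumptions; take the real values from yarn.nodemanager.local-dirs.
find_shuffle_index() {
  app="$1"; shift            # e.g. application_1488294419903_92875
  for d in "$@"; do
    find "$d" -path "*appcache/${app}/*" -name 'file.out.index' 2>/dev/null
  done
}

# Demonstrated against a scratch tree standing in for one local dir:
mkdir -p /tmp/yarn-local/usercache/hdfs/appcache/application_1488294419903_92875/output
touch /tmp/yarn-local/usercache/hdfs/appcache/application_1488294419903_92875/output/file.out.index
find_shuffle_index application_1488294419903_92875 /tmp/yarn-local
```

If the file exists in one of the dirs but the NodeManager still logs the error, the daemon's in-memory view of the local dirs is stale, which is consistent with a restart making the problem go away.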
Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts:
['analytics1046.eqiad.wmnet']
The log can be found in /var/log/wmf-auto-reimage/201703301818_otto_7293.log.
Completed auto-reimage of hosts:
['analytics1046.eqiad.wmnet']
and were ALL successful.
Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts:
['analytics1047.eqiad.wmnet', 'analytics1048.eqiad.wmnet']
The log can be found in /var/log/wmf-auto-reimage/201703311426_otto_19376.log.
Completed auto-reimage of hosts:
['analytics1047.eqiad.wmnet', 'analytics1048.eqiad.wmnet']
and were ALL successful.
Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts:
['analytics1049.eqiad.wmnet', 'analytics1050.eqiad.wmnet']
The log can be found in /var/log/wmf-auto-reimage/201703311704_otto_32707.log.
Completed auto-reimage of hosts:
['analytics1049.eqiad.wmnet', 'analytics1050.eqiad.wmnet']
and were ALL successful.
Hm, this is unlikely to be related, but I noticed the following on my first Puppet run after ensuring that the JBOD mounts were present:
```
Notice: /Stage[main]/Nrpe/Service[nagios-nrpe-server]: Triggered 'refresh' from 2 events
Notice: /Stage[main]/Cdh::Hadoop::Nodemanager/Package[hadoop-yarn-nodemanager]/ensure: ensure changed 'purged' to 'present'
Notice: /Stage[main]/Cdh::Hadoop::Nodemanager/File[/etc/default/hadoop-yarn-nodemanager]/content:
--- /etc/default/hadoop-yarn-nodemanager	2017-01-20 20:37:11.000000000 +0000
+++ /tmp/puppet-file20170331-23715-ad45uk	2017-03-31 18:47:49.373710088 +0000
@@ -1,18 +1,9 @@
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
+# NOTE: This file is managed by Puppet.
+
 export YARN_IDENT_STRING=yarn
 export YARN_PID_DIR=/var/run/hadoop-yarn
 export YARN_LOG_DIR=/var/log/hadoop-yarn
 export YARN_CONF_DIR=/etc/hadoop/conf
+
+# Set nofiles ulimits for nodemanager processes
+ulimit -n 65536
```
I don't see a nodemanager refresh happening after this, so it is very possible that after a reinstall the NodeManager keeps running with a low nofiles ulimit until its next restart.
@elukey, I've restarted nodemanager on an47-an50 just in case.
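A way to confirm whether a long-running process actually picked up the new ulimit, without restarting it, is to read the effective limit from `/proc/<pid>/limits` (Linux-specific). This is a sketch; the `pgrep` pattern for the NodeManager is an assumption:

```shell
# Hedged sketch: print the effective soft limit on open files for a
# running process, read from /proc/<pid>/limits (Linux-specific).
max_open_files() {
  awk '/Max open files/ {print $4}' "/proc/$1/limits"
}

# For the NodeManager you would resolve the pid first, e.g. (pattern
# is an assumption, adjust to the actual java command line):
#   pid=$(pgrep -f 'yarn.server.nodemanager.NodeManager' | head -1)
# Demonstrated here on the current shell:
max_open_files $$
```

If this prints the old low value rather than 65536, the daemon was started before the `ulimit -n` line landed in `/etc/default/hadoop-yarn-nodemanager` and needs a restart, as was done on an47-an50.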
Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:
['analytics1029.eqiad.wmnet', 'analytics1030.eqiad.wmnet', 'analytics1031.eqiad.wmnet']
The log can be found in /var/log/wmf-auto-reimage/201704031251_elukey_27203.log.
Analytics1030 is refusing to boot, opened a phab task: https://phabricator.wikimedia.org/T162046
Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:
['analytics1032.eqiad.wmnet', 'analytics1033.eqiad.wmnet', 'analytics1034.eqiad.wmnet']
The log can be found in /var/log/wmf-auto-reimage/201704040817_elukey_22316.log.
Completed auto-reimage of hosts:
['analytics1032.eqiad.wmnet', 'analytics1033.eqiad.wmnet', 'analytics1034.eqiad.wmnet']
Of which those FAILED:
set(['analytics1034.eqiad.wmnet'])
Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:
['analytics1036.eqiad.wmnet', 'analytics1037.eqiad.wmnet', 'analytics1038.eqiad.wmnet']
The log can be found in /var/log/wmf-auto-reimage/201704041219_elukey_8140.log.
Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:
['analytics1038.eqiad.wmnet']
The log can be found in /var/log/wmf-auto-reimage/201704041309_elukey_21814.log.
Completed auto-reimage of hosts:
['analytics1038.eqiad.wmnet']
and were ALL successful.
Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:
['analytics1039.eqiad.wmnet', 'analytics1051.eqiad.wmnet']
The log can be found in /var/log/wmf-auto-reimage/201704041421_elukey_7091.log.
Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:
['analytics1039.eqiad.wmnet', 'analytics1051.eqiad.wmnet']
The log can be found in /var/log/wmf-auto-reimage/201704041421_elukey_7170.log.
Completed auto-reimage of hosts:
['analytics1039.eqiad.wmnet', 'analytics1051.eqiad.wmnet']
Of which those FAILED:
set(['analytics1039.eqiad.wmnet'])
Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts:
['analytics1052.eqiad.wmnet']
The log can be found in /var/log/wmf-auto-reimage/201704041621_otto_601.log.
Completed auto-reimage of hosts:
['analytics1052.eqiad.wmnet']
and were ALL successful.
Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts:
['analytics1053.eqiad.wmnet', 'analytics1054.eqiad.wmnet']
The log can be found in /var/log/wmf-auto-reimage/201704041825_otto_27284.log.
Completed auto-reimage of hosts:
['analytics1053.eqiad.wmnet', 'analytics1054.eqiad.wmnet']
and were ALL successful.
Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts:
['analytics1056.eqiad.wmnet']
The log can be found in /var/log/wmf-auto-reimage/201704042025_otto_19810.log.
Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts:
['analytics1055.eqiad.wmnet']
The log can be found in /var/log/wmf-auto-reimage/201704042032_otto_22444.log.
Completed auto-reimage of hosts:
['analytics1056.eqiad.wmnet']
and were ALL successful.
Completed auto-reimage of hosts:
['analytics1055.eqiad.wmnet']
and were ALL successful.
Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:
['analytics1057.eqiad.wmnet']
The log can be found in /var/log/wmf-auto-reimage/201704050912_elukey_18316.log.
Completed auto-reimage of hosts:
['analytics1057.eqiad.wmnet']
and were ALL successful.
Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:
['analytics1035.eqiad.wmnet']
The log can be found in /var/log/wmf-auto-reimage/201704051258_elukey_1539.log.
Completed auto-reimage of hosts:
['analytics1035.eqiad.wmnet']
and were ALL successful.
Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts:
['analytics1028.eqiad.wmnet']
The log can be found in /var/log/wmf-auto-reimage/201704051627_otto_14448.log.
Completed auto-reimage of hosts:
['analytics1028.eqiad.wmnet']
and were ALL successful.
Status:
- All worker nodes except analytics1030 (down due to hardware failures) now run Debian Jessie
- Some worker nodes still need to be rebooted to pick up the Linux 4.9 kernel (already installed)
- analytics1002 has been reimaged to Debian and hyperthreading has been enabled
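The "needs a reboot for the new kernel" state in the list above can be checked per host by comparing the running kernel with the newest installed linux-image package (a sketch; `dpkg-query` is the standard Debian tool, but the package-name pattern is an assumption):

```shell
# Hedged sketch: is a reboot pending to pick up a newer kernel?
# Compare the running kernel with the newest installed linux-image
# package (Debian-specific; pattern below is an assumption).
running=$(uname -r)
newest=$(dpkg-query -W -f '${Package}\n' 'linux-image-[0-9]*' 2>/dev/null | sort -V | tail -1)
echo "running kernel:   $running"
echo "newest installed: $newest"
```

If the newest installed package version is ahead of `uname -r`, the host is still running the old kernel and should be scheduled for a reboot.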
Nodes left:
- analytics1001 and analytics1003
analytics1003 has been scheduled for April 17th, since services like Druid and Pivot will need to be stopped during its reimage.
Mentioned in SAL (#wikimedia-operations) [2017-04-10T08:39:36Z] <elukey> manual failover of Hadoop master daemons from analytics1001 to analytics1002 (T160333)
Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts:
['stat1004.eqiad.wmnet']
The log can be found in /var/log/wmf-auto-reimage/201704111354_otto_10779.log.
analytics1003 was done, so to complete the work we would need to reimage:
- stat100[23]
- analytics1030 (down for maintenance)
All three hosts are covered by separate tasks (new hardware or DC-Ops maintenance), so I'd consider this task completed.