Reimage the Hadoop Cluster to Debian Jessie
Closed, Resolved | Public | 8 Estimated Story Points

Description

Tracking task for the reimage of all the Hadoop worker nodes.

Event Timeline

elukey created this task. Mar 13 2017, 12:00 PM

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1042.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201703131200_elukey_16132.log.

elukey moved this task from Backlog to In Progress on the User-Elukey board. Mar 13 2017, 12:02 PM

Change 342448 had a related patch set uploaded (by Elukey):
[operations/puppet] Set Debian Jessie as default image for all the Hadoop worker nodes

https://gerrit.wikimedia.org/r/342448

Change 342448 merged by Elukey:
[operations/puppet] Set Debian Jessie as default image for all the Hadoop worker nodes

https://gerrit.wikimedia.org/r/342448

Completed auto-reimage of hosts:

['analytics1042.eqiad.wmnet']

and were ALL successful.

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1042.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201703131226_elukey_8812.log.

Completed auto-reimage of hosts:

['analytics1042.eqiad.wmnet']

and were ALL successful.

Nuria raised the priority of this task from Medium to High. Mar 13 2017, 3:41 PM
Nuria moved this task from Incoming to Wikistats on the Analytics board.

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1043.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201703141240_elukey_32343.log.

Completed auto-reimage of hosts:

['analytics1043.eqiad.wmnet']

and were ALL successful.

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1044.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201703171152_elukey_26003.log.

Completed auto-reimage of hosts:

['analytics1044.eqiad.wmnet']

and were ALL successful.

We found a weird regression only on analytics1044, causing sporadic job failures:

2017-03-27 14:06:02,923 ERROR org.apache.hadoop.mapred.ShuffleHandler: Shuffle error in populating headers :
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find usercache/hdfs/appcache/application_1488294419903_92875/output/attempt_1488294419903_92875_m_000038_0/file.out.index in any of the configured local directories

Messages like the one above were logged frequently in the YARN NodeManager logs. A daemon restart fixed the issue, but we didn't find any useful info, traces, etc. to identify the root cause of the problem.
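To get a rough idea of how often this error shows up, and for which jobs, the NodeManager log lines could be counted per application. The following is only a minimal sketch: the function name and the regular expression are illustrative assumptions built from the single log message quoted above, not part of any Hadoop tooling.

```python
import re
from collections import Counter

# Matches the "Could not find ... file.out.index" message quoted above and
# captures the YARN application id embedded in the path.
# The pattern is an assumption derived from that one sample line.
MISSING_INDEX = re.compile(
    r"DiskErrorException: Could not find \S*?(application_\d+_\d+)\S*file\.out\.index"
)


def count_missing_index_errors(lines):
    """Return a Counter mapping application id -> number of matching lines."""
    counts = Counter()
    for line in lines:
        match = MISSING_INDEX.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The two log lines from the comment above, used as sample input.
sample = [
    "2017-03-27 14:06:02,923 ERROR org.apache.hadoop.mapred.ShuffleHandler: "
    "Shuffle error in populating headers :",
    "org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find "
    "usercache/hdfs/appcache/application_1488294419903_92875/output/"
    "attempt_1488294419903_92875_m_000038_0/file.out.index "
    "in any of the configured local directories",
]
print(count_missing_index_errors(sample))
```

Any iterable of lines works as input, so an open NodeManager log file can be passed directly instead of the sample list.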

elukey moved this task from Next Up to In Progress on the Analytics-Kanban board. Mar 30 2017, 2:00 PM

Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts:

['analytics1046.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201703301818_otto_7293.log.

Completed auto-reimage of hosts:

['analytics1046.eqiad.wmnet']

and were ALL successful.

Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts:

['analytics1047.eqiad.wmnet', 'analytics1048.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201703311426_otto_19376.log.

Completed auto-reimage of hosts:

['analytics1047.eqiad.wmnet', 'analytics1048.eqiad.wmnet']

and were ALL successful.

Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts:

['analytics1049.eqiad.wmnet', 'analytics1050.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201703311704_otto_32707.log.

Completed auto-reimage of hosts:

['analytics1049.eqiad.wmnet', 'analytics1050.eqiad.wmnet']

and were ALL successful.

> Messages like the one above were logged frequently in the YARN NodeManager logs. A daemon restart fixed the issue, but we didn't find any useful info, traces, etc. to identify the root cause of the problem.

Hm, this is unlikely to be related, but I noticed the following on my first puppet run after ensuring that the JBOD mounts were present:

Notice: /Stage[main]/Nrpe/Service[nagios-nrpe-server]: Triggered 'refresh' from 2 events
Notice: /Stage[main]/Cdh::Hadoop::Nodemanager/Package[hadoop-yarn-nodemanager]/ensure: ensure changed 'purged' to 'present'
Notice: /Stage[main]/Cdh::Hadoop::Nodemanager/File[/etc/default/hadoop-yarn-nodemanager]/content:
--- /etc/default/hadoop-yarn-nodemanager        2017-01-20 20:37:11.000000000 +0000
+++ /tmp/puppet-file20170331-23715-ad45uk       2017-03-31 18:47:49.373710088 +0000
@@ -1,18 +1,9 @@
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
+# NOTE: This file is managed by Puppet.
+
 export YARN_IDENT_STRING=yarn
 export YARN_PID_DIR=/var/run/hadoop-yarn
 export YARN_LOG_DIR=/var/log/hadoop-yarn
 export YARN_CONF_DIR=/etc/hadoop/conf
+
+# Set nofiles ulimits for nodemanager processes
+ulimit -n 65536

I don't see a nodemanager refresh happening after this, so it is very possible that on reinstall, the nodemanager is running with a low nofiles ulimit until its next restart.

@elukey, I've restarted nodemanager on analytics1047-1050 just in case.
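One way to test the ulimit hypothesis on a freshly reimaged host would be to read the limits of the running NodeManager process from /proc/<pid>/limits and compare the soft "Max open files" value with the 65536 set in the Puppet-managed defaults file. The sketch below is illustrative only: the helper names are made up for this example, and finding the NodeManager PID is left to the reader.

```python
def _to_int(value):
    """Convert a limits field to int; 'unlimited' becomes None."""
    return None if value == "unlimited" else int(value)


def parse_max_open_files(limits_text):
    """Return (soft, hard) 'Max open files' limits from /proc/<pid>/limits text."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Fields: "Max", "open", "files", <soft>, <hard>, "files"
            soft, hard = line.split()[3:5]
            return _to_int(soft), _to_int(hard)
    raise ValueError("no 'Max open files' line found")


def soft_limit_too_low(limits_text, expected=65536):
    """True if the soft nofile limit is below the value Puppet configures."""
    soft, _hard = parse_max_open_files(limits_text)
    return soft is not None and soft < expected


# Hypothetical excerpt of /proc/<pid>/limits for a process started with
# default limits; real input would come from open(f"/proc/{pid}/limits").read()
sample = (
    "Limit                     Soft Limit           Hard Limit           Units\n"
    "Max open files            1024                 4096                 files\n"
)
print(parse_max_open_files(sample), soft_limit_too_low(sample))
```

If the check reports a low soft limit on a host that already has the new defaults file, the daemon has most likely not been restarted since Puppet wrote it.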

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1029.eqiad.wmnet', 'analytics1030.eqiad.wmnet', 'analytics1031.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704031251_elukey_27203.log.

elukey added a comment. Apr 3 2017, 2:09 PM

Analytics1030 is refusing to boot, opened a phab task: https://phabricator.wikimedia.org/T162046

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1032.eqiad.wmnet', 'analytics1033.eqiad.wmnet', 'analytics1034.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704040817_elukey_22316.log.

Completed auto-reimage of hosts:

['analytics1032.eqiad.wmnet', 'analytics1033.eqiad.wmnet', 'analytics1034.eqiad.wmnet']

Of which those FAILED:

set(['analytics1034.eqiad.wmnet'])

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1036.eqiad.wmnet', 'analytics1037.eqiad.wmnet', 'analytics1038.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704041219_elukey_8140.log.

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1038.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704041309_elukey_21814.log.

Completed auto-reimage of hosts:

['analytics1038.eqiad.wmnet']

and were ALL successful.

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1039.eqiad.wmnet', 'analytics1051.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704041421_elukey_7091.log.

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1039.eqiad.wmnet', 'analytics1051.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704041421_elukey_7170.log.

Completed auto-reimage of hosts:

['analytics1039.eqiad.wmnet', 'analytics1051.eqiad.wmnet']

Of which those FAILED:

set(['analytics1039.eqiad.wmnet'])

Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts:

['analytics1052.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704041621_otto_601.log.

Completed auto-reimage of hosts:

['analytics1052.eqiad.wmnet']

and were ALL successful.

Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts:

['analytics1053.eqiad.wmnet', 'analytics1054.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704041825_otto_27284.log.

Completed auto-reimage of hosts:

['analytics1053.eqiad.wmnet', 'analytics1054.eqiad.wmnet']

and were ALL successful.

Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts:

['analytics1056.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704042025_otto_19810.log.

Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts:

['analytics1055.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704042032_otto_22444.log.

Completed auto-reimage of hosts:

['analytics1056.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['analytics1055.eqiad.wmnet']

and were ALL successful.

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1057.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704050912_elukey_18316.log.

Completed auto-reimage of hosts:

['analytics1057.eqiad.wmnet']

and were ALL successful.

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1035.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704051258_elukey_1539.log.

Completed auto-reimage of hosts:

['analytics1035.eqiad.wmnet']

and were ALL successful.

Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts:

['analytics1028.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704051627_otto_14448.log.

Completed auto-reimage of hosts:

['analytics1028.eqiad.wmnet']

and were ALL successful.

Ottomata renamed this task from "Reimage all the Hadoop worker nodes to Debian Jessie" to "Reimage the Hadoop Cluster to Debian Jessie". Apr 6 2017, 3:14 PM
elukey added a comment. Edited Apr 7 2017, 2:40 PM

Status:

  • All worker nodes except analytics1030 (down for hw failures) have Debian Jessie
  • Some worker nodes need to be rebooted to pick up the Linux 4.9 kernel (already installed)
  • analytics1002 has been reimaged to Debian and hyperthreading has been enabled

Nodes left:

  • analytics1001 and analytics1003

The latter has been scheduled for April 17th since services like Druid and Pivot will need to be stopped during the reimage.

Mentioned in SAL (#wikimedia-operations) [2017-04-10T08:39:36Z] <elukey> manual failover of Hadoop master daemons from analytics1001 to analytics1002 (T160333)

Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts:

['stat1004.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704111354_otto_10779.log.

analytics1003 was done, so to complete the work we'd need to reimage:

  • stat100[23]
  • analytics1030 (down for maintenance)

All three hosts have tasks (new hardware or dc-ops maintenance) so I'd consider this task completed.

elukey moved this task from In Progress to Done on the Analytics-Kanban board. Apr 28 2017, 2:52 PM

We don't need to reimage stat100[23]. They should be decommissioned this quarter.

elukey moved this task from In Progress to Done on the User-Elukey board. May 2 2017, 4:25 PM
Nuria closed this task as Resolved. May 4 2017, 5:07 PM
Nuria set the point value for this task to 8.