
Upgrade the Hadoop test cluster to BigTop
Closed, Resolved (Public)

Description

Time to upgrade the Hadoop test cluster to BigTop, in order to work out a possible upgrade/migration procedure from CDH.

Things to keep in mind:

The HDFS Namenode daemon's init script supports the following actions:

# When running upgrade, ensure that -renameReserved is added by default.
upgrade|rollback)
  DAEMON_FLAGS="$DAEMON_FLAGS -${@}"
  if [[ ! " ${DAEMON_FLAGS} " =~ " -renameReserved " ]] && [[ " ${DAEMON_FLAGS} " =~ " -upgrade " ]]; then
    DAEMON_FLAGS="$DAEMON_FLAGS -renameReserved"
  fi
  start
  ;;
rollingUpgradeStarted)
  DAEMON_FLAGS="$DAEMON_FLAGS -rollingUpgrade started"
  start
  ;;
rollingUpgradeRollback)
  DAEMON_FLAGS="$DAEMON_FLAGS -rollingUpgrade rollback"
  start
  ;;
rollingUpgradeDowngrade)
  DAEMON_FLAGS="$DAEMON_FLAGS -rollingUpgrade downgrade"
  start
  ;;

Meanwhile, the Datanode's init script has:

rollback)
  DAEMON_FLAGS="$DAEMON_FLAGS -${1}"
  start
  ;;
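
For reference, a minimal sketch of how these init actions end up being invoked on the hosts (based on the service commands used later in this task; service names as shipped by the packages):

  # Namenode: start with the -upgrade flag (the init script appends -renameReserved automatically)
  sudo service hadoop-hdfs-namenode upgrade

  # Namenode: roll back to the pre-upgrade state
  sudo service hadoop-hdfs-namenode rollback

  # Datanode: roll back its local block pool state
  sudo service hadoop-hdfs-datanode rollback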

Procedure WIP in https://etherpad.wikimedia.org/p/analytics-bigtop

Details

Related patches (repo, branch, lines +/-; patch subjects omitted):
  operations/puppet (production): +5 -0
  analytics/refinery (master): +18 -18
  operations/cookbooks (master): +16 -14
  operations/cookbooks (master): +13 -2
  operations/cookbooks (master): +8 -1
  operations/cookbooks (master): +14 -10
  operations/puppet (production): +6 -6
  operations/cookbooks (master): +2 -2
  operations/cookbooks (master): +12 -9
  operations/cookbooks (master): +381 -0
  operations/puppet (production): +10 -10
  operations/puppet (production): +10 -10
  operations/puppet (production): +10 -7
  operations/puppet (production): +3 -0
  operations/puppet (production): +7 -8
  operations/puppet (production): +15 -352
  operations/puppet (production): +4 -3
  operations/puppet (production): +0 -12
  operations/puppet (production): +1 -1
  operations/puppet (production): +12 -2
  operations/puppet (production): +1 -0
  operations/puppet (production): +5 -3
  operations/puppet (production): +15 -0
  operations/puppet (production): +13 -5

Event Timeline


Change 575242 merged by Elukey:
[operations/puppet@production] cdh::hive: improve jar file match regex to work with BigTop

https://gerrit.wikimedia.org/r/575242

Created https://issues.apache.org/jira/browse/BIGTOP-3317 after debugging why the oozie sharedlib create command was failing.

Change 576099 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] cdh::hive: remove DBTokenStore from hive-site.xml config

https://gerrit.wikimedia.org/r/576099

Change 576099 merged by Elukey:
[operations/puppet@production] cdh::hive: remove DBTokenStore from hive-site.xml config

https://gerrit.wikimedia.org/r/576099

Change 577771 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::oozie::server: allow to set the sharedlibpath

https://gerrit.wikimedia.org/r/577771

Change 577771 merged by Elukey:
[operations/puppet@production] profile::oozie::server: allow to set the sharedlibpath

https://gerrit.wikimedia.org/r/577771

@nshahquinn-wmf Hi! In Hadoop test we are testing the migration from Cloudera's CDH to Apache BigTop, which among other things ships with Hadoop 2.8 and Hive 2.2.3. Would you be interested in doing some tests? I don't have anything in particular in mind, but I recall that you were passionate about Hive 2, so this is why I am asking :)

In case you want to do a quick test, ssh to an-tool1006 (and kinit there as you do elsewhere).

The rollback of HDFS at this stage should be easy; the main question mark is the oozie/hive db schemas. We have been running the Hadoop cluster with the new version of HDFS for some days, and hive and oozie were upgraded as well (together with their db schemas). During this timeframe oozie jobs were run, and hive changes were made to the metastore. For the hadoop test cluster it might be a simple matter of reverting back to a known good db state (we have backups), but if this happens in production, what would be the strategy?

There are two use cases:

  1. we upgrade hdfs and realize from the first tests that something is wrong, so we roll back. No issue with Hive/Oozie, since their db state didn't change.
  2. we upgrade hdfs, and we realize only a few days afterwards that something is broken and we need to roll back.

Case 2) is challenging since multiple users plus our recurring jobs would have already changed their state. Rolling back to a previous db status might cause inconsistencies here and there that would be difficult to debug and deal with. Suggestions/comments?

elukey lowered the priority of this task from High to Medium. (Mar 20 2020, 1:15 PM)
elukey added a project: Analytics-Kanban.

Today, while re-installing the new version of the oozie/hadoop packages, I hit a problem that I had forgotten to fix, namely:

Unpacking oozie (4.3.0-2) ...
dpkg: error processing archive /var/cache/apt/archives/oozie_4.3.0-2_all.deb (--unpack):
 trying to overwrite '/usr/lib/oozie/lib/accessors-smart-1.2.jar', which is also in package oozie-client 4.3.0-2
dpkg-deb: error: subprocess paste was killed by signal (Broken pipe)
Errors were encountered while processing:
 /var/cache/apt/archives/oozie_4.3.0-2_all.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)

I manually resolved it like this:

elukey@analytics1030:~$ sudo dpkg -i --force-overwrite /var/cache/apt/archives/oozie_4.3.0-2_all.deb
(Reading database ... 106775 files and directories currently installed.)
Preparing to unpack .../archives/oozie_4.3.0-2_all.deb ...
Unpacking oozie (4.3.0-2) ...
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/accessors-smart-1.2.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/activemq-client-5.13.3.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/apacheds-i18n-2.0.0-M15.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/apacheds-kerberos-codec-2.0.0-M15.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/api-asn1-api-1.0.0-M20.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/api-util-1.0.0-M20.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/asm-5.0.4.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/commons-cli-1.2.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/commons-codec-1.4.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/commons-logging-1.1.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/curator-client-2.5.0.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/curator-framework-2.5.0.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/geronimo-j2ee-management_1.1_spec-1.0.1.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/geronimo-jms_1.1_spec-1.1.1.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/guava-11.0.2.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/hawtbuf-1.11.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/httpclient-4.3.6.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/httpcore-4.3.3.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/jackson-core-asl-1.9.13.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/jackson-mapper-asl-1.9.13.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/jcip-annotations-1.0-1.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/jline-0.9.94.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/json-simple-1.1.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/json-smart-2.3.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/jsr305-1.3.9.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/netty-3.7.0.Final.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/nimbus-jose-jwt-4.41.1.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/oozie-client-4.3.0.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/oozie-hadoop-auth-hadoop-2-4.3.0.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/slf4j-api-1.6.6.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/slf4j-simple-1.6.6.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/xercesImpl-2.10.0.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/xml-apis-1.4.01.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/zookeeper-3.4.6.jar', which is also in package oozie-client 4.3.0-2
dpkg: dependency problems prevent configuration of oozie:
 oozie depends on oozie-client (= 4.3.0-2); however:
  Package oozie-client is not configured yet.

dpkg: error processing package oozie (--install):
 dependency problems - leaving unconfigured
Processing triggers for systemd (232-25+deb9u12) ...
Errors were encountered while processing:
 oozie

elukey@analytics1030:~$ sudo apt-get install oozie -f
Reading package lists... Done
Building dependency tree
Reading state information... Done
oozie is already the newest version (4.3.0-2).
The following packages were automatically installed and are no longer required:
  blt libprotobuf10 libssl1.0.0 linux-image-4.9.0-8-amd64 net-tools python-backports-shutil-get-terminal-size python-cycler python-enum34 python-etcd python-funcsigs python-functools32 python-ipython python-ipython-genutils python-joblib
  python-jsonschema python-lxml python-maxminddb python-mock python-mpmath python-pathlib2 python-pbr python-pickleshare python-prompt-toolkit python-protobuf python-pyparsing python-simplegeneric python-subprocess32 python-traitlets python-tz
  python-wcwidth ruby-nokogiri ruby-pkg-config ruby-rgen tk8.6-blt2.5
Use 'sudo apt autoremove' to remove them.
0 upgraded, 0 newly installed, 0 to remove and 22 not upgraded.
2 not fully installed or removed.
After this operation, 0 B of additional disk space will be used.
Do you want to continue? [Y/n] y
No directory, logging in with HOME=/
INFO:debmonitor:Got 0 updates from dpkg hook version 3
INFO:debmonitor:Successfully sent the dpkg_hook update to the DebMonitor server
Setting up oozie-client (4.3.0-2) ...
Setting up oozie (4.3.0-2) ...
update-alternatives: using /etc/oozie/tomcat-conf.http to provide /etc/oozie/tomcat-conf (oozie-tomcat-conf) in auto mode
Processing triggers for man-db (2.7.6.1-2) ...
Processing triggers for systemd (232-25+deb9u12) ...

So it seems that the oozie and oozie-client packages conflict because both ship an overlapping set of jars under /usr/lib/oozie/lib.
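
As a hedged diagnostic sketch (not the actual fix, which was a new oozie package), dpkg itself can show the overlap between the two packages:

# Which installed package currently owns one of the conflicting jars?
dpkg -S /usr/lib/oozie/lib/accessors-smart-1.2.jar

# Once both packages are installed, list the files they both ship under /usr/lib/oozie/lib
comm -12 <(dpkg -L oozie | sort) <(dpkg -L oozie-client | sort) | grep '/usr/lib/oozie/lib'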

Mentioned in SAL (#wikimedia-operations) [2020-03-23T11:27:31Z] <elukey> upload oozie 4.3.0-3 to thirdparty/bigtop14 on wikimedia-stretch - T244499

Fixed, deployed and tested.


After a chat with Joseph, we agreed that case 2) above could be handled simply by keeping the time between the HDFS upgrade and its finalization limited, and warning people that the state of hive/oozie might be rolled back during that timeframe.

Change 583065 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Avoid overriding Hadoop's core files to allow IPv6

https://gerrit.wikimedia.org/r/583065

Change 583065 merged by Elukey:
[operations/puppet@production] Avoid overriding Hadoop's core files to allow IPv6

https://gerrit.wikimedia.org/r/583065

Change 583069 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Restore CDH settings for Hadoop Test

https://gerrit.wikimedia.org/r/583069

Change 583069 merged by Elukey:
[operations/puppet@production] Restore CDH settings for Hadoop Test

https://gerrit.wikimedia.org/r/583069

Change 583303 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set maximum failover retry attempts for HDFS in Hadoop Test

https://gerrit.wikimedia.org/r/583303

Change 583303 merged by Elukey:
[operations/puppet@production] Set maximum failover retry attempts for HDFS in Hadoop Test

https://gerrit.wikimedia.org/r/583303

The first rollback attempt was a disaster: I wasn't able to restore HDFS to its previous state.

From the documentation it seemed possible to roll back the state of HDFS after the upgrade but before having finalized it. Today I tried, but the HDFS namenodes refused to comply, erroring out with complaints about gaps in the edit log. After a bit of research, and after comparing the errors with the edit/fsimage file names (since they contain the range of transactions), I came to the conclusion that with QJM (journal nodes) this kind of rollback is difficult or impossible. The main problem is that the Namenodes create an fsimage to roll back to, but after weeks the gap between what is stored in the edit log and the last transaction of the rollback fsimage becomes too big (since older edits have been folded into newer fsimages). This means that when the Namenode tries to roll back, it reads the rollback fsimage and tries to pull the missing edits from the edit log, but they are not there anymore.
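
For context, the transaction ranges involved can be read directly from the metadata file names on the namenode (a sketch; the metadata path is the one backed up later in this task, and fsimage/edits files embed their transaction IDs, e.g. edits_<start>-<end> and fsimage_<txid>):

  # List fsimages and edit segments with their embedded transaction IDs
  sudo ls -l /var/lib/hadoop/name/current/ | egrep 'fsimage|edits'
  # If the available edit segments no longer cover the transactions after the
  # rollback fsimage's last txid, the rollback cannot replay them.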

I also tried to perform the other kind of rollback, the one that does not preserve the data generated/added to HDFS between the upgrade and the rollback. It didn't work either, since apparently doing a rolling upgrade didn't create the necessary fsimages where the namenode expected them. I haven't tried to manually move the rolling-upgrade fsimages to different fs locations (I just thought about it now), but it wouldn't have been very clean anyway.

From https://hadoop.apache.org/docs/r2.8.5/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html it seems that we should attempt a regular upgrade (not a rolling upgrade), with the caveat that all the data written between upgrade and rollback might be lost. This seems to be the only way with a QJM, even if it feels a little strange.

The other option is that a rolling upgrade works with QJM but only if the gap between upgrade and rollback is limited (so the edit log is still available), but I didn't see any trace of this in the docs. I'll follow up with the bigtop mailing list to see if anybody has had the same experience.

For reference, I filed https://issues.apache.org/jira/browse/BIGTOP-3341 to ask for complete OpenSSL 1.1.1 support in BigTop 1.5 (so that no issues arise when we migrate to Buster).

Change 598450 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set BigTop repository config for the Hadoop Test cluster

https://gerrit.wikimedia.org/r/598450

Change 598450 merged by Elukey:
[operations/puppet@production] Set BigTop repository config for the Hadoop Test cluster

https://gerrit.wikimedia.org/r/598450

Upgraded a second time and failed with a different issue. This time I ended up with a lot of missing/under-replicated blocks, and ~7% of the total blocks corrupted.
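
As a side note, the block health numbers can be double-checked with a standard fsck run on the master (same kerberos wrapper used elsewhere in this task); a sketch:

  # Summarise missing/corrupt/under-replicated blocks across the cluster
  sudo -u hdfs kerberos-run-command hdfs hdfs fsck / | tail -n 30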

Judging from https://docs.cloudera.com/documentation/enterprise/5-15-x/topics/cdh_ig_earlier_cdh5_upgrade.html#topic_6_3_10 I think I didn't wait long enough before moving from the primary namenode bootstrap to the secondary, and then to the datanodes.

Will attempt a rollback and a roll forward :(

Change 598693 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set CDH repository back for the Hadoop test cluster

https://gerrit.wikimedia.org/r/598693

Change 598693 merged by Elukey:
[operations/puppet@production] Set CDH repository back for the Hadoop test cluster

https://gerrit.wikimedia.org/r/598693

I was able to roll back successfully, but the caveat is that the datanodes need to be rolled back as well. When the upgrade command is issued to the namenode, it does two things:

  1. saves a copy of the fs image as "previous" in a known location
  2. tells all the datanodes to do the same, using hard links (under /var/lib/hadoop/data/$letter/dn/current/BP-etc../ one can see a previous and a current directory, normally we have only current; a verification sketch follows the next list)

Then there are two possibilities:

  1. finalize the upgrade, so the previous state is discarded
  2. rollback, so the current state is discarded
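
A minimal sketch of how to verify this on a datanode, with $letter and the BP-... block pool id left as placeholders (as in the comment above):

  # A 'previous' directory appears next to 'current' only while an upgrade is pending
  sudo ls /var/lib/hadoop/data/$letter/dn/current/BP-*/

  # The block files under 'previous' are hard links to the same inodes as 'current',
  # so the pending-upgrade copy costs almost no extra disk space (link count > 1)
  sudo find /var/lib/hadoop/data/$letter/dn/current/BP-*/previous -type f -links +1 | head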

Change 605858 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set Bigtop for Hadoop test

https://gerrit.wikimedia.org/r/605858

Change 605858 merged by Elukey:
[operations/puppet@production] Set Bigtop for Hadoop test

https://gerrit.wikimedia.org/r/605858

Today I was able to roll back BigTop, and since the previous rollout attempt went fine, this is the first time that we do a back and forth without corrupting HDFS. To avoid losing any important data:

Upgrade from CDH to BigTop

=== safety steps ===

- merge the puppet change to use the BigTop repo and run puppet everywhere (so we'll have the packages ready to install later on with puppet disabled)

- disable puppet on all hosts
  sudo cumin 'analytics10[28-41]*  or an-tool1006*' 'disable-puppet "elukey - upgrading to bigtop"' 

- Stop Oozie, Hive, Presto and the timers
   sudo cumin 'analytics1030*' 'systemctl stop oozie'
   sudo cumin 'analytics1030*' 'systemctl stop hive-server2'
   sudo cumin 'analytics1030*' 'systemctl stop hive-metastore'
   sudo cumin 'analytics1030*' 'systemctl stop *.timer'
   sudo cumin 'analytics1030*' 'systemctl stop presto-server'

- unmount /mnt/hdfs
  sudo cumin 'an-tool1006*' 'umount /mnt/hdfs'
  sudo cumin 'analytics1030*' 'umount /mnt/hdfs'

- Stop all daemons like Hue, Jupyter, etc.
   sudo cumin 'analytics1039*' 'systemctl stop hue'
   sudo cumin 'an-tool1006*' 'systemctl stop jupyterhub' (for prod, do we also need to stop all the notebooks?)

- Check running jobs on workers
  sudo cumin 'A:hadoop-worker-test' 'ps aux | grep java| egrep -v "JournalNode|DataNode|NodeManager"'

- Check HDFS Active/Standby
   sudo cumin 'analytics1028*' 'sudo kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState analytics1028-eqiad-wmnet'
   sudo cumin 'analytics1028*' 'sudo kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState analytics1029-eqiad-wmnet'

- enter HDFS Safe mode
   sudo cumin 'analytics1028*' 'sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter'
   sudo cumin 'analytics1028*' 'sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace'

- Backup namenode dir on the HDFS master node
   sudo cumin 'analytics1028*' 'cd /var/lib/hadoop/name && tar -cvf /root/hadoop-namedir-backup-bigtop-upgrade-$(date +%s).tar .'

- backup each database that we are interested in separately (like hive_metastore, oozie, etc.). One giant backup is more difficult to restore.
sudo cumin 'analytics1030*' 'mysqldump hive_metastore > hive_metastore_$(date +%s).sql'
sudo cumin 'analytics1030*' 'mysqldump oozie > oozie_$(date +%s).sql'

- Don't upgrade Hue to the new packages, use the CDH ones for the moment.

== Procedure ==

Described in https://hadoop.apache.org/docs/r2.8.5/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html for HDFS
Stop the whole cluster as usual:
    - Yarn node managers + Yarn RM first
       sudo cumin 'A:hadoop-worker-test' 'systemctl stop hadoop-yarn-nodemanager'
       sudo cumin 'analytics1028*' 'systemctl stop hadoop-yarn-resourcemanager'
       sudo cumin 'analytics1029*' 'systemctl stop hadoop-yarn-resourcemanager'
    - all HDFS datanodes
       sudo cumin 'A:hadoop-worker-test' 'systemctl stop hadoop-hdfs-datanode' -b 1 -s 60
    - Secondary NN, Active NN down
      sudo cumin 'analytics1029*' 'systemctl stop hadoop-hdfs-namenode'
      sudo cumin 'analytics1029*' 'systemctl stop hadoop-hdfs-zkfc'
      sudo cumin 'analytics1028*' 'systemctl stop hadoop-mapreduce-historyserver'
    - JournalNodes
      sudo cumin 'A:hadoop-hdfs-journal-test' 'systemctl stop hadoop-hdfs-journalnode' -b 1 -s 60

- Run ps aux | grep java across all nodes to check whether any jvms are still running and whether that is ok (Druid, for example, is fine to keep running).

- Remove the Yarn zookeeper znodes (see the sketch below for how these commands are run):
    setAcl /yarn-rmstore/analytics-test-hadoop/ZKRMStateRoot world:anyone:cdrwa
    rmr /yarn-rmstore/analytics-test-hadoop/ZKRMStateRoot
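
  These znode commands are run from an interactive ZooKeeper CLI session; a minimal sketch, assuming the stock zkCli.sh client (the client path and the ensemble host are assumptions, not stated in this task):

    # Open a session against the ZooKeeper ensemble used by the test cluster
    /usr/share/zookeeper/bin/zkCli.sh -server <zookeeper-host>:2181
    # then, inside the CLI:
    #   setAcl /yarn-rmstore/analytics-test-hadoop/ZKRMStateRoot world:anyone:cdrwa
    #   rmr /yarn-rmstore/analytics-test-hadoop/ZKRMStateRoot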

- On all worker nodes:
    sudo cumin 'A:hadoop-worker-test' 'rm -rf /tmp/hadoop-yarn/*'
    sudo cumin 'A:hadoop-worker-test' "dpkg -l | grep cdh | awk '{print \$2}' | tr '\n' ' ' > /home/elukey/cdh_package_list"
     sudo cumin 'A:hadoop-worker-test' "apt-get remove -y \$(cat /home/elukey/cdh_package_list)"
     sudo cumin 'A:hadoop-worker-test' "apt-cache policy hadoop"
     sudo cumin 'A:hadoop-hdfs-journal-test' 'apt-get install -y `cat /home/elukey/cdh_package_list | tr " " "\n" | egrep -v "avro-libs|hadoop-0.20-mapreduce|kite|parquet|parquet-format|sentry" | tr "\n" " "`' -b 1 -s 60
     sudo cumin 'A:hadoop-worker-test and not A:hadoop-hdfs-journal-test'  'apt-get install -y `cat /home/elukey/cdh_package_list | tr " " "\n" | egrep -v "avro-libs|hadoop-0.20-mapreduce|kite|parquet|parquet-format|sentry" | tr "\n" " "`' -b 1 -s 60

- At this point, Yarn NM and Hadoop JN/DN should all be up again; verify:
   sudo cumin 'A:hadoop-worker-test' 'ps aux | grep java| egrep "JournalNode|DataNode|NodeManager" | grep -v egrep| wc -l'

- Upgrade packages on the master (analytics1028)
sudo cumin 'analytics102*' "dpkg -l | grep cdh | awk '{print \$2}' | tr '\n' ' ' > /home/elukey/cdh_package_list"
sudo cumin 'analytics102*' "apt-get remove -y \$(cat /home/elukey/cdh_package_list)"
sudo cumin 'analytics1028*' 'apt-get install -y `cat /home/elukey/cdh_package_list | tr " " "\n" | egrep -v "avro-libs|hadoop-0.20-mapreduce|kite|parquet|parquet-format|sentry" | tr "\n" " "`'

- Start the NN on the master with the -upgrade flag:
    sudo service hadoop-hdfs-namenode upgrade
- Start Yarn and the history server without any particular flag

Upgrade the packages on the standby (analytics1029)

sudo cumin 'analytics1029*' 'apt-get install -y `cat /home/elukey/cdh_package_list | tr " " "\n" | egrep -v "avro-libs|hadoop-0.20-mapreduce|kite|parquet|parquet-format|sentry" | tr "\n" " "`'
- Start the NN on the standby with the -bootstrapStandby flag:
    sudo cumin 'analytics1029*' 'sudo -u hdfs kerberos-run-command hdfs /usr/bin/hdfs namenode -bootstrapStandby'
    sudo cumin 'analytics1029*' 'systemctl start hadoop-hdfs-namenode'
-  Start Yarn without any particular flag

Upgrade the Hive metastore:
    /usr/lib/hive/bin/schematool -dbType mysql -upgradeSchemaFrom 1.1.0
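
Optionally, the metastore schema version can be checked before and after with schematool's info mode (a sketch, same binary as above):

    /usr/lib/hive/bin/schematool -dbType mysql -info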


Rollback from BigTop to CDH:

sudo cumin 'analytics1028*' 'echo Y | sudo -u hdfs kerberos-run-command hdfs hdfs namenode -rollback'
sudo cumin -m async 'A:hadoop-worker-test' 'systemctl stop hadoop-hdfs-datanode' 'service hadoop-hdfs-datanode rollback' 'systemctl start hadoop-hdfs-datanode' -b 1 -s 30

Change 606736 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] WIP - Add sre.hadoop.change-distro.py

https://gerrit.wikimedia.org/r/606736

Change 606736 merged by Elukey:
[operations/cookbooks@master] hadoop - Add change-distro.py and stop-cluster.py

https://gerrit.wikimedia.org/r/606736

Change 609436 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] sre.hadoop.stop-cluster.py: fix minor errors/details

https://gerrit.wikimedia.org/r/609436

Change 609436 merged by Elukey:
[operations/cookbooks@master] sre.hadoop.stop-cluster.py: fix minor errors/details

https://gerrit.wikimedia.org/r/609436

Change 609442 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] sre.hadoop.change-distro.py: fix misc details

https://gerrit.wikimedia.org/r/609442

Change 609442 merged by Elukey:
[operations/cookbooks@master] sre.hadoop.change-distro.py: fix misc details

https://gerrit.wikimedia.org/r/609442

Change 609452 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set BigTop for Hadoop master/standby/worker nodes.

https://gerrit.wikimedia.org/r/609452

Change 609452 merged by Elukey:
[operations/puppet@production] Set BigTop for Hadoop master/standby/worker nodes.

https://gerrit.wikimedia.org/r/609452

Change 609975 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] sre.hadoop.change-distro: improve procedure and logging

https://gerrit.wikimedia.org/r/609975

Change 609975 merged by Elukey:
[operations/cookbooks@master] sre.hadoop.change-distro: improve procedure and logging

https://gerrit.wikimedia.org/r/609975

The cookbooks seem to run fine, but sometimes during rollback I get instances of the following problem on the journalnodes:

2020-07-08 14:44:50,538 WARN org.apache.hadoop.hdfs.qjournal.server.GetJournalEditServlet: Received an invalid request file transfer request from 10.64.36.128: This node has namespaceId '0 and clusterId '' but the requesting node expected '2082959117' and 'CID-a9158735-e6ad-4da1-8caf-897f0d650a79'

The Namenode, right after the rollback command, cannot start because the journal nodes are in a weird state, as if they don't recognize the content of their /var/lib/hadoop/journal directory (even though, when they log this, the VERSION file contains the namespace and clusterId that the namenode expects). After a few restarts, they recognize the edit files and everything starts working.
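
A quick sketch of how to compare what a journal node and the namenode believe their IDs are (the VERSION file mentioned above lives under the journal's current/ directory; namenode path as used earlier in this task):

# On a journal node: namespaceID / clusterID as stored on disk
sudo cat /var/lib/hadoop/journal/analytics-test-hadoop/current/VERSION

# On the namenode: the values it expects
sudo cat /var/lib/hadoop/name/current/VERSION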

These are the logs when the journal nodes are in the weird state:

2020-07-08 14:38:02,145 INFO org.mortbay.log: Started SslSocketConnectorSecure@0.0.0.0:8481
2020-07-08 14:38:02,188 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 500
2020-07-08 14:38:02,206 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 8485
2020-07-08 14:38:02,463 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2020-07-08 14:38:02,463 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8485: starting
2020-07-08 14:40:15,954 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for hdfs/analytics1028.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)
2020-07-08 14:40:16,127 INFO org.apache.hadoop.hdfs.qjournal.server.JournalNode: Initializing journal in directory /var/lib/hadoop/journal/analytics-test-hadoop
2020-07-08 14:40:16,146 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /var/lib/hadoop/journal/analytics-test-hadoop/in_use.lock acquired by nodename 34988@analytics1038
2020-07-08 14:40:16,161 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Scanning storage FileJournalManager(root=/var/lib/hadoop/journal/analytics-test-hadoop)
2020-07-08 14:40:16,310 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Latest log is EditLogFile(file=/var/lib/hadoop/journal/analytics-test-hadoop/current/edits_0000000000006330469-0000000000006330469,first=0000000000006330469,last=0000000000006330469,inProgress=false,hasCorruptHeader=false)
2020-07-08 14:40:16,311 INFO org.apache.hadoop.hdfs.server.namenode.NNUpgradeUtil: Storage directory /var/lib/hadoop/journal/analytics-test-hadoop does not contain previous fs state.
2020-07-08 14:44:22,025 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for hdfs/analytics1028.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)
2020-07-08 14:44:22,035 INFO org.apache.hadoop.hdfs.server.namenode.NNUpgradeUtil: Storage directory /var/lib/hadoop/journal/analytics-test-hadoop does not contain previous fs state.
2020-07-08 14:44:46,942 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for hdfs/analytics1028.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)
2020-07-08 14:44:50,538 WARN org.apache.hadoop.hdfs.qjournal.server.GetJournalEditServlet: Received an invalid request file transfer request from 10.64.36.128: This node has namespaceId '0 and clusterId '' but the requesting node expected '2082959117' and 'CID-a9158735-e6ad-4da1-8caf-897f0d650a79'

Meanwhile the following logs are related to a good state (that allows the Namenode to start):

2020-07-08 14:59:26,975 INFO org.mortbay.log: Started SslSocketConnectorSecure@0.0.0.0:8481
2020-07-08 14:59:27,018 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 500
2020-07-08 14:59:27,036 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 8485
2020-07-08 14:59:27,314 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2020-07-08 14:59:27,314 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8485: starting
2020-07-08 15:00:08,067 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for hdfs/analytics1028.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)
2020-07-08 15:00:08,210 INFO org.apache.hadoop.hdfs.qjournal.server.JournalNode: Initializing journal in directory /var/lib/hadoop/journal/analytics-test-hadoop
2020-07-08 15:00:08,232 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /var/lib/hadoop/journal/analytics-test-hadoop/in_use.lock acquired by nodename 36134@analytics1038
2020-07-08 15:00:08,262 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Scanning storage FileJournalManager(root=/var/lib/hadoop/journal/analytics-test-hadoop)
2020-07-08 15:00:08,393 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Latest log is EditLogFile(file=/var/lib/hadoop/journal/analytics-test-hadoop/current/edits_0000000000006330469-0000000000006330469,first=0000000000006330469,last=0000000000006330469,inProgress=false,hasCorruptHeader=false)
2020-07-08 15:00:13,156 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for hdfs/analytics1028.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)
2020-07-08 15:00:13,224 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Updating lastPromisedEpoch from 34 to 35 for client /10.64.36.128
2020-07-08 15:00:13,230 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Scanning storage FileJournalManager(root=/var/lib/hadoop/journal/analytics-test-hadoop)
2020-07-08 15:00:13,279 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Latest log is EditLogFile(file=/var/lib/hadoop/journal/analytics-test-hadoop/current/edits_0000000000006330469-0000000000006330469,first=0000000000006330469,last=0000000000006330469,inProgress=false,hasCorruptHeader=false)
2020-07-08 15:00:13,351 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: getSegmentInfo(6330469): EditLogFile(file=/var/lib/hadoop/journal/analytics-test-hadoop/current/edits_0000000000006330469-0000000000006330469,first=0000000000006330469,last=0000000000006330469,inProgress=false,hasCorruptHeader=false) -> startTxId: 6330469 endTxId: 6330469 isInProgress: false

The main difference seems to be:

2020-07-08 14:40:16,311 INFO org.apache.hadoop.hdfs.server.namenode.NNUpgradeUtil: Storage directory /var/lib/hadoop/journal/analytics-test-hadoop does not contain previous fs state.

There is no mention in tutorials or in the init.d files of the possibility of rolling back a journalnode's state; maybe there is a step missing in the procedure?

I tried another upgrade and checked one of the journal nodes, finding a previous state:

elukey@analytics1031:~$ ls /var/lib/hadoop/journal/analytics-test-hadoop/
current  in_use.lock  previous

I didn't notice it when trying to fix the error during rollback, so the journal nodes must go through the rollback process by themselves and/or be guided by the Namenodes.

Change 610336 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] sre.hadoop: add logging and more backup actions

https://gerrit.wikimedia.org/r/610336

Change 610336 merged by Elukey:
[operations/cookbooks@master] sre.hadoop: add logging and more backup actions

https://gerrit.wikimedia.org/r/610336

Change 610721 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] sre.hadoop.change-distro: modify restart procedure and remove previous state

https://gerrit.wikimedia.org/r/610721

Change 610721 merged by Elukey:
[operations/cookbooks@master] sre.hadoop.change-distro: modify restart procedure and remove previous state

https://gerrit.wikimedia.org/r/610721

I've done another round of rollout/rollback, and I found the following interesting log:

2020-07-09 08:52:37,907 INFO org.apache.hadoop.hdfs.server.namenode.NNUpgradeUtil: Rollback of /var/lib/hadoop/journal/analytics-test-hadoop is complete.

This happens only when we issue the NN rollback command; the previous state is not removed before that (package downgrade, restarts, etc.). I found that completely stopping the JNs after the package downgrade, and then starting them again, helps with the spurious bug mentioned above. Updated the cookbooks; I will do another round of tests to verify that it is now better.
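
For reference, a hedged sketch of that extra step expressed as the equivalent manual cumin commands (aliases and units as used earlier in this task), rather than the cookbook code itself:

# After the package downgrade, fully stop the journal nodes, then start them
# again one at a time so they re-scan their storage directory cleanly
sudo cumin 'A:hadoop-hdfs-journal-test' 'systemctl stop hadoop-hdfs-journalnode'
sudo cumin 'A:hadoop-hdfs-journal-test' 'systemctl start hadoop-hdfs-journalnode' -b 1 -s 60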

Change 611392 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] sre.hadoop.change-distro.py: change logic for JN roll restart

https://gerrit.wikimedia.org/r/611392

Change 611392 merged by Elukey:
[operations/cookbooks@master] sre.hadoop.change-distro.py: change logic for JN roll restart

https://gerrit.wikimedia.org/r/611392

Instructions to upgrade the coordinator (i.e. the host where hive/oozie run):

dpkg -l | awk '/+cdh/ {print $2}' | tr '\n' ' ' > /root/cdh_package_list
apt-get remove -y `cat /root/cdh_package_list`
apt-get install -y `cat /root/cdh_package_list | tr ' ' '\n' | egrep -v 'avro-libs|hadoop-0.20-mapreduce|kite|parquet|parquet-format|sentry|flume-ng|pig' | tr '\n' ' '`
/usr/lib/hive/bin/schematool -dbType mysql -upgradeSchemaFrom 1.1.0

For the client (an-tool1006) it is sufficient to remove the cdh packages, run puppet and restart jupyterhub (we'll probably do something similar for the stat boxes).
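
A minimal sketch of those client steps (package-list pattern as above; run-puppet-agent is assumed to be the usual wrapper, a plain puppet agent run works too):

dpkg -l | awk '/+cdh/ {print $2}' | tr '\n' ' ' > /root/cdh_package_list
apt-get remove -y `cat /root/cdh_package_list`
run-puppet-agent        # puppet brings in the BigTop packages
systemctl restart jupyterhub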

Change 619466 had a related patch set uploaded (by Elukey; owner: Elukey):
[analytics/refinery@master] hive: quote all usages of percent/range words

https://gerrit.wikimedia.org/r/619466

elukey@analytics1028:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -finalizeUpgrade
Finalize upgrade successful for analytics1028.eqiad.wmnet/10.64.36.128:8020
Finalize upgrade successful for analytics1029.eqiad.wmnet/10.64.36.129:8020

Took a few seconds, no issues registered. In a bigger and more crowded HDFS environment it will probably take longer.

These are the logs that I found on the HDFS active namenode:

[...]
2020-08-11 08:50:41,450 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.64.36.133:50010 is added to blk_1074633847_893023{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-62fe8cb8-8071-488c-ab73-55ddfaae71a6:NORMAL:10.64.53.17:50010|RBW], ReplicaUnderConstruction[[DISK]DS-1a83f355-ef7a-44ca-b3c4-23bc90188567:NORMAL:10.64.36.134:50010|FINALIZED], ReplicaUnderConstruction[[DISK]DS-8f8cdb06-d494-48de-95b3-8452038beb40:NORMAL:10.64.36.133:50010|FINALIZED]]} size 0
2020-08-11 08:50:41,451 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.64.53.17:50010 is added to blk_1074633847_893023{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-1a83f355-ef7a-44ca-b3c4-23bc90188567:NORMAL:10.64.36.134:50010|FINALIZED], ReplicaUnderConstruction[[DISK]DS-8f8cdb06-d494-48de-95b3-8452038beb40:NORMAL:10.64.36.133:50010|FINALIZED], ReplicaUnderConstruction[[DISK]DS-0c8b6382-d655-4fb7-96c6-5a9282e58812:NORMAL:10.64.53.17:50010|FINALIZED]]} size 0
2020-08-11 10:15:45,468 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Recovering unfinalized segments in /var/lib/hadoop/name/current
2020-08-11 14:14:48,759 INFO org.apache.hadoop.hdfs.server.namenode.NNUpgradeUtil: Finalize upgrade for /var/lib/hadoop/name is complete.

And on the standby:

[...]
2020-08-11 08:51:44,504 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.64.53.17:50010 is added to blk_1074633847_893023{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-1a83f355-ef7a-44ca-b3c4-23bc90188567:NORMAL:10.64.36.134:50010|FINALIZED], ReplicaUnderConstruction[[DISK]DS-8f8cdb06-d494-48de-95b3-8452038beb40:NORMAL:10.64.36.133:50010|FINALIZED], ReplicaUnderConstruction[[DISK]DS-0c8b6382-d655-4fb7-96c6-5a9282e58812:NORMAL:10.64.53.17:50010|FINALIZED]]} size 0
2020-08-11 14:14:49,402 INFO org.apache.hadoop.hdfs.server.namenode.NNUpgradeUtil: Finalize upgrade for /var/lib/hadoop/name is complete.

And on one journal node:

[...]
2020-07-14 11:04:53,318 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Validating log segment /var/lib/hadoop/journal/analytics-test-hadoop/current/edits_inprogress_0000000000006341117 about to be finalized
2020-08-11 10:15:45,398 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Validating log segment /var/lib/hadoop/journal/analytics-test-hadoop/current/edits_inprogress_0000000000007483874 about to be finalized
2020-08-11 14:14:49,217 INFO org.apache.hadoop.hdfs.server.namenode.NNUpgradeUtil: Finalize upgrade for /var/lib/hadoop/journal/analytics-test-hadoop is complete.

Change 619466 merged by Nuria:
[analytics/refinery@master] hive: quote all usages of percent/range words

https://gerrit.wikimedia.org/r/619466

Spark2 seems unable to run any Yarn jobs; I see the following in the application's logs:

20/08/12 07:22:09 ERROR ApplicationMaster: RECEIVED SIGNAL TERM
20/08/12 07:22:09 WARN ScriptBasedMapping: Exception running /etc/hadoop/conf.analytics-test-hadoop/net-topology.sh 10.64.53.15
ExitCodeException exitCode=143:
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:575)
	at org.apache.hadoop.util.Shell.run(Shell.java:478)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:766)
	at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.runResolveCommand(ScriptBasedMapping.java:251)
	at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.resolve(ScriptBasedMapping.java:188)
	at org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:119)
	at org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:101)
	at org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:81)
	at org.apache.spark.deploy.yarn.SparkRackResolver.resolve(SparkRackResolver.scala:37)
	at org.apache.spark.deploy.yarn.YarnAllocator$$anon$1$$anonfun$run$1.apply(YarnAllocator.scala:422)
	at org.apache.spark.deploy.yarn.YarnAllocator$$anon$1$$anonfun$run$1.apply(YarnAllocator.scala:421)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.deploy.yarn.YarnAllocator$$anon$1.run(YarnAllocator.scala:421)

And on the Yarn node manager:

Full command array for failed execution:
[/usr/lib/hadoop-yarn/bin/container-executor, analytics, analytics, 1, application_1597140872253_0524, container_1597140872253_0524_01_000002, /var/lib/hadoop/data/g/yarn/local/usercache/analy
tics/appcache/application_1597140872253_0524/container_1597140872253_0524_01_000002, /var/lib/hadoop/data/m/yarn/local/nmPrivate/application_1597140872253_0524/container_1597140872253_0524_01_
000002/launch_container.sh, /var/lib/hadoop/data/g/yarn/local/nmPrivate/application_1597140872253_0524/container_1597140872253_0524_01_000002/container_1597140872253_0524_01_000002.tokens, /var/lib/hadoop/data/m/yarn/local/nmPrivate/application_1597140872253_0524/container_1597140872253_0524_01_000002/container_1597140872253_0524_01_000002.pid, /var/lib/hadoop/data/b/yarn/local%/var/lib/hadoop/data/e/yarn/local%/var/lib/hadoop/data/g/yarn/local%/var/lib/hadoop/data/i/yarn/local%/var/lib/hadoop/data/j/yarn/local%/var/lib/hadoop/data/l/yarn/local%/var/lib/hadoop/data/m/yarn/local, /var/lib/hadoop/data/b/yarn/logs%/var/lib/hadoop/data/e/yarn/logs%/var/lib/hadoop/data/g/yarn/logs%/var/lib/hadoop/data/i/yarn/logs%/var/lib/hadoop/data/j/yarn/logs%/var/lib/hadoop/data/l/yarn/logs%/var/lib/hadoop/data/m/yarn/logs, cgroups=none]
2020-08-12 08:10:33,467 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime: Launch container failed. Exception:
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: ExitCodeException exitCode=143:
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:177)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.launchContainer(DefaultLinuxContainerRuntime.java:107)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.launchContainer(DelegatingLinuxContainerRuntime.java:130)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:395)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

It seems as if the SparkRackResolver on every worker tries to run /etc/hadoop/conf.analytics-test-hadoop/net-topology.sh and gets an error from the container and/or the file system.

After a lot of debugging, the issue turned out to be related to the AM getting killed for using too much virtual memory:

2020-08-12 14:39:51,594 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=17413,containerID=container_1597140872253_0672_04_000001] is running beyond virtual memory limits. Current usage: 338.7 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used. Killing container.
[..]
Full command array for failed execution:
[/usr/lib/hadoop-yarn/bin/container-executor, analytics, analytics, 1, application_1597140872253_0672, container_1597140872253_0672_04_000001, /var/lib/hadoop/data/k/yarn/local/usercache/analytics/appcache/application_1597140872253_0672/container_1597140872253_0672_04_000001, /var/lib/hadoop/data/k/yarn/local/nmPrivate/application_1597140872253_0672/container_1597140872253_0672_04_000001/launch_container.sh, /var/lib/hadoop/data/h/yarn/local/nmPrivate/application_1597140872253_0672/container_1597140872253_0672_04_000001/container_1597140872253_0672_04_000001.tokens, /var/lib/hadoop/data/i/yarn/local/nmPrivate/application_1597140872253_0672/container_1597140872253_0672_04_000001/container_1597140872253_0672_04_000001.pid, /var/lib/hadoop/data/b/yarn/local%/var/lib/hadoop/data/c/yarn/local%/var/lib/hadoop/data/d/yarn/local%/var/lib/hadoop/data/e/yarn/local%/var/lib/hadoop/data/f/yarn/local%/var/lib/hadoop/data/g/yarn/local%/var/lib/hadoop/data/h/yarn/local%/var/lib/hadoop/data/i/yarn/local%/var/lib/hadoop/data/j/yarn/local%/var/lib/hadoop/data/k/yarn/local%/var/lib/hadoop/data/l/yarn/local%/var/lib/hadoop/data/m/yarn/local, /var/lib/hadoop/data/b/yarn/logs%/var/lib/hadoop/data/c/yarn/logs%/var/lib/hadoop/data/d/yarn/logs%/var/lib/hadoop/data/e/yarn/logs%/var/lib/hadoop/data/f/yarn/logs%/var/lib/hadoop/data/g/yarn/logs%/var/lib/hadoop/data/h/yarn/logs%/var/lib/hadoop/data/i/yarn/logs%/var/lib/hadoop/data/j/yarn/logs%/var/lib/hadoop/data/k/yarn/logs%/var/lib/hadoop/data/l/yarn/logs%/var/lib/hadoop/data/m/yarn/logs, cgroups=none]
2020-08-12 14:39:51,601 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime: Launch container failed. Exception:

Since spark2-shell and eventlogging_to_druid_navigationtiming_hourly in hadoop test both use spark in client mode (the latter for better alerting with timers/icinga), the following resolved the problem: --conf spark.yarn.am.memory=2g (the default is 512m).

My speculation is that on hadoop 2.8.5 more libs etc. are loaded in memory for the same container, and added on top of the spark ones they hit the container's limits. Adding spark.yarn.am.memory=2g to spark-defaults seems good from my perspective.
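
For example, a client-mode shell started with the bumped AM memory would look like this (a sketch; in client mode the driver runs locally and the Yarn AM gets spark.yarn.am.memory, which defaults to 512m):

spark2-shell --master yarn --conf spark.yarn.am.memory=2g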

Change 619788 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add yarn.nodemanager.vmem-pmem-ratio setting to Hadoop test

https://gerrit.wikimedia.org/r/619788

Change 619788 merged by Elukey:
[operations/puppet@production] Add yarn.nodemanager.vmem-pmem-ratio setting to Hadoop test

https://gerrit.wikimedia.org/r/619788

Setting yarn.nodemanager.vmem-pmem-ratio to 5.1 (roughly 5 to 1, up from the default 2.1) also seems to solve the issue!

After the last change, I have tested (from an-tool1006):

  • pyspark and spark shells
  • refine and druid indexation timers (all spark based, the former running in cluster mode and the latter in client mode)
  • pyspark via notebooks

Everything looks good, no errors registered. I have also checked that the Druid indexations were successful; no errors reported on that side either.

Overall it seems that there are no blockers with Bigtop, everything works as expected. The next step will be to have Joseph do another round of tests to find what I have missed :)