
Upgrade the Hadoop test cluster to BigTop
Closed, Resolved (Public)

Description

Time to upgrade the Hadoop test cluster to BigTop, in order to work out a possible upgrade/migration procedure from CDH.

Things to keep in mind:

The HDFS Namenode daemon's init script supports the following actions:

# When running upgrade, ensure that -renameReserved is added by default.
upgrade|rollback)
  DAEMON_FLAGS="$DAEMON_FLAGS -${@}"
  if [[ ! " ${DAEMON_FLAGS} " =~ " -renameReserved " ]] && [[ " ${DAEMON_FLAGS} " =~ " -upgrade " ]]; then
    DAEMON_FLAGS="$DAEMON_FLAGS -renameReserved"
  fi
  start
  ;;
rollingUpgradeStarted)
  DAEMON_FLAGS="$DAEMON_FLAGS -rollingUpgrade started"
  start
  ;;
rollingUpgradeRollback)
  DAEMON_FLAGS="$DAEMON_FLAGS -rollingUpgrade rollback"
  start
  ;;
rollingUpgradeDowngrade)
  DAEMON_FLAGS="$DAEMON_FLAGS -rollingUpgrade downgrade"
  start
  ;;

Meanwhile, the Datanode's init script has:

rollback)
  DAEMON_FLAGS="$DAEMON_FLAGS -${1}"
  start
  ;;
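
For reference, a minimal sketch of how these init actions end up being invoked on the hosts (based on the service commands used later in this task; service names as shipped by the packages):

  # Namenode: start with the -upgrade flag (the init script appends -renameReserved automatically)
  sudo service hadoop-hdfs-namenode upgrade

  # Namenode: roll back to the pre-upgrade state
  sudo service hadoop-hdfs-namenode rollback

  # Datanode: roll back its local block pool state
  sudo service hadoop-hdfs-datanode rollback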

Procedure WIP in https://etherpad.wikimedia.org/p/analytics-bigtop

Details

Related patches (repo, branch, lines +/-; patch subjects omitted):
  operations/puppet (production): +5 -0
  analytics/refinery (master): +18 -18
  operations/cookbooks (master): +16 -14
  operations/cookbooks (master): +13 -2
  operations/cookbooks (master): +8 -1
  operations/cookbooks (master): +14 -10
  operations/puppet (production): +6 -6
  operations/cookbooks (master): +2 -2
  operations/cookbooks (master): +12 -9
  operations/cookbooks (master): +381 -0
  operations/puppet (production): +10 -10
  operations/puppet (production): +10 -10
  operations/puppet (production): +10 -7
  operations/puppet (production): +3 -0
  operations/puppet (production): +7 -8
  operations/puppet (production): +15 -352
  operations/puppet (production): +4 -3
  operations/puppet (production): +0 -12
  operations/puppet (production): +1 -1
  operations/puppet (production): +12 -2
  operations/puppet (production): +1 -0
  operations/puppet (production): +5 -3
  operations/puppet (production): +15 -0
  operations/puppet (production): +13 -5

Event Timeline


Change 575242 merged by Elukey:
[operations/puppet@production] cdh::hive: improve jar file match regex to work with BigTop

https://gerrit.wikimedia.org/r/575242

Created https://issues.apache.org/jira/browse/BIGTOP-3317 after debugging why the oozie sharedlib create command was failing.

Change 576099 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] cdh::hive: remove DBTokenStore from hive-site.xml config

https://gerrit.wikimedia.org/r/576099

Change 576099 merged by Elukey:
[operations/puppet@production] cdh::hive: remove DBTokenStore from hive-site.xml config

https://gerrit.wikimedia.org/r/576099

Change 577771 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::oozie::server: allow to set the sharedlibpath

https://gerrit.wikimedia.org/r/577771

Change 577771 merged by Elukey:
[operations/puppet@production] profile::oozie::server: allow to set the sharedlibpath

https://gerrit.wikimedia.org/r/577771

@nshahquinn-wmf Hi! In Hadoop test we are testing the migration from Cloudera's CDH to Apache BigTop, which among other things ships with Hadoop 2.8 and Hive 2.2.3. Would you be interested in doing some tests? I don't have anything in particular in mind, but I recall that you were passionate about Hive 2, so this is why I am asking :)

In case you want to do a quick test, ssh to an-tool1006 (and kinit there as you do elsewhere).

The rollback of HDFS at this stage should be easy; the main question mark is the oozie/hive db schemas. We have been running the Hadoop cluster with the new version of HDFS for some days, and hive and oozie were upgraded as well (together with their db schemas). During this timeframe oozie jobs were run, and hive changes were made to the metastore. For the hadoop test cluster it might be a simple matter of reverting back to a known good db state (we have backups), but if this happens in production, what would be the strategy?

There are two use cases:

  1. we upgrade hdfs and realize from the first tests that something is wrong, so we roll back. No issue with Hive/Oozie, since their db state didn't change.
  2. we upgrade hdfs, and we realize only a few days afterwards that something is broken and we need to roll back.

Case 2) is challenging since multiple users plus our recurring jobs would have already changed their state. Rolling back to a previous db status might cause inconsistencies here and there that would be difficult to debug and deal with. Suggestions/comments?

elukey lowered the priority of this task from High to Medium. (Mar 20 2020, 1:15 PM)
elukey added a project: Analytics-Kanban.

Today, while re-installing the new version of the oozie/hadoop packages, I hit a problem that I had forgotten to fix, namely:

Unpacking oozie (4.3.0-2) ...
dpkg: error processing archive /var/cache/apt/archives/oozie_4.3.0-2_all.deb (--unpack):
 trying to overwrite '/usr/lib/oozie/lib/accessors-smart-1.2.jar', which is also in package oozie-client 4.3.0-2
dpkg-deb: error: subprocess paste was killed by signal (Broken pipe)
Errors were encountered while processing:
 /var/cache/apt/archives/oozie_4.3.0-2_all.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)

I manually resolved it like this:

elukey@analytics1030:~$ sudo dpkg -i --force-overwrite /var/cache/apt/archives/oozie_4.3.0-2_all.deb
(Reading database ... 106775 files and directories currently installed.)
Preparing to unpack .../archives/oozie_4.3.0-2_all.deb ...
Unpacking oozie (4.3.0-2) ...
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/accessors-smart-1.2.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/activemq-client-5.13.3.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/apacheds-i18n-2.0.0-M15.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/apacheds-kerberos-codec-2.0.0-M15.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/api-asn1-api-1.0.0-M20.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/api-util-1.0.0-M20.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/asm-5.0.4.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/commons-cli-1.2.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/commons-codec-1.4.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/commons-logging-1.1.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/curator-client-2.5.0.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/curator-framework-2.5.0.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/geronimo-j2ee-management_1.1_spec-1.0.1.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/geronimo-jms_1.1_spec-1.1.1.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/guava-11.0.2.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/hawtbuf-1.11.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/httpclient-4.3.6.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/httpcore-4.3.3.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/jackson-core-asl-1.9.13.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/jackson-mapper-asl-1.9.13.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/jcip-annotations-1.0-1.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/jline-0.9.94.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/json-simple-1.1.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/json-smart-2.3.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/jsr305-1.3.9.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/netty-3.7.0.Final.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/nimbus-jose-jwt-4.41.1.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/oozie-client-4.3.0.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/oozie-hadoop-auth-hadoop-2-4.3.0.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/slf4j-api-1.6.6.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/slf4j-simple-1.6.6.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/xercesImpl-2.10.0.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/xml-apis-1.4.01.jar', which is also in package oozie-client 4.3.0-2
dpkg: warning: overriding problem because --force enabled:
dpkg: warning: trying to overwrite '/usr/lib/oozie/lib/zookeeper-3.4.6.jar', which is also in package oozie-client 4.3.0-2
dpkg: dependency problems prevent configuration of oozie:
 oozie depends on oozie-client (= 4.3.0-2); however:
  Package oozie-client is not configured yet.

dpkg: error processing package oozie (--install):
 dependency problems - leaving unconfigured
Processing triggers for systemd (232-25+deb9u12) ...
Errors were encountered while processing:
 oozie

elukey@analytics1030:~$ sudo apt-get install oozie -f
Reading package lists... Done
Building dependency tree
Reading state information... Done
oozie is already the newest version (4.3.0-2).
The following packages were automatically installed and are no longer required:
  blt libprotobuf10 libssl1.0.0 linux-image-4.9.0-8-amd64 net-tools python-backports-shutil-get-terminal-size python-cycler python-enum34 python-etcd python-funcsigs python-functools32 python-ipython python-ipython-genutils python-joblib
  python-jsonschema python-lxml python-maxminddb python-mock python-mpmath python-pathlib2 python-pbr python-pickleshare python-prompt-toolkit python-protobuf python-pyparsing python-simplegeneric python-subprocess32 python-traitlets python-tz
  python-wcwidth ruby-nokogiri ruby-pkg-config ruby-rgen tk8.6-blt2.5
Use 'sudo apt autoremove' to remove them.
0 upgraded, 0 newly installed, 0 to remove and 22 not upgraded.
2 not fully installed or removed.
After this operation, 0 B of additional disk space will be used.
Do you want to continue? [Y/n] y
No directory, logging in with HOME=/
INFO:debmonitor:Got 0 updates from dpkg hook version 3
INFO:debmonitor:Successfully sent the dpkg_hook update to the DebMonitor server
Setting up oozie-client (4.3.0-2) ...
Setting up oozie (4.3.0-2) ...
update-alternatives: using /etc/oozie/tomcat-conf.http to provide /etc/oozie/tomcat-conf (oozie-tomcat-conf) in auto mode
Processing triggers for man-db (2.7.6.1-2) ...
Processing triggers for systemd (232-25+deb9u12) ...

So it seems that the oozie and oozie-client packages conflict because both ship an overlapping set of jars under /usr/lib/oozie/lib.
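
As a hedged diagnostic sketch (not the actual fix, which was a new oozie package), dpkg itself can show the overlap between the two packages:

# Which installed package currently owns one of the conflicting jars?
dpkg -S /usr/lib/oozie/lib/accessors-smart-1.2.jar

# Once both packages are installed, list the files they both ship under /usr/lib/oozie/lib
comm -12 <(dpkg -L oozie | sort) <(dpkg -L oozie-client | sort) | grep '/usr/lib/oozie/lib'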

Mentioned in SAL (#wikimedia-operations) [2020-03-23T11:27:31Z] <elukey> upload oozie 4.3.0-3 to thirdparty/bigtop14 on wikimedia-stretch - T244499

Fixed, deployed and tested.


After a chat with Joseph, we agreed that case 2) above could be handled simply by keeping the time between the HDFS upgrade and its finalization limited, and warning people that the state of hive/oozie might be rolled back during that timeframe.

Change 583065 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Avoid overriding Hadoop's core files to allow IPv6

https://gerrit.wikimedia.org/r/583065

Change 583065 merged by Elukey:
[operations/puppet@production] Avoid overriding Hadoop's core files to allow IPv6

https://gerrit.wikimedia.org/r/583065

Change 583069 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Restore CDH settings for Hadoop Test

https://gerrit.wikimedia.org/r/583069

Change 583069 merged by Elukey:
[operations/puppet@production] Restore CDH settings for Hadoop Test

https://gerrit.wikimedia.org/r/583069

Change 583303 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set maximum failover retry attempts for HDFS in Hadoop Test

https://gerrit.wikimedia.org/r/583303

Change 583303 merged by Elukey:
[operations/puppet@production] Set maximum failover retry attempts for HDFS in Hadoop Test

https://gerrit.wikimedia.org/r/583303

The first rollback attempt was a disaster: I wasn't able to restore HDFS to its previous state.

From the documentation it seemed possible to roll back the state of HDFS after the upgrade but before having finalized it. Today I tried, but the HDFS namenodes refused to comply, erroring out with complaints about gaps in the edit log. After a bit of research, and after comparing the errors with the edit/fsimage file names (since they contain the range of transactions), I came to the conclusion that with QJM (journal nodes) this kind of rollback is difficult or impossible. The main problem is that the Namenodes create an fsimage to roll back to, but after weeks the gap between what is stored in the edit log and the last transaction of the rollback fsimage becomes too big (since older edits have been folded into newer fsimages). This means that when the Namenode tries to roll back, it reads the rollback fsimage and tries to pull the missing edits from the edit log, but they are not there anymore.
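
For context, the transaction ranges involved can be read directly from the metadata file names on the namenode (a sketch; the metadata path is the one backed up later in this task, and fsimage/edits files embed their transaction IDs, e.g. edits_<start>-<end> and fsimage_<txid>):

  # List fsimages and edit segments with their embedded transaction IDs
  sudo ls -l /var/lib/hadoop/name/current/ | egrep 'fsimage|edits'
  # If the available edit segments no longer cover the transactions after the
  # rollback fsimage's last txid, the rollback cannot replay them.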

I also tried to perform the other kind of rollback, the one that does not preserve the data generated/added to HDFS between the upgrade and the rollback. It didn't work either, since apparently doing a rolling upgrade didn't create the necessary fsimages where the namenode expected them. I haven't tried to manually move the rolling-upgrade fsimages to different fs locations (I just thought about it now), but it wouldn't have been very clean anyway.

From https://hadoop.apache.org/docs/r2.8.5/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html it seems that we should attempt a regular upgrade (not a rolling upgrade), with the caveat that all the data written between upgrade and rollback might be lost. This seems to be the only way with a QJM, even if it feels a little strange.

The other option is that a rolling upgrade works with QJM but only if the gap between upgrade and rollback is limited (so the edit log is still available), but I didn't see any trace of this in the docs. I'll follow up with the bigtop mailing list to see if anybody has had the same experience.

For reference, I filed https://issues.apache.org/jira/browse/BIGTOP-3341 to ask for complete OpenSSL 1.1.1 support in BigTop 1.5 (so that no issues arise when we migrate to Buster).

Change 598450 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set BigTop repository config for the Hadoop Test cluster

https://gerrit.wikimedia.org/r/598450

Change 598450 merged by Elukey:
[operations/puppet@production] Set BigTop repository config for the Hadoop Test cluster

https://gerrit.wikimedia.org/r/598450

Upgraded a second time and failed with a different issue. This time I ended up with a lot of missing/under-replicated blocks, and ~7% of the total blocks corrupted.
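
As a side note, the block health numbers can be double-checked with a standard fsck run on the master (same kerberos wrapper used elsewhere in this task); a sketch:

  # Summarise missing/corrupt/under-replicated blocks across the cluster
  sudo -u hdfs kerberos-run-command hdfs hdfs fsck / | tail -n 30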

Judging from https://docs.cloudera.com/documentation/enterprise/5-15-x/topics/cdh_ig_earlier_cdh5_upgrade.html#topic_6_3_10 I think I didn't wait long enough before moving from the primary namenode bootstrap to the secondary, and then to the datanodes.

Will attempt a rollback and a roll forward :(

Change 598693 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set CDH repository back for the Hadoop test cluster

https://gerrit.wikimedia.org/r/598693

Change 598693 merged by Elukey:
[operations/puppet@production] Set CDH repository back for the Hadoop test cluster

https://gerrit.wikimedia.org/r/598693

I was able to roll back successfully, but the caveat is that the datanodes need to be rolled back as well. When the upgrade command is issued to the namenode, it does two things:

  1. saves a copy of the fs image as "previous" in a known location
  2. tells all the datanodes to do the same, using hard links (under /var/lib/hadoop/data/$letter/dn/current/BP-etc../ one can see a previous and a current directory, normally we have only current; a verification sketch follows the next list)

Then there are two possibilities:

  1. finalize the upgrade, so the previous state is discarded
  2. rollback, so the current state is discarded
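
A minimal sketch of how to verify this on a datanode, with $letter and the BP-... block pool id left as placeholders (as in the comment above):

  # A 'previous' directory appears next to 'current' only while an upgrade is pending
  sudo ls /var/lib/hadoop/data/$letter/dn/current/BP-*/

  # The block files under 'previous' are hard links to the same inodes as 'current',
  # so the pending-upgrade copy costs almost no extra disk space (link count > 1)
  sudo find /var/lib/hadoop/data/$letter/dn/current/BP-*/previous -type f -links +1 | head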

Change 605858 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set Bigtop for Hadoop test

https://gerrit.wikimedia.org/r/605858

Change 605858 merged by Elukey:
[operations/puppet@production] Set Bigtop for Hadoop test

https://gerrit.wikimedia.org/r/605858

Today I was able to roll back BigTop, and since the previous rollout attempt went fine, this is the first time that we do a back and forth without corrupting HDFS. To avoid losing any important data:

Upgrade from CDH to BigTop

=== safety steps ===

- merge the puppet change to use the BigTop repo and run puppet everywhere (so we'll have the packages ready to install later on with puppet disabled)

- disable puppet on all hosts
  sudo cumin 'analytics10[28-41]*  or an-tool1006*' 'disable-puppet "elukey - upgrading to bigtop"' 

- Stop Oozie, Hive, Presto and the timers
   sudo cumin 'analytics1030*' 'systemctl stop oozie'
   sudo cumin 'analytics1030*' 'systemctl stop hive-server2'
   sudo cumin 'analytics1030*' 'systemctl stop hive-metastore'
   sudo cumin 'analytics1030*' 'systemctl stop *.timer'
   sudo cumin 'analytics1030*' 'systemctl stop presto-server'

- unmount /mnt/hdfs
  sudo cumin 'an-tool1006*' 'umount /mnt/hdfs'
  sudo cumin 'analytics1030*' 'umount /mnt/hdfs'

- Stop all daemons like Hue, Jupyter, etc.
   sudo cumin 'analytics1039*' 'systemctl stop hue'
   sudo cumin 'an-tool1006*' 'systemctl stop jupyterhub' (for prod, do we also need to stop all the notebooks?)

- Check running jobs on workers
  sudo cumin 'A:hadoop-worker-test' 'ps aux | grep java| egrep -v "JournalNode|DataNode|NodeManager"'

- Check HDFS Active/Standby
   sudo cumin 'analytics1028*' 'sudo kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState analytics1028-eqiad-wmnet'
   sudo cumin 'analytics1028*' 'sudo kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState analytics1029-eqiad-wmnet'

- enter HDFS Safe mode
   sudo cumin 'analytics1028*' 'sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter'
   sudo cumin 'analytics1028*' 'sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace'

- Backup namenode dir on the HDFS master node
   sudo cumin 'analytics1028*' 'cd /var/lib/hadoop/name && tar -cvf /root/hadoop-namedir-backup-bigtop-upgrade-$(date +%s).tar .'

- backup each database that we are interested in separately (like hive_metastore, oozie, etc.). One giant backup is more difficult to restore.
sudo cumin 'analytics1030*' 'mysqldump hive_metastore > hive_metastore_$(date +%s).sql'
sudo cumin 'analytics1030*' 'mysqldump oozie > oozie_$(date +%s).sql'

- Don't upgrade Hue to the new packages, use the CDH ones for the moment.

== Procedure ==

Described in https://hadoop.apache.org/docs/r2.8.5/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html for HDFS
Stop the whole cluster as usual:
    - Yarn node managers + Yarn RM first
       sudo cumin 'A:hadoop-worker-test' 'systemctl stop hadoop-yarn-nodemanager'
       sudo cumin 'analytics1028*' 'systemctl stop hadoop-yarn-resourcemanager'
       sudo cumin 'analytics1029*' 'systemctl stop hadoop-yarn-resourcemanager'
    - all HDFS datanodes
       sudo cumin 'A:hadoop-worker-test' 'systemctl stop hadoop-hdfs-datanode' -b 1 -s 60
    - Secondary NN, Active NN down
      sudo cumin 'analytics1029*' 'systemctl stop hadoop-hdfs-namenode'
      sudo cumin 'analytics1029*' 'systemctl stop hadoop-hdfs-zkfc'
      sudo cumin 'analytics1028*' 'systemctl stop hadoop-mapreduce-historyserver'
    - JournalNodes
      sudo cumin 'A:hadoop-hdfs-journal-test' 'systemctl stop hadoop-hdfs-journalnode' -b 1 -s 60

- Run ps aux | grep java across all nodes to check whether any jvms are still running and whether that is ok (Druid, for example, is fine to keep running).

- Remove the Yarn zookeeper znodes (see the sketch below for how these commands are run):
    setAcl /yarn-rmstore/analytics-test-hadoop/ZKRMStateRoot world:anyone:cdrwa
    rmr /yarn-rmstore/analytics-test-hadoop/ZKRMStateRoot
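
  These znode commands are run from an interactive ZooKeeper CLI session; a minimal sketch, assuming the stock zkCli.sh client (the client path and the ensemble host are assumptions, not stated in this task):

    # Open a session against the ZooKeeper ensemble used by the test cluster
    /usr/share/zookeeper/bin/zkCli.sh -server <zookeeper-host>:2181
    # then, inside the CLI:
    #   setAcl /yarn-rmstore/analytics-test-hadoop/ZKRMStateRoot world:anyone:cdrwa
    #   rmr /yarn-rmstore/analytics-test-hadoop/ZKRMStateRoot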

- On all worker nodes:
    sudo cumin 'A:hadoop-worker-test' 'rm -rf /tmp/hadoop-yarn/*'
    sudo cumin 'A:hadoop-worker-test' "dpkg -l | grep cdh | awk '{print \$2}' | tr '\n' ' ' > /home/elukey/cdh_package_list"
     sudo cumin 'A:hadoop-worker-test' "apt-get remove -y \$(cat /home/elukey/cdh_package_list)"
     sudo cumin 'A:hadoop-worker-test' "apt-cache policy hadoop"
     sudo cumin 'A:hadoop-hdfs-journal-test' 'apt-get install -y `cat /home/elukey/cdh_package_list | tr " " "\n" | egrep -v "avro-libs|hadoop-0.20-mapreduce|kite|parquet|parquet-format|sentry" | tr "\n" " "`' -b 1 -s 60
     sudo cumin 'A:hadoop-worker-test and not A:hadoop-hdfs-journal-test'  'apt-get install -y `cat /home/elukey/cdh_package_list | tr " " "\n" | egrep -v "avro-libs|hadoop-0.20-mapreduce|kite|parquet|parquet-format|sentry" | tr "\n" " "`' -b 1 -s 60

- At this point, Yarn NM and Hadoop JN/DN should all be up again; verify:
   sudo cumin 'A:hadoop-worker-test' 'ps aux | grep java| egrep "JournalNode|DataNode|NodeManager" | grep -v egrep| wc -l'

- Upgrade packages on the master (analytics1028)
sudo cumin 'analytics102*' "dpkg -l | grep cdh | awk '{print \$2}' | tr '\n' ' ' > /home/elukey/cdh_package_list"
sudo cumin 'analytics102*' "apt-get remove -y \$(cat /home/elukey/cdh_package_list)"
sudo cumin 'analytics1028*' 'apt-get install -y `cat /home/elukey/cdh_package_list | tr " " "\n" | egrep -v "avro-libs|hadoop-0.20-mapreduce|kite|parquet|parquet-format|sentry" | tr "\n" " "`'

- Start the NN on the master with the -upgrade flag:
    sudo service hadoop-hdfs-namenode upgrade
- Start Yarn and the history server without any particular flag

Upgrade the packages on the standby (analytics1029)

sudo cumin 'analytics1029*' 'apt-get install -y `cat /home/elukey/cdh_package_list | tr " " "\n" | egrep -v "avro-libs|hadoop-0.20-mapreduce|kite|parquet|parquet-format|sentry" | tr "\n" " "`'
- Start the NN on the standby with the -bootstrapStandby flag:
    sudo cumin 'analytics1029*' 'sudo -u hdfs kerberos-run-command hdfs /usr/bin/hdfs namenode -bootstrapStandby'
    sudo cumin 'analytics1029*' 'systemctl start hadoop-hdfs-namenode'
-  Start Yarn without any particular flag

Upgrade the Hive metastore:
    /usr/lib/hive/bin/schematool -dbType mysql -upgradeSchemaFrom 1.1.0
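
Optionally, the metastore schema version can be checked before and after with schematool's info mode (a sketch, same binary as above):

    /usr/lib/hive/bin/schematool -dbType mysql -info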


Rollback from BigTop to CDH:

sudo cumin 'analytics1028*' 'echo Y | sudo -u hdfs kerberos-run-command hdfs hdfs namenode -rollback'
sudo cumin -m async 'A:hadoop-worker-test' 'systemctl stop hadoop-hdfs-datanode' 'service hadoop-hdfs-datanode rollback' 'systemctl start hadoop-hdfs-datanode' -b 1 -s 30

Change 606736 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] WIP - Add sre.hadoop.change-distro.py

https://gerrit.wikimedia.org/r/606736

Change 606736 merged by Elukey:
[operations/cookbooks@master] hadoop - Add change-distro.py and stop-cluster.py

https://gerrit.wikimedia.org/r/606736

Change 609436 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] sre.hadoop.stop-cluster.py: fix minor errors/details

https://gerrit.wikimedia.org/r/609436

Change 609436 merged by Elukey:
[operations/cookbooks@master] sre.hadoop.stop-cluster.py: fix minor errors/details

https://gerrit.wikimedia.org/r/609436

Change 609442 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] sre.hadoop.change-distro.py: fix misc details

https://gerrit.wikimedia.org/r/609442

Change 609442 merged by Elukey:
[operations/cookbooks@master] sre.hadoop.change-distro.py: fix misc details

https://gerrit.wikimedia.org/r/609442

Change 609452 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set BigTop for Hadoop master/standby/worker nodes.

https://gerrit.wikimedia.org/r/609452

Change 609452 merged by Elukey:
[operations/puppet@production] Set BigTop for Hadoop master/standby/worker nodes.

https://gerrit.wikimedia.org/r/609452

Change 609975 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] sre.hadoop.change-distro: improve procedure and logging

https://gerrit.wikimedia.org/r/609975

Change 609975 merged by Elukey:
[operations/cookbooks@master] sre.hadoop.change-distro: improve procedure and logging

https://gerrit.wikimedia.org/r/609975

The cookbooks seem to run fine, but sometimes during rollback I get instances of the following problem on the journalnodes:

2020-07-08 14:44:50,538 WARN org.apache.hadoop.hdfs.qjournal.server.GetJournalEditServlet: Received an invalid request file transfer request from 10.64.36.128: This node has namespaceId '0 and clusterId '' but the requesting node expected '2082959117' and 'CID-a9158735-e6ad-4da1-8caf-897f0d650a79'

The Namenode, right after the rollback command, cannot start because the journal nodes are in a weird state, as if they don't recognize the content of their /var/lib/hadoop/journal directory (even though, when they log this, the VERSION file contains the namespace and clusterId that the namenode expects). After a few restarts, they recognize the edit files and everything starts working.
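
A quick sketch of how to compare what a journal node and the namenode believe their IDs are (the VERSION file mentioned above lives under the journal's current/ directory; namenode path as used earlier in this task):

# On a journal node: namespaceID / clusterID as stored on disk
sudo cat /var/lib/hadoop/journal/analytics-test-hadoop/current/VERSION

# On the namenode: the values it expects
sudo cat /var/lib/hadoop/name/current/VERSION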

These are the logs when the journal nodes are in the weird state:

2020-07-08 14:38:02,145 INFO org.mortbay.log: Started SslSocketConnectorSecure@0.0.0.0:8481
2020-07-08 14:38:02,188 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 500
2020-07-08 14:38:02,206 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 8485
2020-07-08 14:38:02,463 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2020-07-08 14:38:02,463 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8485: starting
2020-07-08 14:40:15,954 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for hdfs/analytics1028.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)
2020-07-08 14:40:16,127 INFO org.apache.hadoop.hdfs.qjournal.server.JournalNode: Initializing journal in directory /var/lib/hadoop/journal/analytics-test-hadoop
2020-07-08 14:40:16,146 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /var/lib/hadoop/journal/analytics-test-hadoop/in_use.lock acquired by nodename 34988@analytics1038
2020-07-08 14:40:16,161 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Scanning storage FileJournalManager(root=/var/lib/hadoop/journal/analytics-test-hadoop)
2020-07-08 14:40:16,310 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Latest log is EditLogFile(file=/var/lib/hadoop/journal/analytics-test-hadoop/current/edits_0000000000006330469-0000000000006330469,first=0000000000006330469,last=0000000000006330469,inProgress=false,hasCorruptHeader=false)
2020-07-08 14:40:16,311 INFO org.apache.hadoop.hdfs.server.namenode.NNUpgradeUtil: Storage directory /var/lib/hadoop/journal/analytics-test-hadoop does not contain previous fs state.
2020-07-08 14:44:22,025 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for hdfs/analytics1028.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)
2020-07-08 14:44:22,035 INFO org.apache.hadoop.hdfs.server.namenode.NNUpgradeUtil: Storage directory /var/lib/hadoop/journal/analytics-test-hadoop does not contain previous fs state.
2020-07-08 14:44:46,942 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for hdfs/analytics1028.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)
2020-07-08 14:44:50,538 WARN org.apache.hadoop.hdfs.qjournal.server.GetJournalEditServlet: Received an invalid request file transfer request from 10.64.36.128: This node has namespaceId '0 and clusterId '' but the requesting node expected '2082959117' and 'CID-a9158735-e6ad-4da1-8caf-897f0d650a79'

Meanwhile the following logs are related to a good state (that allows the Namenode to start):

2020-07-08 14:59:26,975 INFO org.mortbay.log: Started SslSocketConnectorSecure@0.0.0.0:8481
2020-07-08 14:59:27,018 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 500
2020-07-08 14:59:27,036 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 8485
2020-07-08 14:59:27,314 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2020-07-08 14:59:27,314 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8485: starting
2020-07-08 15:00:08,067 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for hdfs/analytics1028.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)
2020-07-08 15:00:08,210 INFO org.apache.hadoop.hdfs.qjournal.server.JournalNode: Initializing journal in directory /var/lib/hadoop/journal/analytics-test-hadoop
2020-07-08 15:00:08,232 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /var/lib/hadoop/journal/analytics-test-hadoop/in_use.lock acquired by nodename 36134@analytics1038
2020-07-08 15:00:08,262 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Scanning storage FileJournalManager(root=/var/lib/hadoop/journal/analytics-test-hadoop)
2020-07-08 15:00:08,393 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Latest log is EditLogFile(file=/var/lib/hadoop/journal/analytics-test-hadoop/current/edits_0000000000006330469-0000000000006330469,first=0000000000006330469,last=0000000000006330469,inProgress=false,hasCorruptHeader=false)
2020-07-08 15:00:13,156 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for hdfs/analytics1028.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)
2020-07-08 15:00:13,224 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Updating lastPromisedEpoch from 34 to 35 for client /10.64.36.128
2020-07-08 15:00:13,230 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Scanning storage FileJournalManager(root=/var/lib/hadoop/journal/analytics-test-hadoop)
2020-07-08 15:00:13,279 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Latest log is EditLogFile(file=/var/lib/hadoop/journal/analytics-test-hadoop/current/edits_0000000000006330469-0000000000006330469,first=0000000000006330469,last=0000000000006330469,inProgress=false,hasCorruptHeader=false)
2020-07-08 15:00:13,351 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: getSegmentInfo(6330469): EditLogFile(file=/var/lib/hadoop/journal/analytics-test-hadoop/current/edits_0000000000006330469-0000000000006330469,first=0000000000006330469,last=0000000000006330469,inProgress=false,hasCorruptHeader=false) -> startTxId: 6330469 endTxId: 6330469 isInProgress: false

The main difference seems to be:

2020-07-08 14:40:16,311 INFO org.apache.hadoop.hdfs.server.namenode.NNUpgradeUtil: Storage directory /var/lib/hadoop/journal/analytics-test-hadoop does not contain previous fs state.

There is no mention in tutorials or in the init.d files of the possibility of rolling back a journalnode's state; maybe there is a step missing in the procedure?

I tried another upgrade and checked one of the journal nodes, finding a previous state:

elukey@analytics1031:~$ ls /var/lib/hadoop/journal/analytics-test-hadoop/
current  in_use.lock  previous

I didn't notice it when trying to fix the error during rollback, so the journal nodes must go through the rollback process by themselves and/or be guided by the Namenodes.

Change 610336 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] sre.hadoop: add logging and more backup actions

https://gerrit.wikimedia.org/r/610336

Change 610336 merged by Elukey:
[operations/cookbooks@master] sre.hadoop: add logging and more backup actions

https://gerrit.wikimedia.org/r/610336

Change 610721 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] sre.hadoop.change-distro: modify restart procedure and remove previous state

https://gerrit.wikimedia.org/r/610721

Change 610721 merged by Elukey:
[operations/cookbooks@master] sre.hadoop.change-distro: modify restart procedure and remove previous state

https://gerrit.wikimedia.org/r/610721

I've done another round of rollout/rollback, and I found the following interesting log:

2020-07-09 08:52:37,907 INFO org.apache.hadoop.hdfs.server.namenode.NNUpgradeUtil: Rollback of /var/lib/hadoop/journal/analytics-test-hadoop is complete.

This happens only when we issue the NN rollback command; the previous state is not removed before that (package downgrade, restarts, etc.). I found that completely stopping the JNs after the package downgrade, and then starting them again, helps with the spurious bug mentioned above. Updated the cookbooks; I will do another round of tests to verify that it is now better.
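
For reference, a hedged sketch of that extra step expressed as the equivalent manual cumin commands (aliases and units as used earlier in this task), rather than the cookbook code itself:

# After the package downgrade, fully stop the journal nodes, then start them
# again one at a time so they re-scan their storage directory cleanly
sudo cumin 'A:hadoop-hdfs-journal-test' 'systemctl stop hadoop-hdfs-journalnode'
sudo cumin 'A:hadoop-hdfs-journal-test' 'systemctl start hadoop-hdfs-journalnode' -b 1 -s 60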

Change 611392 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] sre.hadoop.change-distro.py: change logic for JN roll restart

https://gerrit.wikimedia.org/r/611392

Change 611392 merged by Elukey:
[operations/cookbooks@master] sre.hadoop.change-distro.py: change logic for JN roll restart

https://gerrit.wikimedia.org/r/611392

Instructions to upgrade the coordinator (i.e. the host where hive/oozie run):

dpkg -l | awk '/+cdh/ {print $2}' | tr '\n' ' ' > /root/cdh_package_list
apt-get remove -y `cat /root/cdh_package_list`
apt-get install -y `cat /root/cdh_package_list | tr ' ' '\n' | egrep -v 'avro-libs|hadoop-0.20-mapreduce|kite|parquet|parquet-format|sentry|flume-ng|pig' | tr '\n' ' '`
/usr/lib/hive/bin/schematool -dbType mysql -upgradeSchemaFrom 1.1.0

For the client (an-tool1006) it is sufficient to remove the cdh packages, run puppet and restart jupyterhub (we'll probably do something similar for the stat boxes).
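
A minimal sketch of those client steps (package-list pattern as above; run-puppet-agent is assumed to be the usual wrapper, a plain puppet agent run works too):

dpkg -l | awk '/+cdh/ {print $2}' | tr '\n' ' ' > /root/cdh_package_list
apt-get remove -y `cat /root/cdh_package_list`
run-puppet-agent        # puppet brings in the BigTop packages
systemctl restart jupyterhub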

Change 619466 had a related patch set uploaded (by Elukey; owner: Elukey):
[analytics/refinery@master] hive: quote all usages of percent/range words

https://gerrit.wikimedia.org/r/619466

elukey@analytics1028:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -finalizeUpgrade
Finalize upgrade successful for analytics1028.eqiad.wmnet/10.64.36.128:8020
Finalize upgrade successful for analytics1029.eqiad.wmnet/10.64.36.129:8020

Took a few seconds, no issues registered. In a bigger and more crowded HDFS environment it will probably take longer.

These are the logs that I found on the HDFS active namenode:

[...]
2020-08-11 08:50:41,450 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.64.36.133:50010 is added to blk_1074633847_893023{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-62fe8cb8-8071-488c-ab73-55ddfaae71a6:NORMAL:10.64.53.17:50010|RBW], ReplicaUnderConstruction[[DISK]DS-1a83f355-ef7a-44ca-b3c4-23bc90188567:NORMAL:10.64.36.134:50010|FINALIZED], ReplicaUnderConstruction[[DISK]DS-8f8cdb06-d494-48de-95b3-8452038beb40:NORMAL:10.64.36.133:50010|FINALIZED]]} size 0
2020-08-11 08:50:41,451 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.64.53.17:50010 is added to blk_1074633847_893023{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-1a83f355-ef7a-44ca-b3c4-23bc90188567:NORMAL:10.64.36.134:50010|FINALIZED], ReplicaUnderConstruction[[DISK]DS-8f8cdb06-d494-48de-95b3-8452038beb40:NORMAL:10.64.36.133:50010|FINALIZED], ReplicaUnderConstruction[[DISK]DS-0c8b6382-d655-4fb7-96c6-5a9282e58812:NORMAL:10.64.53.17:50010|FINALIZED]]} size 0
2020-08-11 10:15:45,468 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Recovering unfinalized segments in /var/lib/hadoop/name/current
2020-08-11 14:14:48,759 INFO org.apache.hadoop.hdfs.server.namenode.NNUpgradeUtil: Finalize upgrade for /var/lib/hadoop/name is complete.

And on the standby:

[...]
2020-08-11 08:51:44,504 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.64.53.17:50010 is added to blk_1074633847_893023{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-1a83f355-ef7a-44ca-b3c4-23bc90188567:NORMAL:10.64.36.134:50010|FINALIZED], ReplicaUnderConstruction[[DISK]DS-8f8cdb06-d494-48de-95b3-8452038beb40:NORMAL:10.64.36.133:50010|FINALIZED], ReplicaUnderConstruction[[DISK]DS-0c8b6382-d655-4fb7-96c6-5a9282e58812:NORMAL:10.64.53.17:50010|FINALIZED]]} size 0
2020-08-11 14:14:49,402 INFO org.apache.hadoop.hdfs.server.namenode.NNUpgradeUtil: Finalize upgrade for /var/lib/hadoop/name is complete.

And on one journal node:

[...]
2020-07-14 11:04:53,318 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Validating log segment /var/lib/hadoop/journal/analytics-test-hadoop/current/edits_inprogress_0000000000006341117 about to be finalized
2020-08-11 10:15:45,398 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Validating log segment /var/lib/hadoop/journal/analytics-test-hadoop/current/edits_inprogress_0000000000007483874 about to be finalized
2020-08-11 14:14:49,217 INFO org.apache.hadoop.hdfs.server.namenode.NNUpgradeUtil: Finalize upgrade for /var/lib/hadoop/journal/analytics-test-hadoop is complete.

Change 619466 merged by Nuria:
[analytics/refinery@master] hive: quote all usages of percent/range words

https://gerrit.wikimedia.org/r/619466

Spark2 seems unable to run any Yarn jobs; I see the following in the application's logs:

20/08/12 07:22:09 ERROR ApplicationMaster: RECEIVED SIGNAL TERM
20/08/12 07:22:09 WARN ScriptBasedMapping: Exception running /etc/hadoop/conf.analytics-test-hadoop/net-topology.sh 10.64.53.15
ExitCodeException exitCode=143:
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:575)
	at org.apache.hadoop.util.Shell.run(Shell.java:478)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:766)
	at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.runResolveCommand(ScriptBasedMapping.java:251)
	at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.resolve(ScriptBasedMapping.java:188)
	at org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:119)
	at org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:101)
	at org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:81)
	at org.apache.spark.deploy.yarn.SparkRackResolver.resolve(SparkRackResolver.scala:37)
	at org.apache.spark.deploy.yarn.YarnAllocator$$anon$1$$anonfun$run$1.apply(YarnAllocator.scala:422)
	at org.apache.spark.deploy.yarn.YarnAllocator$$anon$1$$anonfun$run$1.apply(YarnAllocator.scala:421)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.deploy.yarn.YarnAllocator$$anon$1.run(YarnAllocator.scala:421)

And on the Yarn node manager:

Full command array for failed execution:
[/usr/lib/hadoop-yarn/bin/container-executor, analytics, analytics, 1, application_1597140872253_0524, container_1597140872253_0524_01_000002, /var/lib/hadoop/data/g/yarn/local/usercache/analy
tics/appcache/application_1597140872253_0524/container_1597140872253_0524_01_000002, /var/lib/hadoop/data/m/yarn/local/nmPrivate/application_1597140872253_0524/container_1597140872253_0524_01_
000002/launch_container.sh, /var/lib/hadoop/data/g/yarn/local/nmPrivate/application_1597140872253_0524/container_1597140872253_0524_01_000002/container_1597140872253_0524_01_000002.tokens, /var/lib/hadoop/data/m/yarn/local/nmPrivate/application_1597140872253_0524/container_1597140872253_0524_01_000002/container_1597140872253_0524_01_000002.pid, /var/lib/hadoop/data/b/yarn/local%/var/lib/hadoop/data/e/yarn/local%/var/lib/hadoop/data/g/yarn/local%/var/lib/hadoop/data/i/yarn/local%/var/lib/hadoop/data/j/yarn/local%/var/lib/hadoop/data/l/yarn/local%/var/lib/hadoop/data/m/yarn/local, /var/lib/hadoop/data/b/yarn/logs%/var/lib/hadoop/data/e/yarn/logs%/var/lib/hadoop/data/g/yarn/logs%/var/lib/hadoop/data/i/yarn/logs%/var/lib/hadoop/data/j/yarn/logs%/var/lib/hadoop/data/l/yarn/logs%/var/lib/hadoop/data/m/yarn/logs, cgroups=none]
2020-08-12 08:10:33,467 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime: Launch container failed. Exception:
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: ExitCodeException exitCode=143:
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:177)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.launchContainer(DefaultLinuxContainerRuntime.java:107)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.launchContainer(DelegatingLinuxContainerRuntime.java:130)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:395)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

It seems as if the SparkRackResolver on every worker tries to run /etc/hadoop/conf.analytics-test-hadoop/net-topology.sh and gets an error from the container and/or the file system.

After a lot of debugging, the issue turned out to be related to the AM getting killed for using too much virtual memory:

2020-08-12 14:39:51,594 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=17413,containerID=container_1597140872253_0672_04_000001] is running beyond virtual memory limits. Current usage: 338.7 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used. Killing container.
[..]
Full command array for failed execution:
[/usr/lib/hadoop-yarn/bin/container-executor, analytics, analytics, 1, application_1597140872253_0672, container_1597140872253_0672_04_000001, /var/lib/hadoop/data/k/yarn/local/usercache/analytics/appcache/application_1597140872253_0672/container_1597140872253_0672_04_000001, /var/lib/hadoop/data/k/yarn/local/nmPrivate/application_1597140872253_0672/container_1597140872253_0672_04_000001/launch_container.sh, /var/lib/hadoop/data/h/yarn/local/nmPrivate/application_1597140872253_0672/container_1597140872253_0672_04_000001/container_1597140872253_0672_04_000001.tokens, /var/lib/hadoop/data/i/yarn/local/nmPrivate/application_1597140872253_0672/container_1597140872253_0672_04_000001/container_1597140872253_0672_04_000001.pid, /var/lib/hadoop/data/b/yarn/local%/var/lib/hadoop/data/c/yarn/local%/var/lib/hadoop/data/d/yarn/local%/var/lib/hadoop/data/e/yarn/local%/var/lib/hadoop/data/f/yarn/local%/var/lib/hadoop/data/g/yarn/local%/var/lib/hadoop/data/h/yarn/local%/var/lib/hadoop/data/i/yarn/local%/var/lib/hadoop/data/j/yarn/local%/var/lib/hadoop/data/k/yarn/local%/var/lib/hadoop/data/l/yarn/local%/var/lib/hadoop/data/m/yarn/local, /var/lib/hadoop/data/b/yarn/logs%/var/lib/hadoop/data/c/yarn/logs%/var/lib/hadoop/data/d/yarn/logs%/var/lib/hadoop/data/e/yarn/logs%/var/lib/hadoop/data/f/yarn/logs%/var/lib/hadoop/data/g/yarn/logs%/var/lib/hadoop/data/h/yarn/logs%/var/lib/hadoop/data/i/yarn/logs%/var/lib/hadoop/data/j/yarn/logs%/var/lib/hadoop/data/k/yarn/logs%/var/lib/hadoop/data/l/yarn/logs%/var/lib/hadoop/data/m/yarn/logs, cgroups=none]
2020-08-12 14:39:51,601 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime: Launch container failed. Exception:

Since spark2-shell and eventlogging_to_druid_navigationtiming_hourly in hadoop test both use spark in client mode (the latter for better alerting with timers/icinga), the following resolved the problem: --conf spark.yarn.am.memory=2g (the default is 512m).

My speculation is that on hadoop 2.8.5 more libs etc. are loaded in memory for the same container, and added on top of the spark ones they hit the container's limits. Adding spark.yarn.am.memory=2g to spark-defaults seems good from my perspective.
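
For example, a client-mode shell started with the bumped AM memory would look like this (a sketch; in client mode the driver runs locally and the Yarn AM gets spark.yarn.am.memory, which defaults to 512m):

spark2-shell --master yarn --conf spark.yarn.am.memory=2g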

Change 619788 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add yarn.nodemanager.vmem-pmem-ratio setting to Hadoop test

https://gerrit.wikimedia.org/r/619788

Change 619788 merged by Elukey:
[operations/puppet@production] Add yarn.nodemanager.vmem-pmem-ratio setting to Hadoop test

https://gerrit.wikimedia.org/r/619788

Setting yarn.nodemanager.vmem-pmem-ratio to 5.1 (roughly 5 to 1, up from the default 2.1) also seems to solve the issue!

After the last change, I have tested (from an-tool1006):

  • pyspark and spark shells
  • refine and druid indexation timers (all spark based, the former running in cluster mode and the latter in client mode)
  • pyspark via notebooks

Everything looks good, no errors registered. I have also checked that the Druid indexations were successful; no errors reported on that side either.

Overall it seems that there are no blockers with Bigtop, everything works as expected. The next step will be to have Joseph do another round of tests to find what I have missed :)