
Upgrade Analytics Cluster to Trusty, and then to CDH 5.3
Closed, Resolved · Public

Description

CDH 5.3 packages are now available for Trusty, woot! I'd like to upgrade all analytics machines to Trusty first, and then upgrade to CDH 5.3. I will use this task to track and plan work on both. I would like to do-release-upgrade these, as I have had a good experience doing this on the stat* boxes, and it will be much simpler than having to reinstall the OS on these nodes.

There are several services in the Analytics Cluster, and upgrades to those should be planned and done as units. Let's start from easiest to hardest.

analytics1003 and analytics1026 don't host any (real) production-critical services. Let's do those first. I should be able to upgrade those without much fanfare. Once those are done, let's do analytics1027. Once that is done, we will discuss the next steps.

Event Timeline

Ottomata created this task. Nov 10 2014, 4:53 PM
Ottomata claimed this task.
Ottomata raised the priority of this task from to Normal.
Ottomata updated the task description. (Show Details)
Ottomata added projects: acl*sre-team, Analytics.
Ottomata changed Security from none to None.
Ottomata added subscribers: Ottomata, QChris, Gage.
Ottomata added a comment. Edited Nov 13 2014, 4:53 PM

I'm starting the upgrade of analytics1026 now. Process:

  • schedule downtime in icinga
  • disable puppet
  • start udp2log instance running sqstat on stat1002
/usr/bin/udp2log --config-file=/home/otto/udp2log.sqstat.conf --pid-file /tmp/udp2log.sqstat.pid -p 8420 --multicast 233.58.59.1 --recv-queue=524288
  • stop udp2log on analytics1026
  • Do the upgrade
apt-get update
apt-get upgrade
apt-get install update-manager-core
export http_proxy=http://webproxy.eqiad.wmnet:8080
do-release-upgrade
  • stop udp2log on stat1002
  • make sure udp2log and sqstat are running on analytics1026

analytics1026 is done.

Starting on analytics1003. Same plan as above, but without the udp2log bits. There are no prod services on analytics1003.

analytics1003 is done.

Going to start on Zookeepers now: analytics1023,1024,1025. Will do them in that order, one at a time.

Process is simple: just do the upgrade and reboot.
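The per-node process is the same do-release-upgrade sequence used on analytics1026 and analytics1003 above, sketched here for the ZooKeeper hosts (run as root, one host at a time):

```shell
# Trusty upgrade on a single ZooKeeper host (analytics1023/1024/1025),
# same steps as the earlier nodes, followed by a reboot.
apt-get update
apt-get upgrade
apt-get install update-manager-core
# apt needs the webproxy to reach external mirrors from this network
export http_proxy=http://webproxy.eqiad.wmnet:8080
do-release-upgrade
reboot
```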

I just finished analytics1023, but accidentally upgraded from zookeeper 3.3.5 -> 3.4.5 in the process. I had forgotten that we don't use the CDH zookeeper package, but the ones from Ubuntu/Debian. So, an apt-get upgrade upgraded it.

Everything looks fine, but just in case I will wait until tomorrow before I continue with the other two.

Things are looking fine. I'm going to continue with analytics1024 and then analytics1025. Once they are upgraded I will modify puppet so it is happy again.
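A quick way to sanity-check the upgraded ZooKeeper servers (a sketch; the standard client port 2181 and the availability of `nc` are assumptions):

```shell
# Four-letter-word health checks against each ZooKeeper server.
# A healthy server answers "imok" to ruok; "stat" reports its Mode
# (leader/follower), confirming the ensemble re-formed after the reboot.
for host in analytics1023 analytics1024 analytics1025; do
  echo "== $host =="
  echo ruok | nc "$host" 2181
  echo stat | nc "$host" 2181 | grep Mode
done
```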

analytics1024 is in progress.

Something strange happened with analytics1003 (a Cisco) overnight. It was down this morning, and the console showed it stuck in a reboot loop. I power cycled it and it came back up fine. However, there was a "Diskfilter writes are not supported" warning on boot. I googled and came up with this:

http://askubuntu.com/questions/468466/why-this-occurs-error-diskfilter-writes-are-not-supported/498281#498281, which eventually leads to https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=754921.

I applied this patch on analytics1003 and rebooted, and no longer saw the warning.

I just noticed the same diskfilter warning on analytics1024. I think we should apply this to each node after we upgrade. I've attached the patched 00_header file.

Steps are to download this file and then:

mv /etc/grub.d/00_header /etc/grub.d/00_header.debian-bug-754921.orig
mv grub_header_patch.debian-bug.754921 /etc/grub.d/00_header
chmod -x /etc/grub.d/00_header.debian-bug-754921.orig; chmod +x /etc/grub.d/00_header
update-grub

analytics1024 is done, moving on to analytics1025.

analytics1025 is done.

Kafkas are next: analytics1012, 1018, 1021, 1022. Will do them one at a time in that order.

Actually, on second thought, there is no immediate need to upgrade the Kafka servers to Trusty, and to do so I would have to build a Trusty version of the .deb.

Kafka 0.8.2 will be released very soon. To minimize work, I think I will wait until 0.8.2 is out and coordinate the upgrade to Trusty with the upgrade of Kafka itself.

Next up is the Hadoop cluster then. Since it is Friday, I will wait until next week before continuing.

Ottomata added a comment. Edited Nov 17 2014, 5:43 PM

I will continue with the upgrade to Trusty for Hadoop Workers this week. I'm starting on them now, beginning with analytics1013. The process is:

apt-get update
apt-get upgrade
apt-get install update-manager-core
export http_proxy=http://webproxy.eqiad.wmnet:8080
do-release-upgrade

I did analytics1013 and 1014 yesterday. Starting with 1015 today.

Done today:

1015
1016
1017
1019
1020
1029
1030
1031
1032

Something is weird with 1033's console. I will come back to it.

I noticed 1033's console problem too, and created an RT ticket: https://rt.wikimedia.org/Ticket/Display.html?id=8858

Completed today:

1034
1035
1036
1037
1038
1039
1040
1041

I need to figure out what's up with 1033's console. Then I can proceed with analytics1027. Then I need to think about namenodes. I'm worried about those Ciscos!

Ottomata raised the priority of this task from Normal to High. Dec 1 2014, 3:28 PM
QChris added a comment. Dec 2 2014, 1:15 PM

Since this has been bumped in priority, what do we do with failing nodes such as analytics1003?

Uh, maybe this is normal priority. :p

I am waiting for Chris Johnson to take a look.

Ottomata lowered the priority of this task from High to Normal. Dec 2 2014, 2:03 PM

Waiting to catch another kernel panic on analytics1003, and then I will attempt a full reinstall. Current panic output is:

CISCO Serial Over LAN:
Close Network Connection to Exit
[107649.149325] BUG: unable to handle kernel NULL pointer dereference at 00000000000001be
[107649.158194] IP: [<ffffffff811a2285>] kmem_cache_alloc+0x75/0x1e0
[107649.165022] PGD 2fe614a067 PUD 2fe623e067 PMD 0
[107649.170317] Oops: 0000 [#12] SMP
[107649.174143] Modules linked in: 8021q garp stp mrp llc gpio_ich intel_powerclamp coretemp joydev kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd i7core_edac ioatdma edac_core lpc_ich dca acpi_power_meter ipmi_si mac_hid lp parport raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 hid_generic lpfc mptsas raid0 mptscsih multipath scsi_transport_fc usbhid mptbase linear usb_storage hid scsi_transport_sas scsi_tgt bnx2
[107649.228195] CPU: 21 PID: 32765 Comm: diamond Tainted: G      D   I   3.13.0-39-generic #66-Ubuntu
[107649.238200] Hardware name: Cisco Systems Inc R250-2480805/R250-2480805, BIOS C250.1.4.1b.0.080420110259 08/04/2011
[107649.249853] task: ffff882feafab000 ti: ffff882fe8e92000 task.ti: ffff882fe8e92000
[107649.258305] RIP: 0010:[<ffffffff811a2285>]  [<ffffffff811a2285>] kmem_cache_alloc+0x75/0x1e0
[107649.267823] RSP: 0018:ffff882fe8e93ec8  EFLAGS: 00010286
[107649.273852] RAX: 0000000000000000 RBX: ffffffffffffffea RCX: 0000000000190ffa
[107649.281916] RDX: 0000000000190ff9 RSI: 00000000000000d0 RDI: ffff88180f803800
[107649.289971] RBP: ffff882fe8e93ef8 R08: 0000000000017280 R09: ffffffff81090826
[107649.298034] R10: 0000000000001fc0 R11: 0000000000000246 R12: 00000000000001be
[107649.306098] R13: 00000000000000d0 R14: ffff88180f803800 R15: ffff88180f803800
[107649.314163] FS:  00007f335ddf0700(0000) GS:ffff88306fd40000(0000) knlGS:0000000000000000
[107649.323296] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[107649.329808] CR2: 00000000000001be CR3: 0000002fe8812000 CR4: 00000000000007e0
[107649.337873] Stack:
[107649.340217]  ffffffff81090826 ffffffffffffffea ffff882feafab000 00007f33619b2758
[107649.348616]  0000000000000004 000000000143ccb9 ffff882fe8e93f18 ffffffff81090826
[107649.357022]  ffffffffffffffea 00007f3354001150 ffff882fe8e93f78 ffffffff811bb87a
[107649.365419] Call Trace:
[107649.368258]  [<ffffffff81090826>] ? prepare_creds+0x26/0x1c0
[107649.374677]  [<ffffffff81090826>] prepare_creds+0x26/0x1c0
[107649.380901]  [<ffffffff811bb87a>] SyS_access+0x3a/0x1f0
[107649.386840]  [<ffffffff8172b15a>] ? do_page_fault+0x1a/0x70
[107649.393161]  [<ffffffff81728115>] ? do_device_not_available+0x35/0x50
[107649.400451]  [<ffffffff8172f7ed>] system_call_fastpath+0x1a/0x1f
[107649.407255] Code: dd 00 00 49 8b 50 08 4d 8b 20 49 8b 40 10 4d 85 e4 0f 84 17 01 00 00 48 85 c0 0f 84 0e 01 00 00 49 63 46 20 48 8d 4a 01 4d 8b 06 <49> 8b 1c 04 4c 89 e0 65 49 0f c7 08 0f 94 c0 84 c0 74 b9 49 63
[107649.429252] RIP  [<ffffffff811a2285>] kmem_cache_alloc+0x75/0x1e0
[107649.436170]  RSP <ffff882fe8e93ec8>
[107649.440153] CR2: 00000000000001be
[107649.444658] ---[ end trace d095b99b16180cb8 ]---
[107649.461632] BUG: unable to handle kernel NULL pointer dereference at 00000000000001be
[107649.470504] IP: [<ffffffff811a2470>] kmem_cache_alloc_trace+0x80/0x1f0
[107649.477909] PGD 2fe614a067 PUD 2fe623e067 PMD 0
[107649.483196] Oops: 0000 [#13] SMP
[107649.487022] Modules linked in: 8021q garp stp mrp llc gpio_ich intel_powerclamp coretemp joydev kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd i7core_edac ioatdma edac_core lpc_ich dca acpi_power_meter ipmi_si mac_hid lp parport raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 hid_generic lpfc mptsas raid0 mptscsih multipath scsi_transport_fc usbhid mptbase linear usb_storage hid scsi_transport_sas scsi_tgt bnx2
[107649.541067] CPU: 21 PID: 32766 Comm: diamond Tainted: G      D   I   3.13.0-39-generic #66-Ubuntu
[107649.551072] Hardware name: Cisco Systems Inc R250-2480805/R250-2480805, BIOS C250.1.4.1b.0.080420110259 08/04/2011
[1076

analytics1027 done. Hue didn't work outright because it ships with its own virtualenv at /usr/lib/hue/build/env. I had to symlink /usr/lib/hue/build/env/bin/python2.7 -> /usr/lib/python2.7 for it to work properly. I did not puppetize this, because it should be fixed with the new Hue in CDH 5.2 (seeing as it comes in a Trusty package).
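For reference, the workaround as described (the link target is copied from the comment above, not independently verified — a one-off fix, deliberately not puppetized):

```shell
# Point Hue's bundled virtualenv python at the system python2.7 path,
# as described above. -f overwrites the broken link left by the upgrade.
sudo ln -sf /usr/lib/python2.7 /usr/lib/hue/build/env/bin/python2.7
```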

Hm, might want to chill on the CDH 5.2 upgrade until this is fixed: https://issues.apache.org/jira/browse/HDFS-7489

From: https://wikitech.wikimedia.org/wiki/Distribution_upgrades

"python-diamond has transitioned to diamond, after the trusty-wikimedia repo has been re-enabled upgrade with:

apt-get update ; apt-get -y install diamond python-diamond"

Will run this on all nodes after they are upgraded.
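A sketch of running that diamond transition across the cluster (the hosts.txt list is hypothetical; substitute however you enumerate the upgraded nodes):

```shell
# Run the python-diamond -> diamond transition on every upgraded node.
# hosts.txt is a hypothetical one-hostname-per-line list of cluster hosts.
while read -r host; do
  echo "== $host =="
  ssh "$host" 'sudo apt-get update && sudo apt-get -y install diamond python-diamond'
done < hosts.txt
```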

Done today:

1033
1011
1019
1028

Only the Kafka Brokers (will be tracked in a different ticket) and the new Hadoop Namenodes are left.

Ottomata renamed this task from Upgrade Analytics Cluster to Trusty, and then to CDH 5.2 to Upgrade Analytics Cluster to Trusty, and then to CDH 5.3. Jan 16 2015, 4:51 PM

analytics1001 and analytics1002 have been provisioned, and the Hadoop NameNode and YARN master services have been migrated off of analytics1010 and analytics1004 (Ciscos). The older Ciscos can now be decommissioned.

This means that all Hadoop NameNodes are now running on Trusty, woo!

I can now practice migrating to CDH 5.3, and puppetize it. I will do so first in Vagrant, then in labs, before attempting to upgrade the production cluster.

Ottomata updated the task description. (Show Details) Feb 11 2015, 3:32 PM
Ottomata added a comment. Edited Feb 11 2015, 11:22 PM

Today I practiced this in Vagrant and in Labs. I'd like to do it one more time in labs. My preliminary procedure will be this:

(compiled from http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_earlier_cdh5_upgrade.html)

 0. apt-get update everywhere

 1. stop puppet everywhere, unmount /mnt/hdfs on stat1002 and analytics1027 just to be safe: 
    sudo umount /mnt/hdfs

 2. snapshot namenode metadata
    sudo -u hdfs hdfs dfsadmin -safemode enter
    sudo -u hdfs hdfs dfsadmin -saveNamespace

 3. Stop all hadoop services everywhere.
    for x in `cd /etc/init.d ; ls hadoop-*` ; do sudo service $x stop ; done

 4. backup namenode metadata
   sudo tar -cvf  ~/hadoop-namedir-backup.$(hostname).$(date +%s) /var/lib/hadoop/name

 5. Install new packages everywhere:
 
    # analytics1001 - YARN master and Primary NameNode
    apt-get install hadoop-yarn-resourcemanager hadoop-hdfs-namenode hadoop-mapreduce-historyserver hadoop-httpfs
    
    # analytics1002 - Standby NameNode
    apt-get install hadoop-hdfs-namenode
    
    # JournalNodes
    apt-get install hadoop-yarn-nodemanager hadoop-hdfs-datanode hadoop-mapreduce hadoop-hdfs-journalnode hive-hcatalog parquet parquet-format
    
    # Workers
    apt-get install hadoop-yarn-nodemanager hadoop-hdfs-datanode hadoop-mapreduce hive-hcatalog parquet parquet-format
    
    # Clients
    apt-get install hadoop-client hive-hcatalog hive mahout pig sqoop oozie-client parquet parquet-format
 
    # analytics1027
    apt-get install hadoop-client hive-hcatalog hive hive-metastore hive-server2 mahout pig sqoop oozie parquet parquet-format
 
 6. Start Journalnodes
    sudo service hadoop-hdfs-journalnode start
    # Important: in an HA deployment, the JournalNodes must be up and running CDH 5 before you proceed.
 
 7. Upgrade NameNode metadata on analytics1001:
    sudo service hadoop-hdfs-namenode upgrade
    sudo tail -f /var/log/hadoop-hdfs/hadoop-hdfs-namenode-<hostname>.log 
    # Look for a line that confirms the upgrade is complete, such as: /var/lib/hadoop-hdfs/cache/hadoop/dfs/<name> is complete
    # The NameNode upgrade process can take a while, depending on how many files you have.
 
 8. Bootstrap Standby NameNode on analytics1002:
    sudo -u hdfs hdfs namenode -bootstrapStandby
    sudo service hadoop-hdfs-namenode start
 
 9. Start all DataNodes
    sudo service hadoop-hdfs-datanode start

10. Start YARN resourcemanager on analytics1001:
    sudo service hadoop-yarn-resourcemanager start

11. Start all NodeManagers:
    sudo service hadoop-yarn-nodemanager start

12. Start MapReduce JobHistory Server on analytics1001:
    sudo service hadoop-mapreduce-historyserver start 

13. Upgrade Hive - http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_hive_upgrade.html
    # on analytics1027
    sudo service hive-server2 stop
    sudo service hive-metastore stop
    # backup MySQL db
    sudo mysqldump -u root hive_metastore > ~/hive_metastore-backup.$(date +%s).sql
    # Upgrade metastore db
    sudo /usr/lib/hive/bin/schematool -dbType mysql -upgradeSchemaFrom 0.12.0
    sudo service hive-metastore start
    sudo service hive-server2 start
    
14. Upgrade Oozie
    # on analytics1027
    sudo service oozie stop
    # backup MySQL db
    sudo mysqldump -u root oozie > ~/oozie-backup.$(date +%s).sql
    sudo apt-get install oozie oozie-client
    # Upgrade the oozie db
    sudo -u oozie /usr/lib/oozie/bin/ooziedb.sh upgrade -run
    # Upgrade the Oozie Shared Library
    sudo -u oozie hadoop fs -rmr /user/oozie/share
    sudo oozie-setup sharelib create -fs hdfs://analytics-hadoop -locallib /usr/lib/oozie/oozie-sharelib-yarn.tar.gz
    sudo service oozie start

15. Upgrade Hue
    # on analytics1027
    sudo service hue stop
    sudo apt-get install hue

16. Finalize the HDFS Metadata upgrade
    # on analytics1001
    sudo -u hdfs hdfs dfsadmin -finalizeUpgrade

(Hopefully) DONE!
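Before the final `-finalizeUpgrade` in step 16 (which is irreversible), it's worth confirming the cluster came back healthy; a sketch using stock HDFS/YARN tooling:

```shell
# On analytics1001: confirm all DataNodes re-registered and no blocks
# are missing before finalizing the metadata upgrade.
sudo -u hdfs hdfs dfsadmin -report | head -n 20
# fsck should end with: The filesystem under path '/' is HEALTHY
sudo -u hdfs hdfs fsck / | tail -n 20
# Confirm NodeManagers rejoined YARN.
yarn node -list
```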

Alright! Some oozie jobs are busy backfilling the time they were offline, but everything is looking good. I'm going to wait a day or two to make sure everything is fine, but thus far I think we are done.

Ottomata closed this task as Resolved. Feb 24 2015, 3:05 PM

All is good!