
Upgrade Analytics Cluster to Trusty, and then to CDH 5.3
Closed, Resolved · Public

Description

CDH 5.3 packages are now available for Trusty, woot! I'd like to upgrade all analytics machines to Trusty first, and then upgrade to CDH 5.3. I will use this task to track and plan work on both. I would like to do-release-upgrade these, as I have had a good experience doing this on the stat* boxes, and it will be much simpler than having to reinstall the OS on these nodes.

There are several services in the Analytics Cluster, and upgrades to those should be planned and done as units. Let's start from easiest to hardest.

analytics1003 and analytics1026 don't host any (real) production-critical services. Let's do those first. I should be able to upgrade those without much fanfare. Once those are done, let's do analytics1027. Once that is done, we will discuss the next steps.

Event Timeline

Ottomata created this task. Nov 10 2014, 4:53 PM
Ottomata claimed this task.
Ottomata raised the priority of this task from to Normal.
Ottomata updated the task description. (Show Details)
Ottomata added projects: acl*sre-team, Analytics.
Ottomata changed Security from none to None.
Ottomata added subscribers: Ottomata, QChris, Gage.
Ottomata added a comment. Edited Nov 13 2014, 4:53 PM

I'm starting the upgrade of analytics1026 now. Process:

  • schedule downtime in icinga
  • disable puppet
  • start udp2log instance running sqstat on stat1002
/usr/bin/udp2log --config-file=/home/otto/udp2log.sqstat.conf --pid-file /tmp/udp2log.sqstat.pid -p 8420 --multicast 233.58.59.1 --recv-queue=524288
  • stop udp2log on analytics1026
  • Do the upgrade
apt-get update
apt-get upgrade
apt-get install update-manager-core
export http_proxy=http://webproxy.eqiad.wmnet:8080
do-release-upgrade
  • stop udp2log on stat1002
  • make sure udp2log and sqstat are running on analytics1026

analytics1026 is done.

Starting on analytics1003. Same plan as above, but without the udp2log bits. There are no prod services on analytics1003.

analytics1003 is done.

Going to start on Zookeepers now: analytics1023,1024,1025. Will do them in that order, one at a time.

Process is simple: just do the upgrade and reboot.
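The per-node process is the same do-release-upgrade sequence used on analytics1026 and analytics1003 above, sketched here for the ZooKeeper hosts (run as root, one host at a time):

```shell
# Trusty upgrade on a single ZooKeeper host (analytics1023/1024/1025),
# same steps as the earlier nodes, followed by a reboot.
apt-get update
apt-get upgrade
apt-get install update-manager-core
# apt needs the webproxy to reach external mirrors from this network
export http_proxy=http://webproxy.eqiad.wmnet:8080
do-release-upgrade
reboot
```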

I just finished analytics1023, but accidentally upgraded from zookeeper 3.3.5 -> 3.4.5 in the process. I had forgotten that we don't use the CDH zookeeper package, but the ones from Ubuntu/Debian. So, an apt-get upgrade upgraded it.

Everything looks fine, but just in case I will wait until tomorrow before I continue with the other two.

Things are looking fine. I'm going to continue with analytics1024 and then analytics1025. Once they are upgraded I will modify puppet so it is happy again.
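A quick way to sanity-check the upgraded ZooKeeper servers (a sketch; the standard client port 2181 and the availability of `nc` are assumptions):

```shell
# Four-letter-word health checks against each ZooKeeper server.
# A healthy server answers "imok" to ruok; "stat" reports its Mode
# (leader/follower), confirming the ensemble re-formed after the reboot.
for host in analytics1023 analytics1024 analytics1025; do
  echo "== $host =="
  echo ruok | nc "$host" 2181
  echo stat | nc "$host" 2181 | grep Mode
done
```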

analytics1024 is in progress.

Something strange happened with analytics1003 (a Cisco) overnight. It was down this morning, and the console showed it stuck in a reboot loop. I power cycled it and it came back up fine. However, there was a "Diskfilter writes are not supported" warning on boot. I googled and came up with this:

http://askubuntu.com/questions/468466/why-this-occurs-error-diskfilter-writes-are-not-supported/498281#498281, which eventually leads to https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=754921.

I applied this patch on analytics1003 and rebooted, and no longer saw the warning.

I just noticed the same diskfilter warning on analytics1024. I think we should apply this to each node after we upgrade. I've attached the patched 00_header file.

Steps are to download this file and then:

mv /etc/grub.d/00_header /etc/grub.d/00_header.debian-bug-754921.orig
mv grub_header_patch.debian-bug.754921 /etc/grub.d/00_header
chmod -x /etc/grub.d/00_header.debian-bug-754921.orig; chmod +x /etc/grub.d/00_header
update-grub

analytics1024 is done, moving on to analytics1025.

analytics1025 is done.

Kafkas are next: analytics1012, 1018, 1021, 1022. Will do them one at a time in that order.

Actually, on second thought, there is no immediate need to upgrade the Kafka servers to Trusty, and to do so I would have to build a Trusty version of the .deb.

Kafka 0.8.2 will be released very soon. To minimize work, I think I will wait until 0.8.2 is out and coordinate the upgrade to Trusty with the upgrade of Kafka itself.

Next up is the Hadoop cluster then. Since it is Friday, I will wait until next week before continuing.

Ottomata added a comment. Edited Nov 17 2014, 5:43 PM

I will continue with the upgrade to Trusty for Hadoop Workers this week. I'm starting on them now, beginning with analytics1013. The process is:

apt-get update
apt-get upgrade
apt-get install update-manager-core
export http_proxy=http://webproxy.eqiad.wmnet:8080
do-release-upgrade

I did analytics1013 and 1014 yesterday. Starting with 1015 today.

Done today:

1015
1016
1017
1019
1020
1029
1030
1031
1032

Something is weird with 1033's console. I will come back to it.

I noticed 1033's console problem too, and created an RT ticket: https://rt.wikimedia.org/Ticket/Display.html?id=8858

Completed today:

1034
1035
1036
1037
1038
1039
1040
1041

I need to figure out what's up with 1033's console. Then I can proceed with analytics1027. Then I need to think about namenodes. I'm worried about those Ciscos!

Ottomata raised the priority of this task from Normal to High. Dec 1 2014, 3:28 PM
QChris added a comment. Dec 2 2014, 1:15 PM

Since this has been bumped in priority, what do we do with failing nodes such as analytics1003?

Uh, maybe this is normal priority. :p

I am waiting for Chris Johnson to take a look.

Ottomata lowered the priority of this task from High to Normal. Dec 2 2014, 2:03 PM

Waiting to catch another kernel panic on analytics1003, and then I will attempt a full reinstall. Current panic output is:

CISCO Serial Over LAN:
Close Network Connection to Exit
[107649.149325] BUG: unable to handle kernel NULL pointer dereference at 00000000000001be
[107649.158194] IP: [<ffffffff811a2285>] kmem_cache_alloc+0x75/0x1e0
[107649.165022] PGD 2fe614a067 PUD 2fe623e067 PMD 0
[107649.170317] Oops: 0000 [#12] SMP
[107649.174143] Modules linked in: 8021q garp stp mrp llc gpio_ich intel_powerclamp coretemp joydev kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd i7core_edac ioatdma edac_core lpc_ich dca acpi_power_meter ipmi_si mac_hid lp parport raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 hid_generic lpfc mptsas raid0 mptscsih multipath scsi_transport_fc usbhid mptbase linear usb_storage hid scsi_transport_sas scsi_tgt bnx2
[107649.228195] CPU: 21 PID: 32765 Comm: diamond Tainted: G      D   I   3.13.0-39-generic #66-Ubuntu
[107649.238200] Hardware name: Cisco Systems Inc R250-2480805/R250-2480805, BIOS C250.1.4.1b.0.080420110259 08/04/2011
[107649.249853] task: ffff882feafab000 ti: ffff882fe8e92000 task.ti: ffff882fe8e92000
[107649.258305] RIP: 0010:[<ffffffff811a2285>]  [<ffffffff811a2285>] kmem_cache_alloc+0x75/0x1e0
[107649.267823] RSP: 0018:ffff882fe8e93ec8  EFLAGS: 00010286
[107649.273852] RAX: 0000000000000000 RBX: ffffffffffffffea RCX: 0000000000190ffa
[107649.281916] RDX: 0000000000190ff9 RSI: 00000000000000d0 RDI: ffff88180f803800
[107649.289971] RBP: ffff882fe8e93ef8 R08: 0000000000017280 R09: ffffffff81090826
[107649.298034] R10: 0000000000001fc0 R11: 0000000000000246 R12: 00000000000001be
[107649.306098] R13: 00000000000000d0 R14: ffff88180f803800 R15: ffff88180f803800
[107649.314163] FS:  00007f335ddf0700(0000) GS:ffff88306fd40000(0000) knlGS:0000000000000000
[107649.323296] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[107649.329808] CR2: 00000000000001be CR3: 0000002fe8812000 CR4: 00000000000007e0
[107649.337873] Stack:
[107649.340217]  ffffffff81090826 ffffffffffffffea ffff882feafab000 00007f33619b2758
[107649.348616]  0000000000000004 000000000143ccb9 ffff882fe8e93f18 ffffffff81090826
[107649.357022]  ffffffffffffffea 00007f3354001150 ffff882fe8e93f78 ffffffff811bb87a
[107649.365419] Call Trace:
[107649.368258]  [<ffffffff81090826>] ? prepare_creds+0x26/0x1c0
[107649.374677]  [<ffffffff81090826>] prepare_creds+0x26/0x1c0
[107649.380901]  [<ffffffff811bb87a>] SyS_access+0x3a/0x1f0
[107649.386840]  [<ffffffff8172b15a>] ? do_page_fault+0x1a/0x70
[107649.393161]  [<ffffffff81728115>] ? do_device_not_available+0x35/0x50
[107649.400451]  [<ffffffff8172f7ed>] system_call_fastpath+0x1a/0x1f
[107649.407255] Code: dd 00 00 49 8b 50 08 4d 8b 20 49 8b 40 10 4d 85 e4 0f 84 17 01 00 00 48 85 c0 0f 84 0e 01 00 00 49 63 46 20 48 8d 4a 01 4d 8b 06 <49> 8b 1c 04 4c 89 e0 65 49 0f c7 08 0f 94 c0 84 c0 74 b9 49 63
[107649.429252] RIP  [<ffffffff811a2285>] kmem_cache_alloc+0x75/0x1e0
[107649.436170]  RSP <ffff882fe8e93ec8>
[107649.440153] CR2: 00000000000001be
[107649.444658] ---[ end trace d095b99b16180cb8 ]---
[107649.461632] BUG: unable to handle kernel NULL pointer dereference at 00000000000001be
[107649.470504] IP: [<ffffffff811a2470>] kmem_cache_alloc_trace+0x80/0x1f0
[107649.477909] PGD 2fe614a067 PUD 2fe623e067 PMD 0
[107649.483196] Oops: 0000 [#13] SMP
[107649.487022] Modules linked in: 8021q garp stp mrp llc gpio_ich intel_powerclamp coretemp joydev kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd i7core_edac ioatdma edac_core lpc_ich dca acpi_power_meter ipmi_si mac_hid lp parport raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 hid_generic lpfc mptsas raid0 mptscsih multipath scsi_transport_fc usbhid mptbase linear usb_storage hid scsi_transport_sas scsi_tgt bnx2
[107649.541067] CPU: 21 PID: 32766 Comm: diamond Tainted: G      D   I   3.13.0-39-generic #66-Ubuntu
[107649.551072] Hardware name: Cisco Systems Inc R250-2480805/R250-2480805, BIOS C250.1.4.1b.0.080420110259 08/04/2011
[1076

analytics1027 done. Hue didn't work outright because it ships with its own virtualenv at /usr/lib/hue/build/env. I had to symlink /usr/lib/hue/build/env/bin/python2.7 -> /usr/lib/python2.7 for it to work properly. I did not puppetize this, because it should be fixed with the new Hue in CDH 5.2 (seeing as it comes in a Trusty package).
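For reference, the workaround as described (the link target is copied from the comment above, not independently verified — a one-off fix, deliberately not puppetized):

```shell
# Point Hue's bundled virtualenv python at the system python2.7 path,
# as described above. -f overwrites the broken link left by the upgrade.
sudo ln -sf /usr/lib/python2.7 /usr/lib/hue/build/env/bin/python2.7
```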

Hm, might want to chill on the CDH 5.2 upgrade until this is fixed: https://issues.apache.org/jira/browse/HDFS-7489

From: https://wikitech.wikimedia.org/wiki/Distribution_upgrades

"python-diamond has transitioned to diamond, after the trusty-wikimedia repo has been re-enabled upgrade with:

apt-get update ; apt-get -y install diamond python-diamond"

Will run this on all nodes after they are upgraded.
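A sketch of running that diamond transition across the cluster (the hosts.txt list is hypothetical; substitute however you enumerate the upgraded nodes):

```shell
# Run the python-diamond -> diamond transition on every upgraded node.
# hosts.txt is a hypothetical one-hostname-per-line list of cluster hosts.
while read -r host; do
  echo "== $host =="
  ssh "$host" 'sudo apt-get update && sudo apt-get -y install diamond python-diamond'
done < hosts.txt
```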

Done today:

1033
1011
1019
1028

Only the Kafka Brokers (will be tracked in a different ticket) and the new Hadoop Namenodes are left.

Ottomata renamed this task from Upgrade Analytics Cluster to Trusty, and then to CDH 5.2 to Upgrade Analytics Cluster to Trusty, and then to CDH 5.3. Jan 16 2015, 4:51 PM

analytics1001 and analytics1002 have been provisioned, and the Hadoop NameNode and YARN master services have been migrated off of analytics1010 and analytics1004 (Ciscos). The older Ciscos can now be decommissioned.

This means that all Hadoop NameNodes are now running on Trusty, woo!

I can now practice migrating to CDH 5.3, and puppetize it. I will do so first in Vagrant, then in labs, before attempting to upgrade the production cluster.

Ottomata updated the task description. (Show Details) Feb 11 2015, 3:32 PM
Ottomata added a comment. Edited Feb 11 2015, 11:22 PM

Today I practiced this in Vagrant and in Labs. I'd like to do it one more time in labs. My preliminary procedure will be this:

(compiled from http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_earlier_cdh5_upgrade.html)

 0. apt-get update everywhere

 1. stop puppet everywhere, unmount /mnt/hdfs on stat1002 and analytics1027 just to be safe: 
    sudo umount /mnt/hdfs

 2. snapshot namenode metadata
    sudo -u hdfs hdfs dfsadmin -safemode enter
    sudo -u hdfs hdfs dfsadmin -saveNamespace

 3. Stop all hadoop services everywhere.
    for x in `cd /etc/init.d ; ls hadoop-*` ; do sudo service $x stop ; done

 4. backup namenode metadata
   sudo tar -cvf  ~/hadoop-namedir-backup.$(hostname).$(date +%s) /var/lib/hadoop/name

 5. Install new packages everywhere:
 
    # analytics1001 - YARN master and Primary NameNode
    apt-get install hadoop-yarn-resourcemanager hadoop-hdfs-namenode hadoop-mapreduce-historyserver hadoop-httpfs
    
    # analytics1002 - Standby NameNode
    apt-get install hadoop-hdfs-namenode
    
    # JournalNodes
    apt-get install hadoop-yarn-nodemanager hadoop-hdfs-datanode hadoop-mapreduce hadoop-hdfs-journalnode hive-hcatalog parquet parquet-format
    
    # Workers
    apt-get install hadoop-yarn-nodemanager hadoop-hdfs-datanode hadoop-mapreduce hive-hcatalog parquet parquet-format
    
    # Clients
    apt-get install hadoop-client hive-hcatalog hive mahout pig sqoop oozie-client parquet parquet-format
 
    # analytics1027
    apt-get install hadoop-client hive-hcatalog hive hive-metastore hive-server2 mahout pig sqoop oozie parquet parquet-format
 
 6. Start Journalnodes
    sudo service hadoop-hdfs-journalnode start
    # Important: in an HA deployment, the JournalNodes must be up and running CDH 5 before you proceed.
 
 7. Upgrade NameNode metadata on analytics1001:
    sudo service hadoop-hdfs-namenode upgrade
    sudo tail -f /var/log/hadoop-hdfs/hadoop-hdfs-namenode-<hostname>.log 
    # Look for a line that confirms the upgrade is complete, such as: /var/lib/hadoop-hdfs/cache/hadoop/dfs/<name> is complete
    # The NameNode upgrade process can take a while, depending on how many files you have.
 
 8. Bootstrap Standby NameNode on analytics1002:
    sudo -u hdfs hdfs namenode -bootstrapStandby
    sudo service hadoop-hdfs-namenode start
 
 9. Start all DataNodes
    sudo service hadoop-hdfs-datanode start

10. Start YARN resourcemanager on analytics1001:
    sudo service hadoop-yarn-resourcemanager start

11. Start all NodeManagers:
    sudo service hadoop-yarn-nodemanager start

12. Start MapReduce JobHistory Server on analytics1001:
    sudo service hadoop-mapreduce-historyserver start 

13. Upgrade Hive - http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_hive_upgrade.html
    # on analytics1027
    sudo service hive-server2 stop
    sudo service hive-metastore stop
    # backup MySQL db
    sudo mysqldump -u root hive_metastore > ~/hive_metastore-backup.$(date +%s).sql
    # Upgrade metastore db
    sudo /usr/lib/hive/bin/schematool -dbType mysql -upgradeSchemaFrom 0.12.0
    sudo service hive-metastore start
    sudo service hive-server2 start
    
14. Upgrade Oozie
    # on analytics1027
    sudo service oozie stop
    # backup MySQL db
    sudo mysqldump -u root oozie > ~/oozie-backup.$(date +%s).sql
    sudo apt-get install oozie oozie-client
    # Upgrade the oozie db
    sudo -u oozie /usr/lib/oozie/bin/ooziedb.sh upgrade -run
    # Upgrade the Oozie Shared Library
    sudo -u oozie hadoop fs -rmr /user/oozie/share
    sudo oozie-setup sharelib create -fs hdfs://analytics-hadoop -locallib /usr/lib/oozie/oozie-sharelib-yarn.tar.gz
    sudo service oozie start

15. Upgrade Hue
    # on analytics1027
    sudo service hue stop
    sudo apt-get install hue

16. Finalize the HDFS Metadata upgrade
    # on analytics1001
    sudo -u hdfs hdfs dfsadmin -finalizeUpgrade

(Hopefully) DONE!
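Before the final `-finalizeUpgrade` in step 16 (which is irreversible), it's worth confirming the cluster came back healthy; a sketch using stock HDFS/YARN tooling:

```shell
# On analytics1001: confirm all DataNodes re-registered and no blocks
# are missing before finalizing the metadata upgrade.
sudo -u hdfs hdfs dfsadmin -report | head -n 20
# fsck should end with: The filesystem under path '/' is HEALTHY
sudo -u hdfs hdfs fsck / | tail -n 20
# Confirm NodeManagers rejoined YARN.
yarn node -list
```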

Alright! Some oozie jobs are busy backfilling the time they were offline, but everything is looking good. I'm going to wait a day or two to make sure everything is fine, but thus far I think we are done.

Ottomata closed this task as Resolved. Feb 24 2015, 3:05 PM

All is good!