
Upgrade the Analytics Hadoop cluster to Apache Bigtop
Closed, Resolved · Public


The Analytics Hadoop cluster will change Hadoop distribution, from Cloudera CDH 5.16 to Apache Bigtop 1.5.

Several big changes are coming with this; the most important ones:

  • hadoop 2.6.x -> 2.10.1
  • hive 1.1 -> 2.3.6

Even if the Hadoop version bump seems minor, in practice it will be a complex one, requiring downtime of the whole cluster for some hours. We have backed up all the important data that cannot be re-created (like Pageviews, etc.) to a separate cluster, so if the upgrade goes sideways we'll have a good recovery path.

The upgrade is scheduled for February 9th (Tuesday), and it should last from 2 to 4 hours in the EU morning (more precise timings will be added).


High level maintenance window: 08:00 -> 20:00 CET

UPDATE: we extended the maintenance window to 14:00 CET due to backups taking a long time to complete.

UPDATE 2: due to an upgrade issue, we extended the maintenance window again to 16:00 CET, really sorry :(

UPDATE 3: due to an upgrade issue, we extended the maintenance window again to 20:00 CET :(

Event Timeline

Change 661974 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set Apache Bigtop 1.5 as default hadoop distro

There is one last thing that we haven't discussed; I forgot to follow up at the time. The avro-libs and parquet packages (that were shipped by CDH) are not present in Bigtop, but so far I haven't found any trace of issues because of that. The avro libs seem to be incorporated into the hadoop packages now, and for parquet we rely, afaics, on the libs shipped with the various tools (hive, spark, druid, etc.).

# CDH worker
elukey@an-worker1080:~$ dpkg -L avro-libs | grep .jar

# Bigtop worker

elukey@an-worker1117:~$ dpkg -S avro | grep .jar
hadoop-client: /usr/lib/hadoop/client/avro-1.7.7.jar
hadoop-yarn: /usr/lib/hadoop-yarn/lib/avro-1.7.7.jar
hadoop-mapreduce: /usr/lib/hadoop-mapreduce/lib/avro-1.7.7.jar
spark2: /usr/lib/spark2/jars/avro-ipc-1.8.2.jar
sqoop: /usr/lib/sqoop/lib/parquet-avro-1.4.1.jar
hadoop: /usr/lib/hadoop/lib/avro-1.7.7.jar
spark2: /usr/lib/spark2/jars/avro-mapred-1.8.2-hadoop2.jar
hive: /usr/lib/hive/lib/avro-1.7.7.jar
hadoop-mapreduce: /usr/lib/hadoop-mapreduce/avro-1.7.7.jar
spark2: /usr/lib/spark2/jars/avro-1.8.2.jar
sqoop: /usr/lib/sqoop/lib/avro-1.7.5.jar
hadoop-client: /usr/lib/hadoop/client/avro.jar
sqoop: /usr/lib/sqoop/lib/avro-mapred-1.7.5-hadoop2.jar
# CDH worker

elukey@an-worker1080:~$ dpkg -L parquet | grep .jar

elukey@an-worker1080:~$ dpkg -L parquet-format | grep .jar

# Bigtop worker
elukey@an-worker1117:~$ dpkg -S parquet | grep .jar
sqoop: /usr/lib/sqoop/lib/parquet-column-1.4.1.jar
spark2: /usr/lib/spark2/jars/parquet-common-1.10.1.jar
sqoop: /usr/lib/sqoop/lib/parquet-hadoop-1.4.1.jar
sqoop: /usr/lib/sqoop/lib/parquet-avro-1.4.1.jar
sqoop: /usr/lib/sqoop/lib/parquet-encoding-1.4.1.jar
spark2: /usr/lib/spark2/jars/parquet-hadoop-1.10.1.jar
spark2: /usr/lib/spark2/jars/parquet-column-1.10.1.jar
spark2: /usr/lib/spark2/jars/parquet-format-2.4.0.jar
spark2: /usr/lib/spark2/jars/parquet-jackson-1.10.1.jar
sqoop: /usr/lib/sqoop/lib/parquet-common-1.4.1.jar
hive: /usr/lib/hive/lib/parquet-hadoop-bundle-1.8.1.jar
sqoop: /usr/lib/sqoop/lib/parquet-generator-1.4.1.jar
spark2: /usr/lib/spark2/jars/parquet-hadoop-bundle-1.6.0.jar
spark2: /usr/lib/spark2/jars/parquet-encoding-1.10.1.jar
sqoop: /usr/lib/sqoop/lib/parquet-format-2.0.0.jar
sqoop: /usr/lib/sqoop/lib/parquet-jackson-1.4.1.jar

Avro use cases
  • sqoop and mediawiki history oozie coordinators seem to use org.apache.hadoop.hive.serde2.avro, so we should be covered.
  • Druid, if anything, uses its own Avro extension with its own bundled libs.
Parquet use cases

As far as I know we have the following use cases for Parquet:

  • oozie jobs like webrequest_load, which end up creating Parquet files in Hive. These work in the Hadoop test cluster; we have never seen an issue.
  • Druid indexation via Parquet (namely reading parquet files). This seems to be handled by the Druid parquet libs, see indexation configs ("inputFormat": ""). Druid is also packaged with a specific extension for Parquet.
  • Spark-based jobs reading/writing parquet files, mostly via Hive. The Navigation Timing refine job has been running fine so far.
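For completeness, the same checks can be done without dpkg by scanning the library directories directly. A small sketch (the helper and the directory list are mine, not an existing tool):

```python
import glob
import os

def find_jars(prefix, search_dirs):
    """Return jar paths whose basename starts with the given prefix
    (e.g. 'parquet' or 'avro') under the given library directories."""
    matches = []
    for d in search_dirs:
        pattern = os.path.join(d, "**", prefix + "*.jar")
        matches.extend(glob.glob(pattern, recursive=True))
    return sorted(matches)

if __name__ == "__main__":
    # Library dirs probed on a Bigtop worker (paths assumed)
    dirs = ["/usr/lib/hadoop", "/usr/lib/hive/lib", "/usr/lib/spark2/jars"]
    for jar in find_jars("parquet", dirs):
        print(jar)
```

Running it on a worker should list the same per-tool parquet jars the dpkg -S output above shows.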

I don't know if I have missed anything, but it seems that CDH was offering more than what we needed. Double checking with @Ottomata and @JAllemandou to be sure. Sorry for the last minute request, but it fell through the cracks :(

I don't have any knowledge of those specific packages being needed.

Change 661974 merged by Elukey:
[operations/puppet@production] Set Apache Bigtop 1.5 as default hadoop distro

Change 663003 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Bump Hadoop datanode heap to 8G

Change 663003 merged by Ottomata:
[operations/puppet@production] Bump Hadoop datanode heap to 8G

Change 663009 had a related patch set uploaded (by Ottomata; owner: Joal):
[operations/puppet@production] Reduce Hadoop Yarn available memory from 4G

elukey updated the task description.
elukey updated the task description.

Problems we have found a solution for (even if not great):

  • Hive jobs failing due to new reserved keywords in HQL
  • Hive jobs failing due to a UDF type problem when the UDF is used in a CTE (weird)

Problems we have not yet solved:

  • cassandra loading jobs have both log4j-over-slf4j and slf4j-log4j12 jars on the class-path. Our jar contains the latter; I have not found where the former comes from.
  • the refine job has an issue with a URI being reported as relative while it seems absolute (Andrew is still working on it)
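The class-path clash above is the classic slf4j cycle: log4j-over-slf4j routes log4j calls into slf4j, while slf4j-log4j12 routes slf4j back into log4j, so having both is forbidden. A hypothetical helper to spot the pair on a job's class-path could look like:

```python
import os

# Bridge jars that must not coexist: together they create a routing loop
# between log4j and slf4j, which slf4j refuses at startup.
CONFLICTING_PREFIXES = ("log4j-over-slf4j", "slf4j-log4j12")

def find_slf4j_conflict(classpath_entries):
    """Return the conflicting jars found if both bridges are present,
    otherwise None."""
    found = {p: [] for p in CONFLICTING_PREFIXES}
    for entry in classpath_entries:
        name = os.path.basename(entry)
        for prefix in CONFLICTING_PREFIXES:
            if name.startswith(prefix):
                found[prefix].append(entry)
    if all(found[p] for p in CONFLICTING_PREFIXES):
        return found
    return None
```

Feeding it the expanded class-path of the cassandra loading job would show which directory contributes the log4j-over-slf4j jar.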

Change 663061 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Refine - use spark assembly without hadoop jars

the refine job has an issue with a URI being reported as relative while it seems absolute (Andrew is still working on it)

I think I fixed it. There is some very strange problem with the fact that we use a Spark distribution with Hadoop 2.6 jars provided in it. They are in /usr/lib/spark2/jars as well as in the archive we have in HDFS and are using for spark.yarn.archive. I created a new archive by just removing the hadoop jars, and was able to run Refine jobs using it. This is a temporary workaround to allow these jobs to work now. We'll need to figure this out more thoroughly for all Spark usages later. Most likely we'll need to repackage Spark without Hadoop and rebuild and use the spark assembly zip too.

Not sure how this will affect other Spark jobs.
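The workaround described above, rebuilding the spark.yarn.archive zip with the bundled Hadoop jars dropped, could be sketched roughly like this (function and file names are illustrative, not the actual patch):

```python
import zipfile

def strip_hadoop_jars(src_zip, dst_zip):
    """Copy a Spark assembly zip, dropping the bundled hadoop-*.jar entries
    so that the cluster's own Hadoop 2.10 jars get used instead."""
    with zipfile.ZipFile(src_zip) as src, \
         zipfile.ZipFile(dst_zip, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            base = item.filename.rsplit("/", 1)[-1]
            if base.startswith("hadoop-") and base.endswith(".jar"):
                continue  # skip the Hadoop 2.6 jars shipped in the assembly
            dst.writestr(item, src.read(item.filename))
```

The stripped zip would then be uploaded to HDFS and pointed at by spark.yarn.archive.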

Change 663061 merged by Ottomata:
[operations/puppet@production] Refine - use spark assembly without hadoop jars

Change 663062 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Fix type in refine.pp spark conf

Change 663062 merged by Ottomata:
[operations/puppet@production] Fix type in refine.pp spark conf

The upgrade took way more hours than expected, so I think it is good to explain exactly what happened.

High level background about Hadoop and the upgrade procedure

At a very high level, Hadoop is composed of two main systems: Yarn and HDFS. The former manages vcores and memory for compute on a pool of worker nodes; the latter is a distributed file system that Yarn works upon. Both systems follow the leader-worker pattern: a bunch of worker nodes are told what to do by an active/standby set of leader nodes that keep the overall distributed state.

This upgrade was about moving from the Cloudera CDH distribution to the Apache Bigtop one, which implied upgrading Hadoop from 2.6 to 2.10.1. Even if it seems a minor upgrade on paper, in reality it was a complex one since the HDFS distributed file system was involved. Before starting, we backed up ~400TB of non-recoverable data on a separate Hadoop cluster (we call it "backup"; it is made of new hadoop worker nodes that will eventually join the main cluster).

The upgrade procedure was split into four main steps:

  1. Drain the cluster of any jobs/users and take the last backup of data.
  2. Stop the Hadoop cluster completely to allow maintenance.
  3. Upgrade all Hadoop packages and configs on Hadoop Master/Worker nodes.
  4. Upgrade Hadoop packages on satellite systems and clients (Druid, Presto, stat100x, etc.)

We thought that all the steps would last 4-6 hours maximum (without rushing), but in the end we spent 12-14 hours on it. The main issue happened right after step 3: the HDFS file system status right after the upgrade was inconsistent, and it took a while to figure out the exact root cause and how to fix it. Once we fixed the HDFS status, we continued with the upgrade and completed it without other issues.

If you want to know more about the details of the HDFS inconsistent state, please read below, otherwise feel free to skip :)

HDFS inconsistent status after the upgrade

The HDFS file system is composed of two main categories of services:

  • Namenode (usually two): the masters that keep the state of the distributed file system (files and where their blocks are stored). The Namenodes are also the ones dealing with clients.
  • Datanode (usually a lot): the workers running on every Hadoop worker node. They don't know anything about files, only about the blocks they manage.

During a cluster bootstrap (like the one right after the package upgrade), the Datanodes contact the Namenodes, sending block reports and their status. The Namenodes collect all the info about blocks and compare it with their view of the file system. Any inconsistency is reported in several metrics and monitored by alerts. The Namenodes stay in a status called "safe mode" (to simplify: file system read-only) until a safe threshold of reported blocks is reached, to make sure that the cluster is healthy.
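The safe-mode exit condition boils down to a ratio check. A minimal sketch, assuming the default dfs.namenode.safemode.threshold-pct value of 0.999:

```python
def can_leave_safe_mode(blocks_reported, blocks_total, threshold_pct=0.999):
    """The Namenode stays in safe mode (read-only) until the fraction of
    blocks with at least one reported replica reaches the threshold."""
    if blocks_total == 0:
        return True  # empty file system, nothing to wait for
    return blocks_reported / blocks_total >= threshold_pct
```

With ~43M total blocks, even a small fraction of Datanodes failing to send block reports is enough to keep the Namenodes below the threshold and the cluster read-only.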

Right after the package upgrade, only half of the Datanodes registered themselves correctly with the Namenodes. We started to investigate why, but from the logs it was not immediately clear whether the problem was data corruption or something else. The Datanode daemons on some nodes reported failures while preparing the directories for the upgrade, together with a lot of stalls due to GC activity and memory pressure.

Before proceeding, a little info about what the Datanodes do to prepare for an upgrade. When they realize that the on-disk format of the blocks they manage is older than their software version (in this case, 2.6 vs 2.10.1), they split files into two main directories, previous and current. The idea is to cleverly use hard links to get a sort of copy-on-write snapshot of the file system right after the upgrade, so that a rollback is possible if needed (leveraging the previous directory and its blocks).
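The previous/current hard-link trick can be illustrated with a toy sketch (heavily simplified; the real Datanode layout has more structure than this):

```python
import os

def prepare_upgrade(block_dir, upgrade_dir):
    """Toy version of the Datanode upgrade layout: move the existing block
    files under previous/ and hard-link them into current/. Both trees then
    share the same on-disk data (copy-on-write style), so a rollback only
    needs to restore previous/."""
    previous = os.path.join(upgrade_dir, "previous")
    current = os.path.join(upgrade_dir, "current")
    os.makedirs(previous)
    os.makedirs(current)
    for name in os.listdir(block_dir):
        src = os.path.join(block_dir, name)
        os.rename(src, os.path.join(previous, name))
        # Hard link: same inode, no data copied
        os.link(os.path.join(previous, name), os.path.join(current, name))
```

Because only links are created, the disk cost is near zero, but as we found out, scanning and linking millions of block files is not free in CPU and memory terms.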

We realized that some of the Datanodes failing to report blocks were thrashing due to memory pressure and GC activity, caused by scanning block files on the host's file system and creating hard links. The Datanodes had run fine with 4G of Java heap up to that moment, but for the upgrade they needed way more (we think due to a bug/perf problem; probably all the block file names/paths were loaded onto the heap rather than following a more conservative approach).

At this point we tried to increase the Java heap size on some canary workers, and after some tries a 16G (!!) max heap led to results: some Datanodes started to report blocks to the Namenodes. We kept going with more Datanodes, until eventually the Namenodes reported that all Datanodes had sent block reports.
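We rolled the heap bump out via puppet, but on a single host it amounts to a config fragment along these lines (a sketch of hadoop-env.sh; exact flags are assumed):

```shell
# hadoop-env.sh -- temporary Datanode heap bump for the upgrade's block
# scan / hard-link phase (reverted once the cluster settled)
export HADOOP_DATANODE_OPTS="-Xms16g -Xmx16g ${HADOOP_DATANODE_OPTS}"
```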

At this point we hoped we were out of the mud, but that was not the case. The Namenodes were refusing to leave safe mode, due to fewer blocks reported than expected. After reading a lot of confusing and cryptic logs on the Namenodes, we identified some Datanodes that we thought hadn't bootstrapped correctly (reporting fewer blocks than were available on disk), so we selectively restarted some of them again. This led to positive results: the Namenodes correctly left safe mode.

Not yet out of the mud, sadly. At this point the Namenodes reported the following:

  • ~40,000 missing blocks
  • ~500 corrupted blocks
  • ~43M blocks in total
  • ~12M under-replicated blocks

Each block is replicated three times (on different Datanodes), so the missing blocks, even if few compared to the total, were not a good sign: they meant that some blocks had none of their three replicas available. After a long dig through the logs, we decided to do a rolling restart of all the Datanodes on the workers (59), which took a while since we needed to go slowly to avoid causing more damage. After the rolling restart we finally reached a consistent state, the problems disappeared, and we were able to proceed with the migration.
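The relationship between those numbers follows directly from the per-block replica counts. A toy sketch of how blocks bucket into the metrics above (a simplification of the Namenode's actual accounting):

```python
REPLICATION_FACTOR = 3  # the cluster's target replication

def classify_blocks(replica_counts):
    """Given the live replica count for each block, bucket blocks the way
    the Namenode metrics do: missing (no replicas at all), under-replicated
    (fewer than the target), and healthy."""
    summary = {"missing": 0, "under_replicated": 0, "healthy": 0}
    for count in replica_counts:
        if count == 0:
            summary["missing"] += 1
        elif count < REPLICATION_FACTOR:
            summary["under_replicated"] += 1
        else:
            summary["healthy"] += 1
    return summary
```

This is why the rolling restart helped: each Datanode that re-registered correctly moved its blocks' replica counts up, draining both the missing and under-replicated buckets until the file system was consistent again.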