
Remove unused CDH packages from Hadoop nodes
Closed, Resolved (Public)

Description

The idea is to apt-get remove the packages that are currently deployed on the Hadoop nodes but not used. This cleanup is needed to establish exactly which packages will be required in another distribution such as BigTop.

Event Timeline

elukey@an-worker1080:~$ dpkg -l | grep cdh
ii  avro-libs                             1.7.6+cdh5.16.1+143-1.cdh5.16.1.p0.3~jessie-cdh5.16.1  all          Data serialization system
ii  bigtop-jsvc                           0.6.0+cdh5.16.1+934-1.cdh5.16.1.p0.3~jessie-cdh5.16.1  amd64        Application to launch java daemon
ii  bigtop-tomcat                         0.7.0+cdh5.16.1+0-1.cdh5.16.1.p0.3~jessie-cdh5.16.1    all          Apache Tomcat
ii  bigtop-utils                          0.7.0+cdh5.16.1+0-1.cdh5.16.1.p0.3~jessie-cdh5.16.1    all          Collection of useful tools for Bigtop
ii  flume-ng                              1.6.0+cdh5.16.1+192-1.cdh5.16.1.p0.3~jessie-cdh5.16.1  all          Flume is a reliable, scalable, and manageable distributed log collection application for collecting data such as logs and delivering it to data stores such as Hadoop's HDFS.
ii  hadoop                                2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          A software platform for processing vast amounts of data
ii  hadoop-0.20-mapreduce                 2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 amd64        A software platform for processing vast amounts of data
ii  hadoop-client                         2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          Hadoop client side dependencies
ii  hadoop-hdfs                           2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          The Hadoop Distributed File System
ii  hadoop-hdfs-datanode                  2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          Data Node for Hadoop
ii  hadoop-mapreduce                      2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          The Hadoop MapReduce (MRv2)
ii  hadoop-yarn                           2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          The Hadoop NextGen MapReduce (YARN)
ii  hadoop-yarn-nodemanager               2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          Node manager for Hadoop
ii  hive                                  1.1.0+cdh5.16.1+1431-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          Hive is a data warehouse infrastructure built on top of Hadoop
ii  hive-hcatalog                         1.1.0+cdh5.16.1+1431-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          Apache HCatalog is a table and storage management service.
ii  hive-jdbc                             1.1.0+cdh5.16.1+1431-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          Provides libraries necessary to connect to Apache Hive via JDBC
ii  kite                                  1.0.0+cdh5.16.1+151-1.cdh5.16.1.p0.3~jessie-cdh5.16.1  all          Kite Software Development Kit.
ii  libhdfs0                              2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 amd64        JNI Bindings to access Hadoop HDFS from C
ii  parquet                               1.5.0+cdh5.16.1+200-1.cdh5.16.1.p0.3~jessie-cdh5.16.1  all          A columnar storage format for Hadoop.
ii  parquet-format                        2.1.0+cdh5.16.1+22-1.cdh5.16.1.p0.3~jessie-cdh5.16.1   all          Format definitions for Parquet
ii  sentry                                1.5.1+cdh5.16.1+559-1.cdh5.16.1.p0.3~jessie-cdh5.16.1  all          A system for enforcing fine grained role based authorization to data and metadata stored on a Hadoop cluster.
ii  solr                                  4.10.3+cdh5.16.1+532-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          Apache Solr is the popular, blazing fast open source enterprise search platform
ii  spark-core                            1.6.0+cdh5.16.1+577-1.cdh5.16.1.p0.3~jessie-cdh5.16.1  all          Lightning-Fast Cluster Computing
ii  sqoop                                 1.4.6+cdh5.16.1+140-1.cdh5.16.1.p0.3~jessie-cdh5.16.1  all          Tool for easy imports and exports of data sets between databases and HDFS
ii  zookeeper                             3.4.5+cdh5.16.1+155-1.cdh5.16.1.p0.3~jessie-cdh5.16.1  all          A high-performance coordination service for distributed applications.

For all the workers, I'd say that the following packages are not needed: flume-ng, kite, sentry, solr, spark-core
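Before removing anything, it can help to confirm which of the candidates are actually installed on a given host. The following is a minimal sketch, not the procedure used in this task: `cdh_packages` and `removal_candidates` are hypothetical helper names, and the sketch assumes the `dpkg -l` layout shown above (status, name, version in the first three columns).

```shell
# Hypothetical helpers for illustration; not part of any existing tooling.

# Print the names of installed ("ii") packages whose version contains "cdh".
# Expects `dpkg -l` style output on stdin.
cdh_packages() {
    awk '$1 == "ii" && $3 ~ /cdh/ { print $2 }'
}

# Print the installed CDH packages that also appear in the candidate list
# passed as arguments. Expects `dpkg -l` style output on stdin.
removal_candidates() {
    candidates="$*"
    awk -v list="$candidates" '
        BEGIN { n = split(list, a, " "); for (i = 1; i <= n; i++) want[a[i]] = 1 }
        $1 == "ii" && $3 ~ /cdh/ && ($2 in want) { print $2 }
    '
}

# Example, to be run on a worker:
#   dpkg -l | removal_candidates flume-ng kite sentry solr spark-core
```

Reading from stdin keeps the helpers free of side effects, so they can be tried against a saved listing from any host before touching a live node.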

elukey@an-coord1001:~$ dpkg -l | grep cdh
ii  avro-libs                                 1.7.6+cdh5.16.1+143-1.cdh5.16.1.p0.3~jessie-cdh5.16.1  all          Data serialization system
ii  bigtop-jsvc                               0.6.0+cdh5.16.1+934-1.cdh5.16.1.p0.3~jessie-cdh5.16.1  amd64        Application to launch java daemon
ii  bigtop-tomcat                             0.7.0+cdh5.16.1+0-1.cdh5.16.1.p0.3~jessie-cdh5.16.1    all          Apache Tomcat
ii  bigtop-utils                              0.7.0+cdh5.16.1+0-1.cdh5.16.1.p0.3~jessie-cdh5.16.1    all          Collection of useful tools for Bigtop
ii  flume-ng                                  1.6.0+cdh5.16.1+192-1.cdh5.16.1.p0.3~jessie-cdh5.16.1  all          Flume is a reliable, scalable, and manageable distributed log collection application for collecting data such as logs and delivering it to data stores such as Hadoop's HDFS.
ii  hadoop                                    2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          A software platform for processing vast amounts of data
ii  hadoop-0.20-mapreduce                     2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 amd64        A software platform for processing vast amounts of data
ii  hadoop-client                             2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          Hadoop client side dependencies
ii  hadoop-hdfs                               2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          The Hadoop Distributed File System
ii  hadoop-hdfs-fuse                          2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 amd64        HDFS exposed over a Filesystem in Userspace
ii  hadoop-mapreduce                          2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          The Hadoop MapReduce (MRv2)
ii  hadoop-yarn                               2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          The Hadoop NextGen MapReduce (YARN)
ii  hbase                                     1.2.0+cdh5.16.1+482-1.cdh5.16.1.p0.3~jessie-cdh5.16.1  all          HBase is the Hadoop database. Use it when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware.
ii  hive                                      1.1.0+cdh5.16.1+1431-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          Hive is a data warehouse infrastructure built on top of Hadoop
ii  hive-hcatalog                             1.1.0+cdh5.16.1+1431-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          Apache HCatalog is a table and storage management service.
ii  hive-jdbc                                 1.1.0+cdh5.16.1+1431-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          Provides libraries necessary to connect to Apache Hive via JDBC
ii  hive-metastore                            1.1.0+cdh5.16.1+1431-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          Shared metadata repository for Hive
ii  hive-server2                              1.1.0+cdh5.16.1+1431-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          Provides a Hive Thrift service with improved concurrency support.
ii  hive-webhcat                              1.1.0+cdh5.16.1+1431-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          WebHcat provides a REST-like web API for HCatalog and related Hadoop components.
ii  kite                                      1.0.0+cdh5.16.1+151-1.cdh5.16.1.p0.3~jessie-cdh5.16.1  all          Kite Software Development Kit.
ii  libhdfs0                                  2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 amd64        JNI Bindings to access Hadoop HDFS from C
ii  mahout                                    0.9+cdh5.16.1+38-1.cdh5.16.1.p0.3~jessie-cdh5.16.1     all          A set of Java libraries for scalable machine learning.
ii  oozie                                     4.1.0+cdh5.16.1+503-1.cdh5.16.1.p0.3~jessie-cdh5.16.1  all          Oozie is a system that runs workflows of Hadoop jobs.
ii  oozie-client                              4.1.0+cdh5.16.1+503-1.cdh5.16.1.p0.3~jessie-cdh5.16.1  all          Client for Oozie Workflow Engine
ii  parquet                                   1.5.0+cdh5.16.1+200-1.cdh5.16.1.p0.3~jessie-cdh5.16.1  all          A columnar storage format for Hadoop.
ii  parquet-format                            2.1.0+cdh5.16.1+22-1.cdh5.16.1.p0.3~jessie-cdh5.16.1   all          Format definitions for Parquet
ii  pig                                       0.12.0+cdh5.16.1+117-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          Pig is a platform for analyzing large data sets
ii  pig-udf-datafu                            1.1.0+cdh5.16.1+29-1.cdh5.16.1.p0.3~jessie-cdh5.16.1   all          A collection of user-defined functions for Hadoop and Pig.
ii  sentry                                    1.5.1+cdh5.16.1+559-1.cdh5.16.1.p0.3~jessie-cdh5.16.1  all          A system for enforcing fine grained role based authorization to data and metadata stored on a Hadoop cluster.
ii  solr                                      4.10.3+cdh5.16.1+532-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          Apache Solr is the popular, blazing fast open source enterprise search platform
ii  spark-core                                1.6.0+cdh5.16.1+577-1.cdh5.16.1.p0.3~jessie-cdh5.16.1  all          Lightning-Fast Cluster Computing
ii  sqoop                                     1.4.6+cdh5.16.1+140-1.cdh5.16.1.p0.3~jessie-cdh5.16.1  all          Tool for easy imports and exports of data sets between databases and HDFS
ii  zookeeper                                 3.4.5+cdh5.16.1+155-1.cdh5.16.1.p0.3~jessie-cdh5.16.1  all          A high-performance coordination service for distributed applications.

Probably not needed: flume-ng, hbase, kite, mahout, pig, pig-udf-datafu, sentry, solr, spark-core

elukey@an-master1001:~$ dpkg -l | grep cdh
ii  avro-libs                            1.7.6+cdh5.16.1+143-1.cdh5.16.1.p0.3~jessie-cdh5.16.1  all          Data serialization system
ii  bigtop-jsvc                          0.6.0+cdh5.16.1+934-1.cdh5.16.1.p0.3~jessie-cdh5.16.1  amd64        Application to launch java daemon
ii  bigtop-utils                         0.7.0+cdh5.16.1+0-1.cdh5.16.1.p0.3~jessie-cdh5.16.1    all          Collection of useful tools for Bigtop
ii  hadoop                               2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          A software platform for processing vast amounts of data
ii  hadoop-0.20-mapreduce                2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 amd64        A software platform for processing vast amounts of data
ii  hadoop-client                        2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          Hadoop client side dependencies
ii  hadoop-hdfs                          2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          The Hadoop Distributed File System
ii  hadoop-hdfs-namenode                 2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          Name Node for Hadoop
ii  hadoop-hdfs-zkfc                     2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          Hadoop HDFS failover controller
ii  hadoop-mapreduce                     2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          The Hadoop MapReduce (MRv2)
ii  hadoop-mapreduce-historyserver       2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          MapReduce History Server
ii  hadoop-yarn                          2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          The Hadoop NextGen MapReduce (YARN)
ii  hadoop-yarn-resourcemanager          2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all          Resource manager for Hadoop
ii  libhdfs0                             2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 amd64        JNI Bindings to access Hadoop HDFS from C
ii  parquet                              1.5.0+cdh5.16.1+200-1.cdh5.16.1.p0.3~jessie-cdh5.16.1  all          A columnar storage format for Hadoop.
ii  parquet-format                       2.1.0+cdh5.16.1+22-1.cdh5.16.1.p0.3~jessie-cdh5.16.1   all          Format definitions for Parquet
ii  zookeeper                            3.4.5+cdh5.16.1+155-1.cdh5.16.1.p0.3~jessie-cdh5.16.1  all          A high-performance coordination service for distributed applications.

This already looks good!

It turned out that, because of a lot of reverse dependencies, the only packages I was able to remove from the Hadoop test workers were flume-ng, spark-core and spark-python. I'll wait a day and then apply the same change to the Hadoop Analytics cluster as well.
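Reverse dependencies of this kind can be spotted ahead of time with a simulated removal. This is a rough sketch rather than the exact check that was run here: `simulated_removals` and `collateral_removals` are hypothetical helper names, and the sketch assumes the `Remv <name> [<version>]` lines that `apt-get -s remove` prints in simulation mode.

```shell
# Hypothetical helpers for illustration; not part of any existing tooling.

# Print every package a simulated removal would take out.
# Expects `apt-get -s remove ...` output on stdin.
simulated_removals() {
    awk '$1 == "Remv" { print $2 }'
}

# Print the packages that would be removed although they were not
# explicitly requested (arguments = requested package names).
# These are the reverse dependencies dragged along by the removal.
collateral_removals() {
    requested="$*"
    simulated_removals | awk -v list="$requested" '
        BEGIN { n = split(list, a, " "); for (i = 1; i <= n; i++) asked[a[i]] = 1 }
        !($1 in asked) { print $1 }
    '
}

# Example, nothing is actually removed because of -s:
#   apt-get -s remove kite sentry solr | collateral_removals kite sentry solr
```

An empty result suggests the requested packages can be removed on their own; any name printed is a package that depends on one of them and would be removed too. `apt-cache rdepends --installed <package>` gives a similar view from the other direction.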

Mentioned in SAL (#wikimedia-analytics) [2020-01-15T10:37:08Z] <elukey> remove spark-core flume-ng from all the hadoop workers - T242754

Mentioned in SAL (#wikimedia-analytics) [2020-01-15T10:39:20Z] <elukey> remove flume-ng from all stat/notebooks - T242754

Mentioned in SAL (#wikimedia-analytics) [2020-01-15T10:44:17Z] <elukey> remove flume-ng and spark-python/core packages from an-coord1001,analytics1030,analytics-tool1001,analytics1039 - T242754

elukey triaged this task as Medium priority. Jan 15 2020, 10:44 AM
elukey added a project: Analytics-Kanban.
elukey set Final Story Points to 3.
elukey moved this task from Next Up to Done on the Analytics-Kanban board.