Description
The idea is to apt-get remove packages that are currently deployed on the Hadoop nodes but not used. This cleanup is needed to establish exactly which packages will be required in another distro such as BigTop.
Event Timeline
elukey@an-worker1080:~$ dpkg -l | grep cdh
ii avro-libs 1.7.6+cdh5.16.1+143-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Data serialization system
ii bigtop-jsvc 0.6.0+cdh5.16.1+934-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 amd64 Application to launch java daemon
ii bigtop-tomcat 0.7.0+cdh5.16.1+0-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Apache Tomcat
ii bigtop-utils 0.7.0+cdh5.16.1+0-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Collection of useful tools for Bigtop
ii flume-ng 1.6.0+cdh5.16.1+192-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Flume is a reliable, scalable, and manageable distributed log collection application for collecting data such as logs and delivering it to data stores such as Hadoop's HDFS.
ii hadoop 2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all A software platform for processing vast amounts of data
ii hadoop-0.20-mapreduce 2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 amd64 A software platform for processing vast amounts of data
ii hadoop-client 2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Hadoop client side dependencies
ii hadoop-hdfs 2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all The Hadoop Distributed File System
ii hadoop-hdfs-datanode 2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Data Node for Hadoop
ii hadoop-mapreduce 2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all The Hadoop MapReduce (MRv2)
ii hadoop-yarn 2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all The Hadoop NextGen MapReduce (YARN)
ii hadoop-yarn-nodemanager 2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Node manager for Hadoop
ii hive 1.1.0+cdh5.16.1+1431-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Hive is a data warehouse infrastructure built on top of Hadoop
ii hive-hcatalog 1.1.0+cdh5.16.1+1431-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Apache HCatalog is a table and storage management service.
ii hive-jdbc 1.1.0+cdh5.16.1+1431-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Provides libraries necessary to connect to Apache Hive via JDBC
ii kite 1.0.0+cdh5.16.1+151-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Kite Software Development Kit.
ii libhdfs0 2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 amd64 JNI Bindings to access Hadoop HDFS from C
ii parquet 1.5.0+cdh5.16.1+200-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all A columnar storage format for Hadoop.
ii parquet-format 2.1.0+cdh5.16.1+22-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Format definitions for Parquet
ii sentry 1.5.1+cdh5.16.1+559-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all A system for enforcing fine grained role based authorization to data and metadata stored on a Hadoop cluster.
ii solr 4.10.3+cdh5.16.1+532-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Apache Solr is the popular, blazing fast open source enterprise search platform
ii spark-core 1.6.0+cdh5.16.1+577-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Lightning-Fast Cluster Computing
ii sqoop 1.4.6+cdh5.16.1+140-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Tool for easy imports and exports of data sets between databases and HDFS
ii zookeeper 3.4.5+cdh5.16.1+155-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all A high-performance coordination service for distributed applications.
For all the workers, I'd say that the following packages are not needed: flume-ng, kite, sentry, solr, spark-core
elukey@an-coord1001:~$ dpkg -l | grep cdh
ii avro-libs 1.7.6+cdh5.16.1+143-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Data serialization system
ii bigtop-jsvc 0.6.0+cdh5.16.1+934-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 amd64 Application to launch java daemon
ii bigtop-tomcat 0.7.0+cdh5.16.1+0-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Apache Tomcat
ii bigtop-utils 0.7.0+cdh5.16.1+0-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Collection of useful tools for Bigtop
ii flume-ng 1.6.0+cdh5.16.1+192-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Flume is a reliable, scalable, and manageable distributed log collection application for collecting data such as logs and delivering it to data stores such as Hadoop's HDFS.
ii hadoop 2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all A software platform for processing vast amounts of data
ii hadoop-0.20-mapreduce 2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 amd64 A software platform for processing vast amounts of data
ii hadoop-client 2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Hadoop client side dependencies
ii hadoop-hdfs 2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all The Hadoop Distributed File System
ii hadoop-hdfs-fuse 2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 amd64 HDFS exposed over a Filesystem in Userspace
ii hadoop-mapreduce 2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all The Hadoop MapReduce (MRv2)
ii hadoop-yarn 2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all The Hadoop NextGen MapReduce (YARN)
ii hbase 1.2.0+cdh5.16.1+482-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all HBase is the Hadoop database. Use it when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware.
ii hive 1.1.0+cdh5.16.1+1431-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Hive is a data warehouse infrastructure built on top of Hadoop
ii hive-hcatalog 1.1.0+cdh5.16.1+1431-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Apache HCatalog is a table and storage management service.
ii hive-jdbc 1.1.0+cdh5.16.1+1431-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Provides libraries necessary to connect to Apache Hive via JDBC
ii hive-metastore 1.1.0+cdh5.16.1+1431-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Shared metadata repository for Hive
ii hive-server2 1.1.0+cdh5.16.1+1431-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Provides a Hive Thrift service with improved concurrency support.
ii hive-webhcat 1.1.0+cdh5.16.1+1431-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all WebHcat provides a REST-like web API for HCatalog and related Hadoop components.
ii kite 1.0.0+cdh5.16.1+151-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Kite Software Development Kit.
ii libhdfs0 2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 amd64 JNI Bindings to access Hadoop HDFS from C
ii mahout 0.9+cdh5.16.1+38-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all A set of Java libraries for scalable machine learning.
ii oozie 4.1.0+cdh5.16.1+503-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Oozie is a system that runs workflows of Hadoop jobs.
ii oozie-client 4.1.0+cdh5.16.1+503-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Client for Oozie Workflow Engine
ii parquet 1.5.0+cdh5.16.1+200-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all A columnar storage format for Hadoop.
ii parquet-format 2.1.0+cdh5.16.1+22-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Format definitions for Parquet
ii pig 0.12.0+cdh5.16.1+117-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Pig is a platform for analyzing large data sets
ii pig-udf-datafu 1.1.0+cdh5.16.1+29-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all A collection of user-defined functions for Hadoop and Pig.
ii sentry 1.5.1+cdh5.16.1+559-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all A system for enforcing fine grained role based authorization to data and metadata stored on a Hadoop cluster.
ii solr 4.10.3+cdh5.16.1+532-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Apache Solr is the popular, blazing fast open source enterprise search platform
ii spark-core 1.6.0+cdh5.16.1+577-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Lightning-Fast Cluster Computing
ii sqoop 1.4.6+cdh5.16.1+140-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Tool for easy imports and exports of data sets between databases and HDFS
ii zookeeper 3.4.5+cdh5.16.1+155-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all A high-performance coordination service for distributed applications.
Probably not needed: flume-ng, hbase, kite, mahout, pig, pig-udf-datafu, sentry, solr, spark-core
elukey@an-master1001:~$ dpkg -l | grep cdh
ii avro-libs 1.7.6+cdh5.16.1+143-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Data serialization system
ii bigtop-jsvc 0.6.0+cdh5.16.1+934-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 amd64 Application to launch java daemon
ii bigtop-utils 0.7.0+cdh5.16.1+0-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Collection of useful tools for Bigtop
ii hadoop 2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all A software platform for processing vast amounts of data
ii hadoop-0.20-mapreduce 2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 amd64 A software platform for processing vast amounts of data
ii hadoop-client 2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Hadoop client side dependencies
ii hadoop-hdfs 2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all The Hadoop Distributed File System
ii hadoop-hdfs-namenode 2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Name Node for Hadoop
ii hadoop-hdfs-zkfc 2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Hadoop HDFS failover controller
ii hadoop-mapreduce 2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all The Hadoop MapReduce (MRv2)
ii hadoop-mapreduce-historyserver 2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all MapReduce History Server
ii hadoop-yarn 2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all The Hadoop NextGen MapReduce (YARN)
ii hadoop-yarn-resourcemanager 2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Resource manager for Hadoop
ii libhdfs0 2.6.0+cdh5.16.1+2848-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 amd64 JNI Bindings to access Hadoop HDFS from C
ii parquet 1.5.0+cdh5.16.1+200-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all A columnar storage format for Hadoop.
ii parquet-format 2.1.0+cdh5.16.1+22-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all Format definitions for Parquet
ii zookeeper 3.4.5+cdh5.16.1+155-1.cdh5.16.1.p0.3~jessie-cdh5.16.1 all A high-performance coordination service for distributed applications.
This seems already good!
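For reference, a minimal sketch of how the per-host inventories above could be reduced to bare package names and checked against a list of removal candidates. The helper names cdh_packages and installed_candidates are made up for illustration and are not part of any existing tooling; the sketch assumes the usual `dpkg -l` column order (status, name, version, architecture, description).

```shell
#!/bin/sh
# Hypothetical helper: read `dpkg -l` output on stdin and print the names of
# fully installed ("ii") packages whose version string contains "cdh".
cdh_packages() {
    awk '$1 == "ii" && $3 ~ /cdh/ { print $2 }'
}

# Hypothetical helper: read `dpkg -l` output on stdin and print those of the
# candidate packages (given as arguments) that are actually installed.
installed_candidates() {
    names=$(cdh_packages)
    for pkg in "$@"; do
        # -x matches the whole line, -F treats the name as a fixed string
        if printf '%s\n' "$names" | grep -qxF -- "$pkg"; then
            printf '%s\n' "$pkg"
        fi
    done
    return 0
}
```

On a node this might be used as `dpkg -l | installed_candidates flume-ng kite sentry solr spark-core`, which only reads the package database and changes nothing.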
It turned out that, because of the many reverse dependencies, the only packages I was able to remove were flume-ng, spark-core and spark-python from the Hadoop test workers. I'll wait a day and then apply the same change to the Hadoop Analytics cluster as well.
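One way to see in advance what a removal would drag along is to run apt-get in simulation mode and read its "Remv" lines. The following is only a sketch of that idea, not the procedure actually used here; would_remove is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read the output of `apt-get -s remove ...` on stdin and
# print the name of every package that the simulated run would remove.
would_remove() {
    awk '$1 == "Remv" { print $2 }'
}

# Usage on a node (nothing is changed, -s only simulates):
#   apt-get -s remove flume-ng spark-core | would_remove
# Installed reverse dependencies of a single package can be listed with:
#   apt-cache rdepends --installed flume-ng
```

If the printed list contains packages that are still needed, the candidate is better left installed, which matches the outcome described above.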
Mentioned in SAL (#wikimedia-analytics) [2020-01-15T10:37:08Z] <elukey> remove spark-core flume-ng from all the hadoop workers - T242754
Mentioned in SAL (#wikimedia-analytics) [2020-01-15T10:39:20Z] <elukey> remove flume-ng from all stat/notebooks - T242754
Mentioned in SAL (#wikimedia-analytics) [2020-01-15T10:44:17Z] <elukey> remove flume-ng and spark-python/core packages from an-coord1001,analytics1030,analytics-tool1001,analytics1039 - T242754