[Hive] Investigate packaging, install, security monitoring.
Open, Medium, Public

Description

We need to investigate how we want to fetch, install and configure the service, and develop a process to monitor for vulnerabilities and handle software updates.

Apache Hive is distributed as a tarball of java code, scripts, etc.

Downloads are here: http://archive.apache.org/dist/hive

The project is tracked in GitHub (apache/hive) but they don't tag releases there. So far the source for release notes appears to be https://github.com/apache/hive/blob/master/RELEASE_NOTES.txt, but there's also info in Jira (Atlassian). I need to look more at this.
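
For context, Apache releases ship as tarballs with detached GPG signatures that verify against the project's KEYS file. A minimal sketch of the fetch-and-verify step (the version and filenames here are illustrative, not a decision):

"""Sketch: fetch and GPG-verify a Hive release tarball (illustrative version/paths)."""
import subprocess
import urllib.request

VERSION = "4.0.1"  # illustrative only
BASE = f"http://archive.apache.org/dist/hive/hive-{VERSION}"
TARBALL = f"apache-hive-{VERSION}-bin.tar.gz"

# Fetch the tarball, its detached signature, and the project signing keys
for name, url in [
    (TARBALL, f"{BASE}/{TARBALL}"),
    (f"{TARBALL}.asc", f"{BASE}/{TARBALL}.asc"),
    ("KEYS", "https://downloads.apache.org/hive/KEYS"),
]:
    urllib.request.urlretrieve(url, name)

subprocess.run(["gpg", "--import", "KEYS"], check=True)
# Raises CalledProcessError if the signature does not verify
subprocess.run(["gpg", "--verify", f"{TARBALL}.asc", TARBALL], check=True)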

Event Timeline

I added a Package Getter module to fetch/verify/package this project. It isn't complete, in that it doesn't fetch/display changelogs yet, but it's enough to unblock progress on getting the service up and running. It installs to /usr/lib/hive.

Jgreen triaged this task as Medium priority. Oct 28 2024, 5:25 PM

We needed to add a MySQL/MariaDB Java driver. We ended up going with mysql-connector-j, installed from mysql.com's deb repository.

Gehel moved this task from Scratch to Watching on the Data-Platform-SRE board.

@Jgreen - you might like to know about Apache Bigtop: https://bigtop.apache.org/
It is mentioned on this page, which talks about different Hadoop distributions and suppliers.

Essentially, it's a framework for building, testing and distributing containers and packages of many different big data projects, including Hive.

You'll see from here: https://bigtop.apache.org/download.html#releases that there is a repository of Debian packages for Bigtop 3.3 available here: https://dlcdn.apache.org/bigtop/bigtop-3.3.0/repos/

This might be significantly easier than downloading and compiling Hive yourselves.
However, if you would like to build the packages yourself, you might also like to look at this README:
https://github.com/apache/bigtop/blob/master/README.md#for-users-creating-your-own-apache-hadoop-environment

The command to build Hive Debian packages for bullseye from a checkout of Apache Bigtop would be something like:

docker run --rm -u jenkins:jenkins -v `pwd`:/ws --workdir /ws bigtop/slaves:trunk-debian-11 bash -l -c './gradlew allclean ; ./gradlew hive-pkg'

We have some documentation on how we build our packages here: https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Bigtop_Packages but it's probably a bit too specific to our setup for what you need.
In our case, we are stuck on some old versions and need our own fork in order to get them to build for bullseye. However, we're starting to plan seriously for an upgrade in T379385: Upgrade Hadoop to version 3.3.6 and Hive to version 4.0.1.

We also have a bigtop module for puppet, which could potentially be useful for you.
If you have any questions, please do feel free to let us know and we'll try to help.

We aren't running Docker etc. so far, but it was pretty straightforward to automate the fetch/verify/create-deb-package flow from Apache's repository. This is now part of a grand unified Frack packaging scheme, which we also use for things like netboot kernels, RAID tools, etc. As it stands, it is very easy to generate a package for any release that is available upstream.

However, I haven't yet figured out how to automate monitoring for security-related updates. For example for python projects, we wrapped pip-audit so we get a daily notification on vulnerabilities and updates. How are you managing vulnerabilities for java projects?
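
For the curious, that pip-audit wrapper is roughly this shape; this is a simplified sketch, and the addresses and paths are placeholders rather than our real config:

"""Hypothetical daily cron wrapper around pip-audit (simplified sketch)."""
import smtplib
import subprocess
from email.message import EmailMessage

def audit(requirements="requirements.txt"):
    # pip-audit exits non-zero when it finds known vulnerabilities (or errors)
    result = subprocess.run(
        ["pip-audit", "-r", requirements],
        capture_output=True, text=True, check=False)
    if result.returncode != 0:
        msg = EmailMessage()
        msg["Subject"] = f"pip-audit findings for {requirements}"
        msg["From"] = "packaging@example.org"  # placeholder
        msg["To"] = "ops@example.org"          # placeholder
        msg.set_content(result.stdout + result.stderr)
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)

if __name__ == "__main__":
    audit()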

This sounds really interesting. Could you share some details on this grand unified Frack packaging scheme, please @Jgreen ?
We might want to look into whether this could save us work on some of the components that we might not want to get from bigtop.

Sure, with a couple disclaimers:

  • our packaging requirements are light
  • it's very much a work in progress
  • it replaces a legacy shell script approach
  • it's python, written by me, so although it works it is not elegant :-P

There's a shared library, pg/__init__.py, of reusable functions (~650 lines). The types of functions include:

  • fetch metadata (e.g. from the GitHub API or similar)
  • user interface (show what's in our repo already, display available releases and, if possible, changelogs, select which release to package)
  • fetching and verifying (checksum, gpg) files (see the sketch after this list)
  • prep debian packaging files from templates
  • spin up the right pbuilder container and create the package
  • add the package to the frack reprepro repository
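
To give a flavor of those helpers, the checksum verifier is roughly this shape (a simplified sketch, not the exact pg code):

import hashlib

def checksum_verify_file(filename, expected, algo="sha256"):
    """Hash the file and compare to the expected hex digest; raise on mismatch."""
    digest = hashlib.new(algo)
    with open(filename, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected.strip().lower():
        raise ValueError(f"checksum mismatch for {filename} ({algo})")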

Each automated package gets its own library (typically ~30 lines). Technically one library could handle many packages, but so far it has been cleaner to keep them separate. This library has two important functions:

  • available_releases - gathers upstream release metadata however it is available for the package in question
  • fetch_and_build - should probably be called "main" - orchestrates everything

netboot_kernels.py, for example, packages the latest PXE boot kernels for every Debian version we support:

"""
Thing which fetches Debian netboot kernels for addition to frack repo.
"""
import os
import shutil
import tarfile
from datetime import datetime, UTC
import pg

download_files = {
    "netboot.tar.gz": "main/installer-amd64/current/images/netboot/netboot.tar.gz",
    "SHA256SUMS": "main/installer-amd64/current/images/SHA256SUMS",
    "Release": "Release",
    "Release.gpg": "Release.gpg"
}

def prep_source(args):
    """Download netboot tarball, verify file, create orig.tar.gz"""
    build_dir = f"{args.work_dir}/tftpboot"
    pg.gpg_create_keyring(args, ["/etc/apt/trusted.gpg.d/debian-archive-*"])
    shutil.rmtree(build_dir, ignore_errors=True)
    os.mkdir(build_dir)
    for distro in args.distros:
        print(f"Processing {distro}")
        os.makedirs(f"{args.work_dir}/{distro}", exist_ok=True)
        os.chdir(f"{args.work_dir}/{distro}")
        for file, path in download_files.items():
            pg.fetch_file(args, f"{args.project['download_url']}/{distro}/{path}", file)
        pg.gpg_verify_file(args, "Release.gpg", "Release")
        checksum = pg.extract_checksum_from_file("Release", download_files["SHA256SUMS"])
        pg.checksum_verify_file("SHA256SUMS", checksum)
        checksum = pg.extract_checksum_from_file("SHA256SUMS", "./netboot/netboot.tar.gz")
        pg.checksum_verify_file("netboot.tar.gz", checksum)
        shutil.rmtree("debian-installer", ignore_errors=True)
        with tarfile.open("netboot.tar.gz", "r") as tar_fh:
            tar_fh.extract("./debian-installer/amd64/initrd.gz")
            tar_fh.extract("./debian-installer/amd64/linux")
        os.rename("debian-installer/amd64/initrd.gz", f"{build_dir}/initrd-{distro}-amd64")
        os.rename("debian-installer/amd64/linux", f"{build_dir}/linux-{distro}-amd64")
        shutil.rmtree(f"{args.work_dir}/{distro}", ignore_errors=True)
    os.chdir(args.work_dir)
    pg.prep_orig_tarball(args, "tftpboot")


def fetch_and_build(args):
    """Do all the things."""
    pg.check_debian_templates(args)
    gmt_time = datetime.now(UTC).replace(microsecond=0)
    args.release = gmt_time.strftime("%Y%m%d%H%M%S")
    args = pg.get_orig_filename(args)
    args = pg.get_job_paths(args)
    pg.print_build_summary(args)
    response = input("\033[1mBuild and import this package? (y/[n])\033[0m: ")
    if response == "y":
        prep_source(args)
        pg.create_and_import_package(args)

trino.py packages trino-server:

"""
Thing which fetches Trino for addition to frack repo.

https://repo1.maven.org/maven2/io/trino/trino-server
https://mvnrepository.com/artifact/io.trino/trino-main
https://github.com/trinodb/trino/blob/master/.github/DEVELOPMENT.md

FIXME: Figure out where/how to get changelogs.  The only place I have found
release notes is as follows, but it's an elaborate HTML page per release thus
awkward to work with: https://trino.io/docs/current/release/release-452.html
"""
import os
import xml.etree.ElementTree as ET
import pg

def available_releases(args):
    """Lookup what's available at the download server and populate args.available_releases."""
    args.available_releases = {}
    content = pg.fetch_webpage(args, f"{args.project['download_url']}/maven-metadata.xml")
    root = ET.fromstring(content)
    for child in root.findall("./versioning/versions/version"):
        args.available_releases[child.text] = {
            "deb_release": child.text,
            "source_file": f"trino-server-{child.text}.tar.gz",
            "body": f"https://trino.io/docs/current/release/release-{child.text}.html",
        }
    return args


def prep_source(args):
    """Download source file and verify it."""
    os.chdir(args.work_dir)
    source_url = f"{args.project['download_url']}/{args.release}/{args.source_file}"
    pg.fetch_file(args, source_url, args.orig_file)
    pg.fetch_file(args, f"{source_url}.asc", f"{args.orig_file}.asc")
    pg.fetch_file(args, f"{source_url}.sha1", f"{args.orig_file}.sha1")
    pg.gpg_check_signature(args, f"{args.orig_file}.asc", args.orig_file)
    checksum = pg.extract_checksum_from_file(f"{args.orig_file}.sha1", None, "sha1")
    pg.checksum_verify_file(args.orig_file, checksum, "sha1")


def fetch_and_build(args):
    """Do all the things."""
    pg.check_debian_templates(args)
    args = available_releases(args)
    args = pg.select_release(args)
    prep_source(args)
    pg.create_and_import_package(args)

So far there's just one executable, named build_package.

usage: build_package [-h] [-d {bookworm,bullseye,buster}] [-r RELEASE] [-t TAG] [-v VERSION] [--debug] package

Package Getter

options:
  -h, --help            show this help message and exit
  -d {bookworm,bullseye,buster}, --distro {bookworm,bullseye,buster}
                        Debian distro (default=bookworm)
  -r RELEASE, --release RELEASE
                        Package release
  -t TAG, --tag TAG     Upstream release tag
  -v VERSION, --version VERSION
                        Package version (default=1)
  --debug               Debug output

package:
  package               hadoop, hive, hive-standalone-metastore, metabase, minio, netboot-kernels, trino-server, etc.

Finally, in frack we have a local "packages" repository which contains the metadata required for each package we build. This is where we keep templates for e.g. {package}.dsc, and the debian/* directory for changelogs, patches, etc. It is typically checked out to the user's homedir.

So there are quite a few moving parts, and integrating a new package/project takes some work. But once you're set up, you can run e.g. "build_package trino-server" and you'll see what's in our repositories already, what's available upstream, and (hopefully) changelogs for more recent releases. You select which release you want to package and the rest is automated.

There are some places where this doesn't work particularly well:

  • If upstream changelogs can't easily be accessed and reformatted as plain text, showing changelogs isn't feasible.
  • If you need to patch before you package, that would go in the "packages" repository mentioned above. But you would probably need different patches and metadata for each release. This framework supports release/version-specific metadata, but packaging a new upstream release obviously still takes manual effort.
  • I hope to build some kind of changelog notification for when a new package version becomes available upstream, but so far it seems like there's too much variation in how different projects publish release information to do this well.