
[Iceberg] Debianize and install iceberg support for Spark, Presto, and optionally Hive
Closed, Resolved | Public | 5 Estimated Story Points

Description

User Story

As a Data Engineering SRE, I need to set up and install Apache Iceberg on the Analytics cluster
Why?

So that the team can begin migrating datasets to Iceberg

Success Criteria

  • Apache Iceberg is installed and the team can begin migration work

We want to use this as an opportunity to discourage users from using the Hive MapReduce engine. We have chosen not to support Iceberg in Hive MR, so users must use Spark or Presto to query Iceberg tables.

Event Timeline

Ottomata renamed this task from [Iceburg] Debianize and install Spark, Hive metastore, Presto connectors to [Iceb3rg] Debianize and install iceberg support for Spark, Presto, and optionally Hive. Jun 30 2022, 3:53 PM
Ottomata renamed this task from [Iceb3rg] Debianize and install iceberg support for Spark, Presto, and optionally Hive to [Iceberg] Debianize and install iceberg support for Spark, Presto, and optionally Hive.
Ottomata updated the task description.
BTullis added subscribers: JAllemandou, Ottomata, BTullis.

I'm very happy to work on this task and to try to help with the Iceberg effort, but it would be useful for me to have a bit more context as to what we believe needs to be done.

The description above doesn't give me much to work with at the moment. I've added @JAllemandou and @Ottomata who can hopefully point me in the right direction.

I see from T311525 (Upgrade to latest PrestoDB and enable iceberg support) that we already have updated presto debs (0.273.3) and that they are already installed on the production presto servers.

btullis@an-presto1001:~$ apt-cache policy presto-server 
presto-server:
  Installed: 0.273.3-1
  Candidate: 0.273.3-1
  Version table:
 *** 0.273.3-1 1001
       1001 http://apt.wikimedia.org/wikimedia buster-wikimedia/main amd64 Packages
        100 /var/lib/dpkg/status

We can see that the analytics_iceberg catalog has been defined on the production cluster:

btullis@an-presto1001:~$ ls -l /etc/presto/catalog/
total 8
-r--r--r-- 1 presto presto 966 Nov 30 16:43 analytics_hive.properties
-r--r--r-- 1 presto presto 844 Jul  6  2022 analytics_iceberg.properties

...so I can do the following from a stats box.

btullis@stat1004:~$ presto --catalog analytics_iceberg
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
presto>

Does that mean that the presto part of this ticket is complete?

Should I be looking into adding iceberg support for Spark3 via conda-analytics?

Presto is complete. If you can do Iceberg for Spark 3 via conda-analytics, I think that's fine.

https://iceberg.apache.org/docs/latest/spark-configuration/

Probably just need some iceberg + spark .jar(s) and ^ spark configs.
Maybe https://central.sonatype.dev/artifact/org.apache.iceberg/iceberg-spark-3.1_2.12/1.1.0 ?
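As a rough sketch of what "some .jar(s) and spark configs" might amount to, the snippet below builds the Spark properties described in the Iceberg Spark configuration docs linked above and renders them as spark-defaults.conf lines. The jar path and the helper names (`iceberg_spark_conf`, `to_spark_defaults`) are hypothetical and only for illustration; nothing in this task has settled where the jars would live, and the runtime artifact and version shown are an assumption based on the link above.

```python
# Sketch only: Spark properties that enable Iceberg through the Hive metastore.
# The property names and class names come from the Iceberg Spark docs;
# the jar location below is a hypothetical placeholder, not an agreed path.

ICEBERG_JAR = "/usr/lib/iceberg/iceberg-spark-runtime-3.1_2.12-1.1.0.jar"  # hypothetical


def iceberg_spark_conf(jar_path=ICEBERG_JAR):
    """Return Spark properties that load the Iceberg runtime jar and wrap
    the built-in session catalog so Iceberg tables resolve via Hive."""
    return {
        "spark.jars": jar_path,
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
        "spark.sql.catalog.spark_catalog": (
            "org.apache.iceberg.spark.SparkSessionCatalog"
        ),
        "spark.sql.catalog.spark_catalog.type": "hive",
    }


def to_spark_defaults(conf):
    """Render properties as spark-defaults.conf lines: key, space, value."""
    return "\n".join(f"{key} {value}" for key, value in sorted(conf.items()))


if __name__ == "__main__":
    print(to_spark_defaults(iceberg_spark_conf()))
```

The same key/value pairs could instead be passed as `--conf key=value` arguments to `spark3-submit` or set on a session builder; which route fits best here is still open.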

> Presto is complete. If you can do Iceberg for Spark 3 via conda-analytics

Might be easier to just make an iceberg-spark .deb to install the .jars somewhere, and then use puppet to configure the spark conf to use them.

xcollazo changed the task status from Open to In Progress. May 2 2023, 12:47 AM
xcollazo claimed this task.
xcollazo updated the task description.