
Add support for Iceberg in Spark
Closed, ResolvedPublic3 Estimated Story Points

Description

Adding support for Iceberg in Spark 3.1.x is as simple as:

spark3-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:1.2.1 \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
    --conf spark.sql.catalog.spark_catalog.type=hive

The above jar, iceberg-spark-runtime-3.1_2.12:1.2.1, provides the runtime dependencies for Iceberg 1.2.1 to be used in the Spark 3.1.x series, compiled against Scala 2.12.

However, we'd like the runtime dependency and configuration to be available automatically so that users don't need to think about it.

In this task we should make it so.
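
For illustration, once that is in place a plain invocation should pick up Iceberg with no extra flags (a sketch, assuming the spark3-sql wrapper forwards standard spark-sql options):

spark3-sql -e "SET spark.sql.extensions"   # should print org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions with no --packages/--conf flags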


Event Timeline

xcollazo changed the task status from Open to In Progress. May 2 2023, 12:57 AM
xcollazo set the point value for this task to 3.
xcollazo edited projects, added Data Pipelines (Sprint 12); removed Data Pipelines.
xcollazo moved this task from Next Up to In Progress on the Data Pipelines (Sprint 12) board.

Change 914928 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[operations/puppet@production] Add configs to spark-defaults.conf to enable Iceberg.

https://gerrit.wikimedia.org/r/914928

Next steps:

  1. We need an SRE to publish and manually deploy the conda-analytics 0.0.13 debian package on a test client node.
  2. Then @xcollazo can verify (1) works as expected.
  3. Assuming (2) is successful, SRE to deploy that package to all cluster nodes. The only change is the addition of the iceberg jar, so it is low risk.
  4. After (3), and after review of config changes at https://gerrit.wikimedia.org/r/914928, an SRE can puppet deploy the config changes. This should also be low risk given the new configs only affect iceberg tables.
  5. @xcollazo to verify (4) works as expected.

@BTullis or @Stevemunene can one of you please help with step (1) above? IIRC, we have done this before by disabling puppet on a test client node, and then manually deploying the debian package.
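
For reference, a rough sketch of that manual procedure on the test node (commands are illustrative, not the exact runbook):

sudo puppet agent --disable "testing conda-analytics 0.0.13 (T335721)"
sudo dpkg -i conda-analytics-0.0.13_amd64.deb
dpkg -l conda-analytics    # should now report 0.0.13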

Done with (1) and (2). Next step is for @Stevemunene to deploy the debian package cluster-wide (step (3)). IIRC, we need to deploy it to both the bullseye and the buster repos.


As per the steps outlined above, starting on roll out of version 0.0.13 of the conda-analytics deb with iceberg support.

On apt1001, confirm the package details we are about to upgrade:

stevemunene@apt1001:~$ sudo -i reprepro ls conda-analytics
conda-analytics | 0.0.12 |   buster-wikimedia | amd64
conda-analytics | 0.0.12 | bullseye-wikimedia | amd64

The available version is 0.0.12; we are upgrading to 0.0.13.
Get the artifact from GitLab:

stevemunene@apt1001:~$ wget -O conda-analytics-0.0.13_amd64.deb https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/package_files/1252/download
--2023-05-10 14:50:35--  https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/package_files/1252/download
Resolving gitlab.wikimedia.org (gitlab.wikimedia.org)... 2620:0:861:2:208:80:154:145, 208.80.154.145
Connecting to gitlab.wikimedia.org (gitlab.wikimedia.org)|2620:0:861:2:208:80:154:145|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 950654444 (907M) [application/octet-stream]
Saving to: ‘conda-analytics-0.0.13_amd64.deb’

conda-analytics-0.0.13_amd64.de 100%[====================================================>] 906.61M   102MB/s    in 8.6s    

2023-05-10 14:50:44 (106 MB/s) - ‘conda-analytics-0.0.13_amd64.deb’ saved [950654444/950654444]

Upgrade the existing package with reprepro and verify.

stevemunene@apt1001:~$  sudo -i reprepro includedeb buster-wikimedia /home/stevemunene/conda-analytics-0.0.13_amd64.deb 
Exporting indices...
stevemunene@apt1001:~$  sudo -i reprepro includedeb bullseye-wikimedia /home/stevemunene/conda-analytics-0.0.13_amd64.deb 
Exporting indices...
Deleting files no longer referenced...
stevemunene@apt1001:~$ sudo -i reprepro ls conda-analytics
conda-analytics | 0.0.13 |   buster-wikimedia | amd64
conda-analytics | 0.0.13 | bullseye-wikimedia | amd64
stevemunene@apt1001:~$

Mentioned in SAL (#wikimedia-analytics) [2023-05-10T16:50:44Z] <stevemunene> deploy conda-analytics v0.0.13 T335721

Generate a debdeploy spec.

stevemunene@cumin1001:~$ generate-debdeploy-spec
Please enter the name of source package (e.g. openssl). type '' or 'quit' to abort
>conda-analytics
Enter an optional comment, e.g. a reference to a security advisory or a CVE ID mapping
>T335721

tool           -> The updated packages is an enduser tool, can be
                  rolled-out immediately.
daemon-direct  -> Daemons which are restarted during update, but which
                  do no affect existing users.
daemon-disrupt -> Daemons which are restarted during update, where the
                  users notice an impact. The update procedure is almost
                  identical, but displays additional warnings
library        -> After a library is updated, programs may need to be
                  restarted to fully effect the change. In addition
                  to libs, some applications may also fall under this rule,
                  e.g. when updating QEMU, you might need to restart VMs.

Please enter the update type:
>tool
Please enter the version of conda-analytics fixed in bullseye. Leave blank if no fix is available/required for bullseye.
>0.0.13
Please enter the version of conda-analytics fixed in buster. Leave blank if no fix is available/required for buster.
>0.0.13
Please enter the version of conda-analytics fixed in stretch. Leave blank if no fix is available/required for stretch.
>

Usually every upgrade only modifies existing package names. There are rare exceptions
e.g. if a rebase to a new upstream release is necessary.

Enter an optional comma-separated list of binary package names
which are being switched to a new name.
Leave blank to skip
>
Spec file created as 2023-05-10-conda-analytics.yaml
stevemunene@cumin1001:~$ ls
2023-05-10-conda-analytics.yaml
stevemunene@cumin1001:~$

Use debdeploy to roll out the packages to all conda_analytics hosts:

stevemunene@cumin1001:~$ sudo debdeploy deploy -u 2023-05-10-conda-analytics.yaml -Q P:analytics::conda_analytics
Rolling out conda-analytics:
Non-daemon update, no service restart needed

    
conda-analytics was updated: 0.0.12 -> 0.0.13
  an-airflow[1002-1005].eqiad.wmnet,an-coord[1001-1002].eqiad.wmnet,an-launcher1002.eqiad.wmnet,an-test-coord1001.eqiad.wmnet,an-test-worker[1001-1003].eqiad.wmnet,an-worker[1078-1148].eqiad.wmnet,analytics[1058-1077].eqiad.wmnet,stat[1004-1008].eqiad.wmnet (107 hosts)

These hosts are already up-to-date:
  an-test-client[1001-1002].eqiad.wmnet (2 hosts)

stevemunene@cumin1001:~$

And step 3 is done.
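
A quick spot check on any of the hosts above should now show the new version (illustrative):

dpkg -s conda-analytics | grep '^Version:'
Version: 0.0.13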

Done with (1), (2) and (3). Now @BTullis to deploy the new spark-assembly.jar via https://gerrit.wikimedia.org/r/c/operations/puppet/+/901670.


Now @BTullis to deploy the new spark-assembly.jar via https://gerrit.wikimedia.org/r/c/operations/puppet/+/901670

It turns out that there are some problems with that patch, so we need some further thinking about how we install the assembly.

For now, I think the simplest thing is if we run the script that you mentioned manually.

I've been discussing with @Antoine_Quhen how we could work to automate this in future, but we haven't got a clear way forward yet. Antoine would prefer that we didn't make the conda-analytics package even larger (~230 MB larger) by adding the assembly file into it.

So I can have a look at running that script later today if possible (unless any other SRE wishes to do so before I get a chance) and we can work on longer term improvements to the automation later.

It turns out that there are some problems with that patch, so we need some further thinking about how we install the assembly.

Fair enough. Let's pursue that fix separately, will open a ticket.

For now, I think the simplest thing is if we run the script that you mentioned manually.

So I can have a look at running that script later today if possible (unless any other SRE wishes to do so before I get a chance)

I just ran it. I have hdfs sudo privileges. So we are good!

Also have some fixes for the script here: https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/merge_requests/19.
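
For completeness, one way to confirm the assembly landed on HDFS (the target path below is an assumption; the script defines the real one):

sudo -u hdfs hdfs dfs -ls /user/spark/share/lib/    # hypothetical path; look for the freshly uploaded spark assembly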

Done with (1), (2), (3) and (3A).

Next step is (4): Deploy https://gerrit.wikimedia.org/r/914928.

  1. We need an SRE to publish and manually deploy the conda-analytics 0.0.13 debian package on a test client node.
  2. Then @xcollazo can verify (1) works as expected.
  3. Assuming (2) is successful, SRE to deploy that package to all cluster nodes. The only change is the addition of the iceberg jar, so it is low risk.
    • (3A) Adding an extra step I had forgotten: we now need to deploy the spark-assembly.jar to HDFS. (Note: @BTullis and @Antoine_Quhen to figure out long-term automation for this on T336513.)
  4. After (3), and after review of config changes at https://gerrit.wikimedia.org/r/914928, an SRE can puppet deploy the config changes. This should also be low risk given the new configs only affect iceberg tables.
  5. @xcollazo to verify (4) works as expected.

Change 914928 merged by Btullis:

[operations/puppet@production] Add configs to spark-defaults.conf to enable Iceberg.

https://gerrit.wikimedia.org/r/914928

I have now completed step (4) above, so the changes are being rolled out everywhere now.
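
A simple sanity check on any client once puppet has run (the conf path is an assumption about where the Spark 3 defaults live):

grep iceberg /etc/spark3/conf/spark-defaults.conf   # expect the spark.sql.extensions and spark.sql.catalog.spark_catalog lines from r/914928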

Done with all steps! 🎉 Thank you @Stevemunene and @BTullis for your support!

I just verified that:
(a) Spark 2.X jobs keep running fine as per https://yarn.wikimedia.org/
(b) Spark 3.X jobs include the iceberg conf files and are running fine as per https://yarn.wikimedia.org/
(c) Ran some sanity tests in a Jupyter notebook with wmfdata:

import wmfdata
spark = wmfdata.spark.create_session(type='yarn-large')

configs = spark.sparkContext.getConf().getAll()
for c in configs:
    if 'iceberg' in c[1]:
        print(c)
('spark.sql.catalog.spark_catalog', 'org.apache.iceberg.spark.SparkSessionCatalog')
('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')

df = spark.sql("select count(1) from referrer_daily_iceberg")
df.show()
[Stage 0:>                                                          (0 + 1) / 1]
+--------+
|count(1)|
+--------+
|    9440|
+--------+

df = spark.sql("select day, os_family, count(num_referrals) from referrer_daily_iceberg group by day, os_family")
df.show()
+----------+-------------+--------------------+
|       day|    os_family|count(num_referrals)|
+----------+-------------+--------------------+
|<redacted>|   <redacted>|          <redacted>|
...
+----------+-------------+--------------------+