
Add support for Iceberg in Spark
Closed, ResolvedPublic3 Estimated Story Points

Description

Adding support for Iceberg in Spark 3.1.x is as simple as:

spark3-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:1.2.1 \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
    --conf spark.sql.catalog.spark_catalog.type=hive

The above jar, iceberg-spark-runtime-3.1_2.12:1.2.1, provides the runtime dependencies for Iceberg 1.2.1 to be used in the Spark 3.1.x series, compiled against Scala 2.12.

However, we'd like the runtime dependency and configuration to be available automatically so that users don't need to think about it.

In this task we should make it so.
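
For illustration, once that is in place a plain invocation should pick up Iceberg with no extra flags (a sketch, assuming the spark3-sql wrapper forwards standard spark-sql options):

spark3-sql -e "SET spark.sql.extensions"   # should print org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions with no --packages/--conf flags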


Event Timeline

xcollazo changed the task status from Open to In Progress. May 2 2023, 12:57 AM
xcollazo set the point value for this task to 3.
xcollazo edited projects, added Data Pipelines (Sprint 12); removed Data Pipelines.
xcollazo moved this task from Next Up to In Progress on the Data Pipelines (Sprint 12) board.

Change 914928 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[operations/puppet@production] Add configs to spark-defaults.conf to enable Iceberg.

https://gerrit.wikimedia.org/r/914928

Next steps:

  1. We need an SRE to publish and manually deploy the conda-analytics 0.0.13 debian package on a test client node.
  2. Then @xcollazo can verify (1) works as expected.
  3. Assuming (2) is successful, SRE to deploy that package to all cluster nodes. The only change is the addition of the iceberg jar, so it is low risk.
  4. After (3), and after review of config changes at https://gerrit.wikimedia.org/r/914928, an SRE can puppet deploy the config changes. This should also be low risk given the new configs only affect iceberg tables.
  5. @xcollazo to verify (4) works as expected.

@BTullis or @Stevemunene can one of you please help with step (1) above? IIRC, we have done this before by disabling puppet on a test client node, and then manually deploying the debian package.
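
For reference, a rough sketch of that manual procedure on the test node (commands are illustrative, not the exact runbook):

sudo puppet agent --disable "testing conda-analytics 0.0.13 (T335721)"
sudo dpkg -i conda-analytics-0.0.13_amd64.deb
dpkg -l conda-analytics    # should now report 0.0.13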

Done with (1) and (2). Next step is for @Stevemunene to deploy the debian package cluster-wide (step (3)). IIRC, we need to deploy it to both the bullseye and the buster repos.


As per the steps outlined above, starting on roll out of version 0.0.13 of the conda-analytics deb with iceberg support.

On apt1001, confirm the package details we are about to upgrade:

stevemunene@apt1001:~$ sudo -i reprepro ls conda-analytics
conda-analytics | 0.0.12 |   buster-wikimedia | amd64
conda-analytics | 0.0.12 | bullseye-wikimedia | amd64

The available version is 0.0.12; we are upgrading to 0.0.13.
Get the artifact from GitLab:

stevemunene@apt1001:~$ wget -O conda-analytics-0.0.13_amd64.deb https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/package_files/1252/download
--2023-05-10 14:50:35--  https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/package_files/1252/download
Resolving gitlab.wikimedia.org (gitlab.wikimedia.org)... 2620:0:861:2:208:80:154:145, 208.80.154.145
Connecting to gitlab.wikimedia.org (gitlab.wikimedia.org)|2620:0:861:2:208:80:154:145|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 950654444 (907M) [application/octet-stream]
Saving to: ‘conda-analytics-0.0.13_amd64.deb’

conda-analytics-0.0.13_amd64.de 100%[====================================================>] 906.61M   102MB/s    in 8.6s    

2023-05-10 14:50:44 (106 MB/s) - ‘conda-analytics-0.0.13_amd64.deb’ saved [950654444/950654444]

Upgrade the existing package with reprepro and verify.

stevemunene@apt1001:~$  sudo -i reprepro includedeb buster-wikimedia /home/stevemunene/conda-analytics-0.0.13_amd64.deb 
Exporting indices...
stevemunene@apt1001:~$  sudo -i reprepro includedeb bullseye-wikimedia /home/stevemunene/conda-analytics-0.0.13_amd64.deb 
Exporting indices...
Deleting files no longer referenced...
stevemunene@apt1001:~$ sudo -i reprepro ls conda-analytics
conda-analytics | 0.0.13 |   buster-wikimedia | amd64
conda-analytics | 0.0.13 | bullseye-wikimedia | amd64
stevemunene@apt1001:~$

Mentioned in SAL (#wikimedia-analytics) [2023-05-10T16:50:44Z] <stevemunene> deploy conda-analytics v0.0.13 T335721

Generate a debdeploy spec.

stevemunene@cumin1001:~$ generate-debdeploy-spec
Please enter the name of source package (e.g. openssl). type '' or 'quit' to abort
>conda-analytics
Enter an optional comment, e.g. a reference to a security advisory or a CVE ID mapping
>T335721

tool           -> The updated packages is an enduser tool, can be
                  rolled-out immediately.
daemon-direct  -> Daemons which are restarted during update, but which
                  do no affect existing users.
daemon-disrupt -> Daemons which are restarted during update, where the
                  users notice an impact. The update procedure is almost
                  identical, but displays additional warnings
library        -> After a library is updated, programs may need to be
                  restarted to fully effect the change. In addition
                  to libs, some applications may also fall under this rule,
                  e.g. when updating QEMU, you might need to restart VMs.

Please enter the update type:
>tool
Please enter the version of conda-analytics fixed in bullseye. Leave blank if no fix is available/required for bullseye.
>0.0.13
Please enter the version of conda-analytics fixed in buster. Leave blank if no fix is available/required for buster.
>0.0.13
Please enter the version of conda-analytics fixed in stretch. Leave blank if no fix is available/required for stretch.
>

Usually every upgrade only modifies existing package names. There are rare exceptions
e.g. if a rebase to a new upstream release is necessary.

Enter an optional comma-separated list of binary package names
which are being switched to a new name.
Leave blank to skip
>
Spec file created as 2023-05-10-conda-analytics.yaml
stevemunene@cumin1001:~$ ls
2023-05-10-conda-analytics.yaml
stevemunene@cumin1001:~$

Use debdeploy to roll out the packages to all conda_analytics hosts:

stevemunene@cumin1001:~$ sudo debdeploy deploy -u 2023-05-10-conda-analytics.yaml -Q P:analytics::conda_analytics
Rolling out conda-analytics:
Non-daemon update, no service restart needed

    
conda-analytics was updated: 0.0.12 -> 0.0.13
  an-airflow[1002-1005].eqiad.wmnet,an-coord[1001-1002].eqiad.wmnet,an-launcher1002.eqiad.wmnet,an-test-coord1001.eqiad.wmnet,an-test-worker[1001-1003].eqiad.wmnet,an-worker[1078-1148].eqiad.wmnet,analytics[1058-1077].eqiad.wmnet,stat[1004-1008].eqiad.wmnet (107 hosts)

These hosts are already up-to-date:
  an-test-client[1001-1002].eqiad.wmnet (2 hosts)

stevemunene@cumin1001:~$

And step 3 is done.
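
A quick spot check on any of the hosts above should now show the new version (illustrative):

dpkg -s conda-analytics | grep '^Version:'
Version: 0.0.13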

Done with (1), (2) and (3). Now @BTullis to deploy the new spark-assembly.jar via https://gerrit.wikimedia.org/r/c/operations/puppet/+/901670.


Now @BTullis to deploy the new spark-assembly.jar via https://gerrit.wikimedia.org/r/c/operations/puppet/+/901670

It turns out that there are some problems with that patch, so we need some further thinking about how we install the assembly.

For now, I think the simplest thing is if we run the script that you mentioned manually.

I've been discussing with @Antoine_Quhen how we could work to automate this in future, but we haven't got a clear way forward yet. Antoine would prefer that we didn't make the conda-analytics package even larger (~230 MB larger) by adding the assembly file into it.

So I can have a look at running that script later today if possible (unless any other SRE wishes to do so before I get a chance) and we can work on longer term improvements to the automation later.

It turns out that there are some problems with that patch, so we need some further thinking about how we install the assembly.

Fair enough. Let's pursue that fix separately, will open a ticket.

For now, I think the simplest thing is if we run the script that you mentioned manually.

So I can have a look at running that script later today if possible (unless any other SRE wishes to do so before I get a chance)

I just ran it. I have hdfs sudo privileges. So we are good!

Also have some fixes for the script here: https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/merge_requests/19.
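
For completeness, one way to confirm the assembly landed on HDFS (the target path below is an assumption; the script defines the real one):

sudo -u hdfs hdfs dfs -ls /user/spark/share/lib/    # hypothetical path; look for the freshly uploaded spark assembly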

Done with (1), (2), (3) and (3A).

Next step is (4): Deploy https://gerrit.wikimedia.org/r/914928.

  1. We need an SRE to publish and manually deploy the conda-analytics 0.0.13 debian package on a test client node.
  2. Then @xcollazo can verify (1) works as expected.
  3. Assuming (2) is successful, SRE to deploy that package to all cluster nodes. The only change is the addition of the iceberg jar, so it is low risk.
    • (3A) Adding an extra step I had forgotten: we now need to deploy the spark-assembly.jar to HDFS. (Note: @BTullis and @Antoine_Quhen to figure out long-term automation for this on T336513.)
  4. After (3), and after review of config changes at https://gerrit.wikimedia.org/r/914928, an SRE can puppet deploy the config changes. This should also be low risk given the new configs only affect iceberg tables.
  5. @xcollazo to verify (4) works as expected.

Change 914928 merged by Btullis:

[operations/puppet@production] Add configs to spark-defaults.conf to enable Iceberg.

https://gerrit.wikimedia.org/r/914928

I have now completed step (4) above, so the changes are being rolled out everywhere now.
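
A simple sanity check on any client once puppet has run (the conf path is an assumption about where the Spark 3 defaults live):

grep iceberg /etc/spark3/conf/spark-defaults.conf   # expect the spark.sql.extensions and spark.sql.catalog.spark_catalog lines from r/914928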

Done with all steps! 🎉 Thank you @Stevemunene and @BTullis for your support!

I just verified that:
(a) Spark 2.X jobs keep running fine as per https://yarn.wikimedia.org/
(b) Spark 3.X jobs include the iceberg conf files and are running fine as per https://yarn.wikimedia.org/
(c) Ran some sanity tests in a Jupyter notebook with wmfdata:

import wmfdata
spark = wmfdata.spark.create_session(type='yarn-large')

configs = spark.sparkContext.getConf().getAll()
for c in configs:
    if 'iceberg' in c[1]:
        print(c)
('spark.sql.catalog.spark_catalog', 'org.apache.iceberg.spark.SparkSessionCatalog')
('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')

df = spark.sql("select count(1) from referrer_daily_iceberg")
df.show()
[Stage 0:>                                                          (0 + 1) / 1]
+--------+
|count(1)|
+--------+
|    9440|
+--------+

df = spark.sql("select day, os_family, count(num_referrals) from referrer_daily_iceberg group by day, os_family")
df.show()
+----------+-------------+--------------------+
|       day|    os_family|count(num_referrals)|
+----------+-------------+--------------------+
|<redacted>|   <redacted>|          <redacted>|
...
+----------+-------------+--------------------+