
Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0
Open, Medium, Public

Description

In our Iceberg Working Session we ran out of time before discussing bumping Spark, however there was async support for it.

Our current production version of Spark, 3.1, is ‘deprecated’ on Iceberg's support matrix, and there are talks of dropping support. Update: support has been dropped as of Iceberg 1.4.0.

Options:
a) The Spark community released 3.4.0 on April 13, 2023, and Iceberg just released version 1.3.0 with support for Spark 3.4. This is the bleeding edge, and as with any .0 feature release there is a risk of bugs in both Spark and Iceberg. We would have to bump Iceberg as well, but we do win the longest runway. Update: Spark 3.4.1 is now available. Second update: Spark 3.5.0 is also now available.
b) The Spark community released 3.3.2 on Feb 17, 2023. Iceberg has supported Spark 3.3 since 0.14.0. We already have Iceberg 1.2.1, which supports Spark 3.3, and 3.3.2 is stable and well tested by now. We get a relatively shorter runway with this.

Whether we bump to the 3.3, 3.4, or 3.5 line, we win a bunch of performance improvements that will pair well with T332765.

Migration guides:
https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-31-to-32
https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-32-to-33
https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-33-to-34
https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-34-to-35

Considering the migration guides do include breaking changes to syntax like ADD JAR and to CSV output defaults (I originally thought there were none), it does seem like we should consider having the new Spark version available alongside the current version for a while. Perhaps by making it available as spark3_4-submit, etc.?
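One low-friction way to do that (purely a sketch; the install paths and launcher names below are assumptions, not our actual layout) would be to expose the second version under suffixed launcher names:

```shell
# Hypothetical sketch: expose Spark 3.4 alongside the existing 3.1 install
# under suffixed launcher names. Install paths are illustrative assumptions.
ln -s /opt/spark-3.4.1/bin/spark-submit /usr/local/bin/spark3_4-submit
ln -s /opt/spark-3.4.1/bin/spark-shell  /usr/local/bin/spark3_4-shell
ln -s /opt/spark-3.4.1/bin/spark-sql    /usr/local/bin/spark3_4-sql
```

This keeps the default `spark-submit` pointing at 3.1 until jobs have been migrated.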

In this task we should:

  • Decide whether to bump to Spark 3.3.X, 3.4.X, or 3.5.X line.
  • Decide whether to remove current Spark 3.1.2, or to have it available at the same time for a while.
  • Install it on test cluster. Do sanity tests.
  • Install it on main cluster.

Event Timeline

I've just opened T344266 with a significant reason why we should prioritize this work. I wanted to track it separately from this one since the rationale is different.

CC @lbowmaker for your consideration.

Copying rationale to move forward with this work from T344266:

While iterating on an Apache Iceberg MERGE INTO on T340861, we hit T342587, in which the MERGE job generates ~55000 small files.

The old trick of adding a COALESCE hint did not fix the small files generated by the MERGE INTO. This is because MERGE generates a custom query plan specific to Iceberg. The COALESCE is added, but not at the right node. See T340861#9093603 for query plan details.

We can work around this with Iceberg's rewrite_data_files(), but it is annoying and basically makes us write the data twice (once with ~55K files, and again compacting it into 2 or 3 files!). It turns out this is a known issue with Iceberg's MERGE INTO that has been solved, but only on Spark 3.2+. See https://github.com/apache/iceberg/pull/6828. TLDR: This is fixed in Iceberg's support for Spark 3.3, and backported to Iceberg's support for Spark 3.2, but not Spark 3.1. Note that we do not need to bump Iceberg to pick up these fixes, just Spark.
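For reference, the compaction workaround looks roughly like this (a sketch only; the catalog and table names are placeholders, and the target file size is an illustrative value):

```shell
# Hypothetical sketch of the double-write workaround: after MERGE INTO leaves
# ~55K small files, compact them with Iceberg's rewrite_data_files procedure.
# Catalog/table names and the 512 MiB target size are illustrative.
spark-sql -e "
CALL my_catalog.system.rewrite_data_files(
  table => 'my_db.my_table',
  options => map('target-file-size-bytes', '536870912')
)"
```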

Although this is not a blocker, I believe it is a compelling reason to upgrade Spark.

xcollazo renamed this task from Upgrade Spark to a version that has long term Iceberg support to Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0. (Aug 15 2023, 8:13 PM)

Just for reference, I checked to make sure that both pyspark 3.3.2 and 3.4.1 are available via conda-forge, since this is where we currently get our pyspark version.
Happily, it seems that they are: https://anaconda.org/conda-forge/pyspark/files
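(For anyone repeating this check from the CLI, something like the following should work, assuming a conda install is on the path:)

```shell
# List the pyspark builds that conda-forge currently publishes:
conda search -c conda-forge 'pyspark>=3.3'
```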


We would need to update both the spark docker image version and the version of pyspark in conda-analytics to the same thing.

It might be pretty tricky having both versions of spark available in the same conda environment though. I'll have a think about how we might do this, but having one version would definitely be simpler.

> We would need to update both the spark docker image version

Funny, I thought the docker image was running 3.3.x?

> It might be pretty tricky having both versions of spark available in the same conda environment though

Ah, yes, I had forgotten conda-analytics... that certainly makes it harder to have multiple versions available.

I had a chat with @JAllemandou, and he recommends I do an experiment to confirm whether bumping Spark will solve my issue. I agree that is a good exercise. So I will try to see if I can launch Spark 3.3 or 3.4 independently of the installed version. Should be fun!

>> We would need to update both the spark docker image version
>
> Funny, I thought the docker image was running 3.3.x?

You're right, the spark-operator is running spark 3.3.0, but all I had to do was build an older version of the image so that we could copy the yarn shuffler into conda-analytics.

>> It might be pretty tricky having both versions of spark available in the same conda environment though
>
> Ah, yes, I had forgotten conda-analytics... that certainly makes it harder to have multiple versions available.
>
> I had a chat with @JAllemandou, and he recommends I do an experiment to confirm whether bumping Spark will solve my issue. I agree that is a good exercise. So I will try to see if I can launch Spark 3.3 or 3.4 independently of the installed version. Should be fun!

OK, sounds good. Remember, when you are testing, that the yarn shuffler is at version 3.1.2, so this might complicate the bug-hunting even further. I'll wait to hear about the results of your tests before taking any further action on this ticket.

Coming back here to report.

TLDR: Spark 3.3+ solves my small files problem with MERGE INTO.

Longer story over at T340861#9101939.

> Remember, when you are testing, that the yarn shuffler is at version 3.1.2, so this might complicate the bug-hunting even further.

Indeed, Spark 3.3 is not compatible with the 3.1.2 shuffler (details in the links above). I was under the impression that they had a stable interface between executors and the shuffle service, but I guess not? I could not pinpoint this incompatibility in the release notes...

Given the debugging steps at T340861, I believe that I could unblock myself by building a custom conda environment with Spark 3.3 or 3.4 and using our Airflow skein operator with use_virtualenv_spark=True. This flag is marked experimental and, AFAIK, no one has used or tested it. Given the shuffler issue, this also means that I would have to hard-code the number of executors rather than use dynamic allocation.

But it is an option I want to disclose in case embarking on a full-on upgrade of Spark right now is not feasible.
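To make the shape of that experiment concrete, here is a rough sketch (the env name, versions, executor count, and job file are assumptions; untested):

```shell
# Hypothetical sketch: a standalone conda env with a newer pyspark, run
# without the external shuffle service. Because the 3.1.2 shuffler can't be
# used, dynamic allocation is off and the executor count is pinned.
conda create -y -n spark34-test -c conda-forge python=3.10 pyspark=3.4.1
conda activate spark34-test
spark-submit \
  --conf spark.shuffle.service.enabled=false \
  --conf spark.dynamicAllocation.enabled=false \
  --num-executors 16 \
  my_job.py
```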

BTullis triaged this task as High priority. (Aug 23 2023, 2:34 PM)
BTullis moved this task from Ready for Work to In Progress on the Data-Platform-SRE board.

Another update: Spark 3.3.3 was released on August 21st: https://spark.apache.org/releases/spark-release-3-3-3.html

Should I start working with version 3.3.3, do you think it would be better to use version 3.4.1 (https://spark.apache.org/releases/spark-release-3-4-1.html), or do you think that there is value in trying to build and test both versions?
(cc: @xcollazo @JAllemandou @Milimetric )

Re 3.3 vs 3.4, I have yet to do any tests on 3.4.

But actually, @BTullis , since I suspect that my current blocking issue (T340861#9101939) in Spark 3.3 is due to the fact that I am running it without an external shuffler, what would help me move T340861 forward right now would be the availability of a Spark 3.3 Shuffle Service so that I can test my hypothesis.

It seems possible to run two separate shufflers on Yarn: https://spark.apache.org/docs/latest/running-on-yarn.html#running-multiple-versions-of-the-spark-shuffle-service

This looks like way less work than a full upgrade for now, and presumably this is not throwaway work, since the same mechanism can be used to launch multiple shufflers for 3.3 and 3.4, which folks could use on an as-needed basis like me.
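Per that doc, each extra shuffler is registered in yarn-site.xml as its own aux-service, and clients then opt in by name. The client side would look roughly like this (the service name and port are assumptions about how we would register it):

```shell
# Hypothetical sketch: point a Spark 3.3 job at a versioned shuffle service
# that has been registered in yarn-site.xml as 'spark_shuffle_3_3'.
# spark.shuffle.service.name is supported since Spark 3.2.
spark-submit \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.shuffle.service.name=spark_shuffle_3_3 \
  --conf spark.shuffle.service.port=7447 \
  my_job.py
```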

> Re 3.3 vs 3.4, I have yet to do any tests on 3.4.
>
> But actually, @BTullis , since I suspect that my current blocking issue (T340861#9101939) in Spark 3.3 is due to the fact that I am running it without an external shuffler, what would help me move T340861 forward right now would be the availability of a Spark 3.3 Shuffle Service so that I can test my hypothesis.
>
> It seems possible to run two separate shufflers on Yarn: https://spark.apache.org/docs/latest/running-on-yarn.html#running-multiple-versions-of-the-spark-shuffle-service
>
> This looks like way less work than a full upgrade for now, and presumably this is not throwaway work, since the same mechanism can be used to launch multiple shufflers for 3.3 and 3.4, which folks could use on an as-needed basis like me.

Gotcha! I'm totally guided by you on this, and your idea seems like an efficient use of our time. We can put off a full upgrade of spark until later, and as you say, the work to enable multiple shuffler services is unlikely to go to waste.
The fact that we currently ship our yarn shuffler service jars with conda-analytics makes it a bit trickier to run multiple shufflers in parallel, but I'll look at how to extract them and install them separately.

Shall we decline this ticket and create a new one for enabling multiple shuffler services? The approach seems sufficiently different that it makes sense not to re-use this one.

BTullis claimed this task.

I created T344910: Deploy additional yarn shuffler services to support several versions of spark in parallel to track the work on enabling multiple yarn/spark shuffler services.

> Shall we decline this ticket and create a new one for enabling multiple shuffler services?

I think this ticket is still relevant medium term. We can deprioritize? But agreed we should split.

> The fact that we currently ship our yarn shuffler service jars with conda-analytics

Ah, but I think we don't, and you found out about this in T332765#8864203. So we should be able to generate the jar from the spark docker images, similarly to what you did on that ticket?

Gehel moved this task from In Progress to Misc on the Data-Platform-SRE board.
Gehel subscribed.

Re-opening; the Spark 3.x upgrade is still relevant in the medium term.

>> The fact that we currently ship our yarn shuffler service jars with conda-analytics
>
> Ah, but I think we don't, and you found out about this in T332765#8864203. So we should be able to generate the jar from the spark docker images, similarly to what you did on that ticket?

You're right, we do build it as part of the docker build pipeline.
However, we currently deploy it only to the hadoop workers, as part of the conda-analytics conda environment, here.

Anyway, I don't think it's too big a problem. We could even package each spark shuffler jar in its own debian package if we wanted, or something similar.

>> Shall we decline this ticket and create a new one for enabling multiple shuffler services?
>
> I think this ticket is still relevant medium term. We can deprioritize? But agreed we should split.

Yes, I agree that there is still value in it, I was just trying to be a bit too tidy. Apologies for jumping the gun there.

BTullis lowered the priority of this task from High to Medium. (Aug 25 2023, 4:58 PM)

Update: We have now got three versions of the spark shuffler running in production:

  • 3.1.2
  • 3.3.2
  • 3.4.1

Our production pipelines are all still running on spark version 3.1.2 and this is still what is built into conda-analytics.

I'm currently releasing a new version, 0.0.24, of conda-analytics, but this doesn't yet change the version of pyspark that we download from conda-forge, so it is still 3.1.2.
I'm ready whenever you want to go ahead with this upgrade of spark in production, but we will need to coordinate changes in several different repositories at once.

i.e.

So this is going to need a real team effort to co-ordinate, even for a minor version upgrade like this.
Let me know what you think about when we should plan to implement the changes. I can start to prepare the patches.

> Let me know what you think about when we should plan to implement the changes.

Both Dumps 2.0 work and Iceberg migrations will benefit from this, but this upgrade does not block those two work streams.

Given that December holidays are approaching, I would suggest that we leave the deployment of this to January or later, after folks are back.

Personally, I am committed to other work this month, but I am more than happy to help with this effort in January!