Page MenuHomePhabricator

Upgrade Spark to a version with long term Iceberg support
Open, HighPublic

Description

In our Iceberg Working Session we ran out of time before discussing bumping Spark, however there was async support for it.

Starting with v1.4, Iceberg has dropped support for Spark 3.1, our current production version.

Options:
a) The Spark community released 3.4.0 on April 13 2023. Iceberg just released version 1.3.0 with support for Spark 3.4. This is the bleeding edge, but as with any .0 feature release there is risk of bugs on both Spark and Iceberg. We would have to bump Iceberg as well. We do win the longest runway. Update: Spark 3.4.1 is now available. Second update: Spark 3.5.0 is also now available.
b) The Spark community released 3.3.2 on Feb 17 2023. Iceberg has supported Spark 3.3 since 0.14.0. We already have Iceberg 1.2.1 which supports Spark 3.3, and the 3.3.2 is stable and well tested by now. We get a relatively shorter runway with this.

Whether we bump to 3.3, 3.4, or 3.5 line, we do win a bunch of perf improvements that will go well with T332765.

Migration guides:
https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-31-to-32
https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-32-to-33
https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-33-to-34
https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-34-to-35

Considering the migration guide does have breaking changes on syntax like ADD JAR and CSV output defaults (I originally thought there were none), it does seem like we should consider having the new spark version available jointly with the current version for a while. Perhaps by making it available as spark3_4-submit, etc?

In this task we should:

  • Decide whether to bump to Spark 3.3.X, 3.4.X, or 3.5.X line.
    • We will target version 3.5.3 (for now)
  • Decide whether to remove current Spark 3.1.2, or to have it available at the same time for a while.
    • We can't realistically do this at present
  • Install it on test cluster. Do sanity tests.
  • Install it on main cluster.
  • Allow users to migrate their own jobs from Spark 3.1.2 to Spark 3.5.8
  • Configure the use of Spark 3.5.8 by default

See also

Details

Related Objects

StatusSubtypeAssignedTask
OpenNone
OpenBTullis
OpenNone
ResolvedBTullis
OpenNone
OpenNone
OpenNone
DeclinedNone
OpenBTullis
ResolvedBTullis
Resolvedbrouberol
DeclinedNone
ResolvedBTullis
ResolvedBTullis
Resolvedbrouberol
OpenNone
ResolvedBTullis
ResolvedBTullis
Resolvedamastilovic

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change #1093393 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/refinery/source@master] Update Spark to version 3.5.3

https://gerrit.wikimedia.org/r/1093393

In taking over the work @BTullis had started in the patch just above (thanks again :), I have discovered a blocker for this migration.

We currently use Spark 3.1.2 which depends on Hadoop 3.2.0. This could already be problematic as we use Hadoop 2.10.1 (see T379385), but as the 2.10 stream of version where backports of 3.X, it has worked for us.

The newer spark versions depend on newer versions of Hadoop. I'm writing a compatibility matrix here, I have not found one on the Internet...

Spark versionHadoop version
3.2.X3.3.1
3.3.X3.3.2
3.4.X3.3.4
3.5.X3.3.4

Unfortunately the Hadoop version 3.3.1 (and newer) have (at least) one API breaking change preventing us to move forward.

The API breaking change I have investigated happens on the org.apache.hadoop.mapreduce.lib.input.LineRecordReader object:

This news is sad as we badly need newer spark versions, but on the bright side it will push us to upgrade Hadoop faster :)

Re T338057#10356492, we were not able to reproduce the issue that @JAllemandou sees in unit tests. We were able to successfully read JSON data from a notebook with Spark 3.3.2 and our current Hadoop system.

Longer:

We setup a Spark 3.3.2 Notebook and verified as follows:

%env SPARK_HOME=/home/xcollazo/.conda/envs/spark33/lib/python3.10/site-packages/pyspark
env: SPARK_HOME=/home/xcollazo/.conda/envs/spark33/lib/python3.10/site-packages/pyspark
%env SPARK_CONF_DIR=/etc/spark3/conf
env: SPARK_CONF_DIR=/etc/spark3/conf
!echo $SPARK_HOME
!echo $SPARK_CONF_DIR
/home/xcollazo/.conda/envs/spark33/lib/python3.10/site-packages/pyspark
/etc/spark3/conf
import wmfdata
spark = wmfdata.spark.create_custom_session(
    master='yarn',
    spark_config={
        "spark.shuffle.service.name": 'spark_shuffle_3_3',
        "spark.shuffle.service.port": '7338',
        "spark.yarn.archive": "hdfs:///user/spark/share/lib/spark-3.3.2-assembly.zip",
        "spark.dynamicAllocation.maxExecutors": 64,
        "spark.executor.memory": "20g",
        "spark.driver.memory": "32g",
        "spark.driver.cores": "4",
        "spark.executor.cores": "2",
        "spark.sql.shuffle.partitions": 5120,
        "spark.driver.maxResultSize": "8G",
        ##
        # extras to make Iceberg work on 3.3.2:
        ##
        "spark.jars.packages": "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.6.1",
        "spark.jars.ivySettings": "/etc/maven/ivysettings.xml",  # fix jar pulling
        "spark.sql.iceberg.locality.enabled" : "true",
    }
)
You are using Wmfdata v2.3.0, but v2.4.0 is available.

To update, run `pip install --upgrade git+https://github.com/wikimedia/wmfdata-python.git@release`.

To see the changes, refer to https://github.com/wikimedia/wmfdata-python/blob/release/CHANGELOG.md.



SPARK_HOME: /home/xcollazo/.conda/envs/spark33/lib/python3.10/site-packages/pyspark
Using Hadoop client lib jars at hadoop-client-api-3.3.2.jar
hadoop-client-runtime-3.3.2.jar, provided by Spark.
PYSPARK_PYTHON=/home/xcollazo/.conda/envs/spark33/bin/python
:: loading settings :: file = /etc/maven/ivysettings.xml
:: loading settings :: url = jar:file:/srv/home/xcollazo/.conda/envs/spark33/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/xcollazo/.ivy2/cache
The jars for the packages stored in: /home/xcollazo/.ivy2/jars
org.apache.iceberg#iceberg-spark-runtime-3.3_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-1f6e7495-c014-4f16-a3c7-4a88fd57681e;1.0
	confs: [default]
	found org.apache.iceberg#iceberg-spark-runtime-3.3_2.12;1.6.1 in mirrored
:: resolution report :: resolve 166ms :: artifacts dl 3ms
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   1   |   0   |   0   |   0   ||   1   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-1f6e7495-c014-4f16-a3c7-4a88fd57681e
	confs: [default]
	0 artifacts copied, 1 already retrieved (0kB/4ms)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


24/12/02 17:37:40 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
24/12/02 17:37:41 WARN Utils: Service 'sparkDriver' could not bind on port 12000. Attempting port 12001.
24/12/02 17:37:41 WARN Utils: Service 'sparkDriver' could not bind on port 12001. Attempting port 12002.
24/12/02 17:37:41 WARN Utils: Service 'sparkDriver' could not bind on port 12002. Attempting port 12003.
24/12/02 17:37:41 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/12/02 17:37:41 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
24/12/02 17:37:41 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
24/12/02 17:37:44 WARN Client: Same path resource file:///srv/home/xcollazo/.ivy2/jars/org.apache.iceberg_iceberg-spark-runtime-3.3_2.12-1.6.1.jar added multiple times to distributed cache.
24/12/02 17:37:49 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13000. Attempting port 13001.
24/12/02 17:37:49 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13001. Attempting port 13002.
24/12/02 17:37:49 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13002. Attempting port 13003.
24/12/02 17:37:49 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
spark.version
'3.3.2'
!hdfs dfs -ls /wmf/data/raw/webrequest/webrequest_text/year=2024/month=11/day=01/hour=00 | grep txt.gz | wc -l
168
(spark
   .read
   .json("/wmf/data/raw/webrequest/webrequest_text/year=2024/month=11/day=01/hour=00")
   .show(10)
)
                                                                                

+--------------------+---------------+--------------------+------------+--------------------+--------------------+------------------+-----------+-----------+--------------------+-----+--------------------+-------------+-----------+--------------+--------------------+-------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|              accept|accept_language|             backend|cache_status|        content_type|                  dt|          hostname|http_method|http_status|                  ip|range|             referer|response_size|   sequence|time_firstbyte|                 tls|           uri_host|            uri_path|           uri_query|          user_agent|         x_analytics|             x_cache|
+--------------------+---------------+--------------------+------------+--------------------+--------------------+------------------+-----------+-----------+--------------------+-----+--------------------+-------------+-----------+--------------+--------------------+-------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| <redacted>
+--------------------+---------------+--------------------+------------+--------------------+--------------------+------------------+-----------+-----------+--------------------+-----+--------------------+-------------+-----------+--------------+--------------------+-------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 10 rows

I have managed to find a way out of the test issue with dependencies exclusions. There still are 2 issues:

  • One test in PagesPartitionsDefinerSpec is failing. We need to investigate as this could be related to a change in spark partition handling.
  • The anomaly-detection code relies on an old package (criteo-rsvd). We'll probably need to fork the original code to update it.

Unassigning myself for now, since this is currently inactive.

(marking this as its own epic, and detaching from T366752: Dumps 2.0 Phase III: Production level dumps (SDS 1.2) to avoid confusion.)

One more +1 for Spark 3.5.

My use case is that I want to write directly to the data lake from a Airflow job running an Elixir application. There is no native or Java driver capable of making SQL queries from the BEAM (Elixir VM) so I was only able to integrate ( https://gitlab.com/wmde/technical-wishes/apache_hive_ex/ ) using HiveServer2, which is totally deprecated and in the end runs unacceptably slowly (averaging 20 inserted rows per second, when run using INSERT...VALUES... statements with a batch of 100 rows). The thick Spark client used by pyspark is unavailable for my runtime environment since it relies on a tightly-coupled Java process.

What I hope for the future is that we upgrade to minimum Spark 3.4 because this is the first version where the thin "Spark Connect" API becomes available. This is a gRPC service which can integrate with any runtime environment.

I think that I have a potential way forward for this Spark 3.5 upgrade, which should address the difficulties mentioned in T423052, T424381, and T424359.
It's going to take some work, but I think that it will be do-able.

The elements are as follows:

  1. Create a spark-358.deb package containing these items
  • Distribute this to all Hadoop nodes. The package structure will be like this:
/opt/spark/3.5.8/
  jars/
    iceberg-spark-runtime-3.5_2.12-1.7.1.jar
    (all other Spark jars)
  yarn/
    spark-3.5.8-yarn-shuffle.jar
  conf/
    spark-env.sh           <-- sets SPARK_DIST_CLASSPATH, HADOOP_CONF_DIR
    spark-defaults.conf    <-- sets Iceberg catalog config
  1. Create a conda-analytics-35.deb package, containing these items.
    • A python 3.10 environment (or 3.11 or 3.12 are valid targets, if you wish to upgrade)
    • A PySpark 3.5.8 distribution from conda-forge
    • wmfdata, pandas, numpy, pyarrow etc. - All of the existing data processing tools that we expect to find in conda-analytics.
    • Notably, this will exclude the jupyterhub components and the ability to do conda-analytics-clone - I think that jupyter can be moved to its own conda environment - see T321512: Install jupyterhub separately from conda-analytics
  • Distribute this to the Hadoop workers. The files will live under /opt/conda-analytics-35
  1. Create wrappers like this: /usr/bin/spark35-submit
#!/usr/bin/env bash
export SPARK_HOME=/opt/spark/3.5.8
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
export HADOOP_CONF_DIR=/etc/hadoop/conf
export PYSPARK_PYTHON=/opt/conda-analytics-35/bin/python3
exec $SPARK_HOME/bin/spark-submit "$@"
  1. Add the spark 3.5 shuffler for YARN to all worker nodes, in the same way that we are currenly doing for 3,1, 3,3, and 3.4

So I think that this will allow us to have Spark 3.5.8 working with Hadoop 2.10.2 with full compatibility for Scala jobs and not just the pyspark environment.

We will use the SPARK_DIST_CLASSPATH to make sure that our Hadoop 2.10.2 jars are loaded before any of the other Hadoop jars, such as the ones that come bundled with PySpark.

@xcollazo what do you think? Can you see any problems with this?

Doing it this way, will mean that we can leave our existing spark 3.1.2 version in production and migrate workloads piecemeal, by switching them to /usr/bin/spark35-submit
Then we can decommission 3.1.2 when all workloads have migrated.

I think that we will also need to build an airflow container image with spark 3.5.8, but this should be OK.

Notably, this will exclude the jupyterhub components and the ability to do conda-analytics-clone - I think that jupyter can be moved to its own conda environment - see T321512: Install jupyterhub separately from conda-analytics

Agreed. That work to separate those two things is certainly overdue tech debt.

Doing it this way, will mean that we can leave our existing spark 3.1.2 version in production and migrate workloads piecemeal,

If we can pull this off it would be amazing Ben. IIRC, we had a similar approach for the Spark2 -> Spark3 upgrade so that we didn't need to port everything at once.

@xcollazo what do you think? Can you see any problems with this?

I see no blockers. I do see extra work since it looks like we'd need to branch conda-analytics, refinery, and perhaps other projects so that we have can build them separately against Spark 3.1 and the new spark. I would also like to know what @JAllemandou thinks of this.


I would be happy to help validate things along the way!

BTW Spark 3.5.x will give us runway till Nov 2027:

Spark 3.5.x is the final minor release of the 3.x line and carries an extended LTS designation through November 2027 (security fixes only).

The extension is documented in the official versioning policy and was formally proposed via SPARK-55489 (Holden Karau, February 2026), motivated by there being less than one year between the Spark 4.0 GA release and the original 3.5 EOL date.

Note: the extended LTS applies only to the primary Spark repo — it does not cover Spark Connect (Swift/Rust/Go) or the Spark Kubernetes Operator, which have separate release cycles.

I have built a debian package containing the spark 3.5.8 binary distribution and tested this a little on an-test-client1002.

So far, it's looking OK.

btullis@an-test-client1002:~$ export SPARK_DIST_CLASSPATH=$(hadoop classpath)

btullis@an-test-client1002:~$ export HADOOP_CONF_DIR=/etc/hadoop/conf

btullis@an-test-client1002:~$ spark358-shell --master yarn
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Spark context Web UI available at http://an-test-client1002.eqiad.wmnet:4040
Spark context available as 'sc' (master = yarn, app id = application_1773777935345_40011).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.8
      /_/
         
Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 1.8.0_482)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

I'll start working on the puppet changes required to:

  • host this package on apt.wikimedia.org
  • distribute it to the hosts
  • Configure the SPARK_DIST_CLASSPATH and HADOOP_CONF_DIR environment variables automatically.

I'll also start working on the conda-analytics-35 deb file.

Oh, now this is not so good.

btullis@an-test-client1002:~$ spark358-sql --master yarn
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Error: Failed to load class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
Failed to load main class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
You need to build Spark with -Phive and -Phive-thriftserver.

Similarly, it fails if I try to launch the spark-thriftserver.

btullis@an-test-client1002:~$ spark358-thriftserver --master yarn
/usr/bin/spark358-thriftserver: line 36: /opt/spark/3.5.8/bin/spark-thriftserver: No such file or directory

So the Spark binary distribution that is hadoop-free is not built with the -Phive and -Phive-thriftserver profiles enabled.
This might be a big deal. I will continue investigating.

After talking with @BTullis yesterday, I confirm that spark 3.5.8 with hadoop 3 work on our cluster.
I think the easiest path to release this version on the cluster is to provide a new conda-environment embedding this Spark version, add a new shuffler for that version, and create new configuration files for it.

After talking with @BTullis yesterday, I confirm that spark 3.5.8 with hadoop 3 work on our cluster.
I think the easiest path to release this version on the cluster is to provide a new conda-environment embedding this Spark version, add a new shuffler for that version, and create new configuration files for it.

In my view, our priority should be to get the new shuffler working. If we have that, users can opt-in via config for their jobs to run on 3.5.8 in Airflow or in notebooks. I could document the steps to achieve both.

However, if we push a new conda-environment with 3.5.8, then we are forcing users, including wmfdata users, into 3.5.8. Thus, are we saying that we are committing to a full cluster Spark upgrade? Am I missing something?

In my view, our priority should be to get the new shuffler working. If we have that, users can opt-in via config for their jobs to run on 3.5.8 in Airflow or in notebooks. I could document the steps to achieve both.

I've got the new shuffler version 3.5.8 packaged and I'll now work on installing this, in the same way as we have for the other opt-in shuffler versions.

btullis@an-test-client1002:~$ apt-cache policy spark-3.5-yarn-shuffle 
spark-3.5-yarn-shuffle:
  Installed: (none)
  Candidate: 3.5.8-4
  Version table:
     3.5.8-4 1001
       1001 http://apt.wikimedia.org/wikimedia bullseye-wikimedia/main amd64 Packages

The package installs cleanly.

btullis@an-test-client1002:~$ sudo apt install spark-3.5-yarn-shuffle
<snip>
The following NEW packages will be installed:
  spark-3.5-yarn-shuffle
0 upgraded, 1 newly installed, 0 to remove and 9 not upgraded.
Need to get 75.1 MB of archives.
After this operation, 77.4 MB of additional disk space will be used.
Get:1 http://apt.wikimedia.org/wikimedia bullseye-wikimedia/main amd64 spark-3.5-yarn-shuffle amd64 3.5.8-4 [75.1 MB]
Fetched 75.1 MB in 1s (64.8 MB/s)                  
Selecting previously unselected package spark-3.5-yarn-shuffle.
Preparing to unpack .../spark-3.5-yarn-shuffle_3.5.8-4_amd64.deb ...
Unpacking spark-3.5-yarn-shuffle (3.5.8-4) ...
Setting up spark-3.5-yarn-shuffle (3.5.8-4) ...

However, if we push a new conda-environment with 3.5.8, then we are forcing users, including wmfdata users, into 3.5.8. Thus, are we saying that we are committing to a full cluster Spark upgrade? Am I missing something?

Agreed, we don't want to force everyone into a big-bang upgrade from 3.1.2 to 3.5.8 so we will need to be able to run two spark environments, in parallel.

The only way that I could think of doing that is to duplicate our conda-analytics environment into a conda-analytics-next environment.
https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics-next

I know that it would have been cleaner to have one repo that generates both artifacts, but I felt that there would be too much refactoring to make this work, so I cloned the existing repository.

This new package installs everything that conda-analytics does, but it uses a base path of /opt/conda-analytics-next
It provides the binary wrappers spark35-shell and spark35-submit etc so that these do not clash with the other conda-analytics environment.

Then when users would like to start using the new version, they would just need to use spark35-submit instead of spark3-submit.
This is the cleanest way that I could think of to have the two versions of spark installed concurrently.

For wmfdata-python I have a draft patch: https://gitlab.wikimedia.org/repos/data-engineering/wmfdata-python/-/merge_requests/79 and I have specified this commit hash in the conda-environment.yaml file for conda-analytics-next We might need to review the release instructions to add a new target tag, such as next as well as release.

Change #1285331 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add a new spark shuffler to the test cluster

https://gerrit.wikimedia.org/r/1285331

Change #1285331 merged by Btullis:

[operations/puppet@production] Add a new spark shuffler to the test cluster

https://gerrit.wikimedia.org/r/1285331

Change #1285361 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add conda-analytics-next to the Hadoop test cluster nodes

https://gerrit.wikimedia.org/r/1285361

I have been able to submit a job on the test cluster that uses spark version 3.5.8 and the corresponding version of the shuffler.

btullis@an-test-client1002:/etc/spark3/conf$ spark35-shell --conf spark.shuffle.service.name="spark_shuffle_3_5" --conf spark.shuffle.service.port=7340

Running /opt/conda-analytics-next/bin/spark-shell $@
SPARK_HOME: /opt/conda-analytics-next/lib/python3.10/site-packages/pyspark
Using Hadoop client lib jars at hadoop-client-api-3.3.4.jar
hadoop-client-runtime-3.3.4.jar, provided by Spark.
PYSPARK_PYTHON=/opt/conda-analytics-next/bin/python3
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
26/05/08 15:42:12 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Spark context Web UI available at http://an-test-client1002.eqiad.wmnet:4040
Spark context available as 'sc' (master = yarn, app id = application_1778252993089_0041).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.8
      /_/
         
Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 1.8.0_482)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("select count(1) from wmf.webrequest where year = 2026 and month = 5 and day = 5 and hour = 0").show(10, false)
+--------+                                                                      
|count(1)|
+--------+
|11118   |
+--------+

There are still a few things to make sure are correct in /etc/spark3/conf/spark-defaults.conf and /etc/spark3/conf/spark-env.sh before the two versions will be able to work alongside each other correctly, but this is a good start.

Change #1285740 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add the spark 3.5.8 shuffler to the prod hadoop cluster

https://gerrit.wikimedia.org/r/1285740

Change #1285740 merged by Btullis:

[operations/puppet@production] Add the spark 3.5.8 shuffler to the prod hadoop cluster

https://gerrit.wikimedia.org/r/1285740

Change #1285361 merged by Btullis:

[operations/puppet@production] Add conda-analytics-next to the Hadoop test cluster nodes

https://gerrit.wikimedia.org/r/1285361

xcollazo renamed this task from Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0 to Upgrade Spark to a version with long term Iceberg support.Mon, May 11, 1:16 PM

I'm very excited this is happening! @BTullis and @xcollazo, thank you for working on it! Some input from the analyst perspective:

Other package upgrades
The Spark upgrade unlocks a lot of upgrades to other packages in conda-analytics. Let's definitely apply those upgrades in conda-analytics-next!

Upgrading to Spark 3.5 should allow us to remove the version specs and pins for:

  • Pandas (T370705, T370707)
  • Numpy (T370710)
  • PyArrow (since I believe that its pin is for the current Pandas version)

It's possible that new specs and pins for later versions of these packages will be needed, but the best way to figure that out is to have an analyst test out a pre-release version of the environment.

And I'm happy to be the analyst that tests out the pre-release version!

(We'll also want to update Wmfdata's dependency requirements in the same way.)

Wmfdata
It seems like, to get the right cluster Spark version, Wmfdata only need to have the corresponding PySpark version in the environment. Is that correct? Or is there some other configuration that needs to happen?

For wmfdata-python I have a draft patch: https://gitlab.wikimedia.org/repos/data-engineering/wmfdata-python/-/merge_requests/79 and I have specified this commit hash in the conda-environment.yaml file for conda-analytics-next We might need to review the release instructions to add a new target tag, such as next as well as release.

Hmm, yes. Ideally, we'd be able to use proper version specifiers for this, so conda-analytics could require 2.5.* and conda-analytics-next could require >= 2.6. But to do that we'd need to start using GitLab's package registry feature so the versions are provided as actual versions rather than tags that happen to be version strings. I'd welcome you doing that if you have the bandwidth, but I can also try to find time to do it myself. Otherwise, yes, a new branch (maybe release-conda-analytics-next for clarity) is probably the least bad option.

New environment naming
This may be the second or third time we've used the pattern of deploying an new analytics environment alongside an old one to allow for a gradual switch. What about naming the repo/package "conda-analytics-v2" rather than "conda-analytics-next"?

I think that better sets the expectation that this is a procedure that will happen from time to time and avoids potential confusion when we need to do this again and are faced with the pattern of forking "conda-analytics-next" into ... "conda-analytics-next-next"? 😁

Analyst experience
How will this interact with terminal commands like conda-analytics-clone and with the environment picker in JupyterHub? Will there be options to use conda-analytics-next with each of those?

If there's a need to set one as the default (either explicitly, like in the JupyterHub dropdown, or implicitly through naming), I would suggest conda-analytics-next. If users aren't aware of the difference/don't have a preference on migration timing, we want them to try out the new environment sooner rather than later. Best case, everything works fine and the migration is done. Worst case, they experience breakage, ask around, and get help either updating their code or going back to the old environment. At a minimum, they're aware of the pending issue sooner rather than later. Many people keep using the same environment for a long time so it could easily be late in the migration period before they try creating a new environment for the first time.

Also, it seems like a good idea to go ahead and decide how long people will have to migrate to Spark 3.5/conda-analytics-next. That way we can set this expectation from the start and allow y'all to plan upfront when you can start removing the old versions. 6 months? I think it could be faster if y'all want, but it seems like the intention is not to push for the fastest possible transition.

Change #1287837 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Install conda-analytics-next to the production Hadoop cluster

https://gerrit.wikimedia.org/r/1287837

Change #1287855 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add a hadoop::spark35 profile and deploy it alongside hadoop::spark3

https://gerrit.wikimedia.org/r/1287855

Change #1287837 merged by Btullis:

[operations/puppet@production] Install conda-analytics-next to the production Hadoop cluster

https://gerrit.wikimedia.org/r/1287837

Change #1287855 merged by Btullis:

[operations/puppet@production] Add a hadoop::spark35 profile and deploy it alongside hadoop::spark3

https://gerrit.wikimedia.org/r/1287855

Change #1288486 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Fix the conda-analytics-next prefix for spark35

https://gerrit.wikimedia.org/r/1288486

Change #1288486 merged by Btullis:

[operations/puppet@production] Fix the conda-analytics-next prefix for spark35

https://gerrit.wikimedia.org/r/1288486

I have now pushed out conda-analytics-next and the new profile::hadoop::spark35 to production.

We have a set of configuration files tailored for this version in /etc/spark35/conf
The wrapper scripts are now shipped with puppet and look like this:

btullis@stat1008:~$ cat `which spark35-sql`
#!/usr/bin/env bash
# Managed by Puppet
export SPARK_CONF_DIR=/etc/spark35/conf
exec /usr/lib/spark35/bin/spark-sql "$@"

They can be started like this:

btullis@stat1008:~$ spark35-shell
SPARK_HOME: /opt/conda-analytics-next/lib/python3.10/site-packages/pyspark
Using Hadoop client lib jars at hadoop-client-api-3.3.4.jar
hadoop-client-runtime-3.3.4.jar, provided by Spark.
PYSPARK_PYTHON=/opt/conda-analytics-next/bin/python3
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
26/05/18 10:39:51 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
26/05/18 10:40:00 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
Spark context Web UI available at http://stat1008.eqiad.wmnet:4045
Spark context available as 'sc' (master = yarn, app id = application_1778834239566_49017).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.8
      /_/
         
Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 1.8.0_482)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

And queries of both parquet and iceberg tables seem to work, with the external shuffle service enabled by default.

scala> spark.sql("select * from wmf_traffic.aqs_hourly where hour = '2026-05-17 00:00:00' limit 5").show();
+------------+-----------+-----------+-------------+-------------+--------------------+-------------+-------------------+
|cache_status|http_status|http_method|response_size|     uri_host|            uri_path|request_count|               hour|
+------------+-----------+-----------+-------------+-------------+--------------------+-------------+-------------------+
|   hit-local|        200|        GET|         2175|wikimedia.org|/api/rest_v1/metr...|            1|2026-05-17 00:00:00|
|   int-front|        429|        GET|         2932|wikimedia.org|/api/rest_v1/metr...|            2|2026-05-17 00:00:00|
|        miss|        200|        GET|        10921|wikimedia.org|/api/rest_v1/metr...|            1|2026-05-17 00:00:00|
|   hit-front|        200|        GET|         2309|wikimedia.org|/api/rest_v1/metr...|            4|2026-05-17 00:00:00|
|   hit-front|        200|        GET|         2496|wikimedia.org|/api/rest_v1/metr...|          147|2026-05-17 00:00:00|
+------------+-----------+-----------+-------------+-------------+--------------------+-------------+-------------------+

I'm very excited this is happening! @BTullis and @xcollazo, thank you for working on it! Some input from the analyst perspective:

Other package upgrades
The Spark upgrade unlocks a lot of upgrades to other packages in conda-analytics. Let's definitely apply those upgrades in conda-analytics-next!

Upgrading to Spark 3.5 should allow us to remove the version specs and pins for:

  • Pandas (T370705, T370707)
  • Numpy (T370710)
  • PyArrow (since I believe that its pin is for the current Pandas version)

Thanks, Yes we will be able to come back to these very soon. I'd just like to make sure that we have a good way for users to be able to configure their tasks to use either conda-analytics or conda-analytics-next, before I spend more time on upgrading the other components within the environment.

And I'm happy to be the analyst that tests out the pre-release version!

Great! We will take you up on this offer, thanks.

Wmfdata
It seems like, to get the right cluster Spark version, Wmfdata only need to have the corresponding PySpark version in the environment. Is that correct? Or is there some other configuration that needs to happen?

I believe that you're right. Setting the corresponding pyspark version is the only thing that needs to happen.

For wmfdata-python I have a draft patch: https://gitlab.wikimedia.org/repos/data-engineering/wmfdata-python/-/merge_requests/79 and I have specified this commit hash in the conda-environment.yaml file for conda-analytics-next We might need to review the release instructions to add a new target tag, such as next as well as release.

Hmm, yes. Ideally, we'd be able to use proper version specifiers for this, so conda-analytics could require 2.5.* and conda-analytics-next could require >= 2.6. But to do that we'd need to start using GitLab's package registry feature so the versions are provided as actual versions rather than tags that happen to be version strings. I'd welcome you doing that if you have the bandwidth, but I can also try to find time to do it myself. Otherwise, yes, a new branch (maybe release-conda-analytics-next for clarity) is probably the least bad option.

I don't think that I do have the bandwidth for this task, I'm sorry to say. I found the documented release process for wmfdata a little confusing, so I think that you might be more comfortable making these changes than I would be.

New environment naming
This may be the second or third time we've used the pattern of deploying an new analytics environment alongside an old one to allow for a gradual switch. What about naming the repo/package "conda-analytics-v2" rather than "conda-analytics-next"?
I think that better sets the expectation that this is a procedure that will happen from time to time and avoids potential confusion when we need to do this again and are faced with the pattern of forking "conda-analytics-next" into ... "conda-analytics-next-next"? 😁

I would think that it's more likely that we will retain just the two environments and promote changes from the next to the production environment.

Analyst experience
How will this interact with terminal commands like conda-analytics-clone and with the environment picker in JupyterHub? Will there be options to use conda-analytics-next with each of those?

Yes, we have the commands:

  • conda-analytics-next-activate
  • conda-analytics-next-clone
  • conda-analytics-next-deactivate
  • conda-analytics-next-list

They should work just like their non-next variants and will share the ~/.conda/ directory.
The environments that are created should be compatible with Jupyter, although at the moment we haven't thought about how to get JupyterHub to spawn a new conda-analytics-nextenvironment itself.
This is currently hard-coded to use /opt/conda-analytics.

If there's a need to set one as the default (either explicitly, like in the JupyterHub dropdown, or implicitly through naming), I would suggest conda-analytics-next. If users aren't aware of the difference/don't have a preference on migration timing, we want them to try out the new environment sooner rather than later. Best case, everything works fine and the migration is done. Worst case, they experience breakage, ask around, and get help either updating their code or going back to the old environment. At a minimum, they're aware of the pending issue sooner rather than later. Many people keep using the same environment for a long time so it could easily be late in the migration period before they try creating a new environment for the first time.

I will give this some thought.

Also, it seems like a good idea to go ahead and decide how long people will have to migrate to Spark 3.5/conda-analytics-next. That way we can set this expectation from the start and allow y'all to plan upfront when you can start removing the old versions. 6 months? I think it could be faster if y'all want, but it seems like the intention is not to push for the fastest possible transition.

Agreed, but right now we are working on how we can give people a simple means of selecting which Spark version they would like by changing a DAG parameter.
I wouldn't like to talk about timelines until we have verified that users have an easy migration path.

Change #1093393 abandoned by Btullis:

[analytics/refinery/source@master] Update Spark to version 3.5.3

Reason:

Leaving this to Data Engineering to upgrade

https://gerrit.wikimedia.org/r/1093393

@BTullis thank you for all the responses! Sounds like you're basically ahead of me on all the fronts, which is not surprising 😁

And I'm happy to be the analyst that tests out the pre-release version!

Great! We will take you up on this offer, thanks.

Sounds good! FYI, I'm out the rest of this week and then again June 3-6.

Hmm, yes. Ideally, we'd be able to use proper version specifiers for this, so conda-analytics could require 2.5.* and conda-analytics-next could require >= 2.6. But to do that we'd need to start using GitLab's package registry feature so the versions are provided as actual versions rather than tags that happen to be version strings. I'd welcome you doing that if you have the bandwidth, but I can also try to find time to do it myself. Otherwise, yes, a new branch (maybe release-conda-analytics-next for clarity) is probably the least bad option.

I don't think that I do have the bandwidth for this task, I'm sorry to say. I found the documented release process for wmfdata a little confusing, so I think that you might be more comfortable making these changes than I would be.

Totally fair! I looked into what it would take to start using the GitLab package registry and it doesn't look that hard to set up, but I think I still shouldn't commit to it right now. So I think the two branch idea is good (and I think release-next is actually a fine name since Wmfdata is so closely tied to conda-analytics that there's no need for disambiguation).

But I'm happy to take care of that branching and releasing. I'm thinking I'll wait until conda-analytics-next is more finalized (particularly knowing the package specs so I can update Wmfdata's dependencies accordingly), but let me know if it would be helpful to have something done sooner.

New environment naming
This may be the second or third time we've used the pattern of deploying an new analytics environment alongside an old one to allow for a gradual switch. What about naming the repo/package "conda-analytics-v2" rather than "conda-analytics-next"?
I think that better sets the expectation that this is a procedure that will happen from time to time and avoids potential confusion when we need to do this again and are faced with the pattern of forking "conda-analytics-next" into ... "conda-analytics-next-next"? 😁

I would think that it's more likely that we will retain just the two environments and promote changes from the next to the production environment.

Ah, yes, that makes a lot of sense! We eventually promote conda-analytics-next to conda-analytics. conda-analytics-next stays around but behaves identically to conda-analytics until we need it again, like superset-next.wikimedia.org