Bump Spark to 3.3.x or 3.4.x line.
Closed, DuplicatePublic
Actions

Assigned To

None

Authored By

	xcollazo
	Aug 15 2023, 3:42 PM

Description

While iterating on an Apache Iceberg MERGE INTO on T340861, we hit T342587, in which the MERGE job generates ~55000 small files

The old trick of adding a COALESCE hint did not fix the small files generated by the MERGE INTO. This is so because MERGE generates a custom query plan specific for Iceberg. The COALESCE is added, but not at the right node. See T340861#9093603 for query plan details.

We can go around this with Iceberg's rewrite_data_files(), but it is annoying and basically makes us write the data twice (one with ~55K files, and another time compacting it to 2 or 3 files!). Turns out this is a known issue on Iceberg's MERGE INTO that has been solved, but on Spark 3.2+. See https://github.com/apache/iceberg/pull/6828. TLDR: This is fixed on Iceberg support for Spark 3.3, and backported to Iceberg support for Spark 3.2, but not Spark 3.1. Note that we do not need to bump Iceberg to pickup this fixes, just Spark.

Although this is not a blocker, I believe it is a compelling reason to upgrade Spark.

We can pickup Spark 3.3.2, which is stable. Or we could go for Spark 3.4.1, which gives us the longest runway, but is presumably not as stable as the 3.3 branch as it has been released in the last couple months.

Bumping Spark to 3.3+ will also give us access to adaptive query execution improvements (think skewed join auto fix) that we have discussed before as nice to haves.

Related Objects
Search...

Status	Assigned	Task
Open	VirginiaPoundstone	T345988 [Epic] XML MediaWiki data dumps for right to fork
Resolved	Milimetric	T330296 Proof of concept for MediaWiki XML content dump via Event Platform, Iceberg and Spark
Duplicate	None	T344266 Bump Spark to 3.3.x or 3.4.x line.