
Iceberg 1.6.1 bug makes SELECTs fail due to vectorized read path being the default
Open, Needs Triage, Public

Description

Summary

MWHistoryDeltaWriter (Spark 3.5 job) crashes on executors with java.lang.NoClassDefFoundError deep inside the Iceberg Arrow vectorized Parquet reader:

java.lang.NoClassDefFoundError
  at org.apache.iceberg.shaded.io.netty.util.internal.shaded.org.jctools.queues.MessagePassingQueueUtil
  ...
  at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.read
  at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read

Presumed root cause

The cluster now runs two Spark/Iceberg stacks side by side:

  • Spark 3.1.2 + Iceberg 1.2.1 (existing)
  • Spark 3.5.8 + Iceberg 1.6.1 (new)

Both Iceberg runtime JARs end up on the executor classpath when submitting a Spark 3.5 job. Iceberg 1.2.1 and 1.6.1 ship different shaded versions of Arrow and Netty internally. When the JVM loads Arrow/Netty classes from the 1.2.1 JAR first, the 1.6.1 vectorized reader cannot find its own shaded JCTools classes and crashes.

refinery-job-35/pom.xml correctly declares iceberg-spark-runtime-3.5_2.12:1.6.1 as provided (not shaded in), so the fix is on the cluster configuration side, not in the job.
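One way to probe this hypothesis is to check which runtime JAR actually contains the class named in the stack trace. The following is only a minimal sketch, not something taken from the job: the helper names class_entry_path and listing_has_class are hypothetical, and it assumes a JAR listing with one entry per line on stdin (for example from unzip -Z1 or jar tf).

```shell
# class_entry_path CLASS
# Convert a fully qualified class name into its entry path inside a JAR.
class_entry_path() {
  printf '%s.class\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# listing_has_class CLASS
# Read a JAR listing (one entry per line) on stdin and succeed only if it
# contains an entry for CLASS. -F: fixed string, -x: whole line, -q: quiet.
listing_has_class() {
  grep -Fxq "$(class_entry_path "$1")"
}

# Example (JAR is a placeholder for the runtime JAR under test):
#   unzip -Z1 "$JAR" | listing_has_class \
#     org.apache.iceberg.shaded.io.netty.util.internal.shaded.org.jctools.queues.MessagePassingQueueUtil \
#     && echo present || echo missing
```

If the class is present in the 1.6.1 runtime JAR, a missing class file is unlikely to be the whole story, and the load order or the JVM itself would be the next thing to look at.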

Current workaround

--conf spark.sql.iceberg.vectorization.enabled=false disables the Arrow vectorized reader entirely, avoiding the conflict. The job then runs successfully.

Investigation / fix

Ensure the Iceberg 1.2.1 JAR is not present on the executor classpath when submitting Spark 3.5 jobs — either by isolating the two Spark stacks' classpaths or by adding spark.executor.userClassPathFirst=true so the 1.6.1 JAR takes precedence. Confirm by re-running without vectorization.enabled=false.
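To check whether more than one Iceberg runtime really is visible to an executor, something like the sketch below could be used. It is an illustration under assumptions: list_iceberg_jars is a hypothetical helper, its input is a colon-separated classpath string captured from an executor by whatever means is available, and wildcard entries such as dir/* are resolved against the local filesystem of the host where the helper runs.

```shell
# list_iceberg_jars CLASSPATH
# Print every iceberg-spark-runtime JAR reachable from a colon-separated
# classpath string, one per line.
list_iceberg_jars() {
  # Split on ':' via tr so entries like "dir/*" are not glob-expanded early.
  printf '%s\n' "$1" | tr ':' '\n' | while IFS= read -r entry; do
    case "$entry" in
      */\*)
        # JVM wildcard entry: look for runtime JARs inside the directory.
        dir=${entry%/*}   # strip the trailing "/*"
        for jar in "$dir"/iceberg-spark-runtime-*.jar; do
          if [ -e "$jar" ]; then printf '%s\n' "$jar"; fi
        done
        ;;
      *iceberg-spark-runtime-*.jar)
        # Explicit JAR entry.
        printf '%s\n' "$entry"
        ;;
    esac
  done
}

# Example (EXECUTOR_CLASSPATH is a placeholder):
#   list_iceberg_jars "$EXECUTOR_CLASSPATH"
```

More than one line of output for a Spark 3.5 executor would support the classpath-clash theory. A single 1.6.1 line would point elsewhere.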

Event Timeline

Simple repro with spark35-sql:

spark35-sql \
  --master yarn \
  --conf spark.executor.cores=2 \
  --conf spark.executor.memory=8g \
  --conf spark.driver.memory=8g \
  --conf spark.sql.shuffle.partitions=768 \
  --conf spark.dynamicAllocation.maxExecutors=128 \
  --conf spark.executor.memoryOverhead=6g \
  --conf spark.sql.legacy.timeParserPolicy=LEGACY

select count(1) as count, source from xcollazo.mediawiki_history_incremental_v1 group by source;

--conf spark.sql.iceberg.vectorization.enabled=false fixes it.
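For repeated A/B checks, the repro can be wrapped in a small helper. This is a rough sketch under assumptions: compare_vectorization and the SPARK_SQL_CMD override are hypothetical names, and it assumes spark35-sql forwards the -e option to the spark-sql CLI and exits non-zero when the query fails. The resource settings from the repro above are omitted for brevity.

```shell
# compare_vectorization TABLE
# Run the repro query twice, with the Iceberg vectorized reader on and off,
# and print one status line per run.
compare_vectorization() {
  table="$1"
  query="select count(1) as count, source from ${table} group by source"
  for flag in true false; do
    # SPARK_SQL_CMD is a hypothetical override hook; it defaults to spark35-sql.
    if ${SPARK_SQL_CMD:-spark35-sql} --master yarn \
        --conf "spark.sql.iceberg.vectorization.enabled=${flag}" \
        -e "$query" >/dev/null 2>&1; then
      printf 'vectorization.enabled=%s: ok\n' "$flag"
    else
      printf 'vectorization.enabled=%s: FAILED\n' "$flag"
    fi
  done
}

# Example:
#   compare_vectorization xcollazo.mediawiki_history_incremental_v1
```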

This could very well be a legit Iceberg bug as well.

Both Iceberg runtime JARs end up on the executor classpath when submitting a Spark 3.5 job.

Could you expand a little on this, please? Maybe it's a configuration that we can override.

Since the two jars are in /opt/conda-analytics and /opt/conda-analytics-next, they are intended to be totally isolated.

btullis@an-worker1200:~$ find /opt/conda-analytics* -name iceberg*
/opt/conda-analytics/lib/python3.10/site-packages/pyspark/jars/iceberg-spark-runtime-3.1_2.12-1.2.1.jar
/opt/conda-analytics-next/lib/python3.10/site-packages/pyspark/jars/iceberg-spark-runtime-3.5_2.12-1.6.1.jar

I didn't think that the spark35-submit command made reference to /opt/conda-analytics at all.

I think we have hit: https://github.com/apache/iceberg/issues/521
Our schema has nested fields (arrays). It'd be interesting to check without those fields to validate.

It'd be interesting to check without those fields to validate.

I can give it a go and come back

Tested and got the same result.

repro:

CREATE TABLE your_db.mediawiki_history_incremental_v1 (
  source STRING COMMENT 'Row provenance: events (daily delta) or snapshot (monthly merge).',
  wiki_id STRING COMMENT 'Wiki identifier, e.g. enwiki, dewiki, eswiktionary.',
  event_entity STRING COMMENT 'Entity type: revision, user, or page.',
  event_type STRING COMMENT 'Event sub-type: create, edit, move, delete, etc.',
  event_timestamp TIMESTAMP COMMENT 'When this event occurred.',
  event_user_id BIGINT COMMENT 'Local MediaWiki user ID of the actor; NULL for anonymous users.',
  event_user_central_id BIGINT COMMENT 'Global CentralAuth user ID of the actor; NULL for anonymous users.',
  event_user_text_historical STRING COMMENT 'Username or IP at the time of the event.',
  event_user_is_bot_by_historical STRING COMMENT 'Bot classification methods at event time: name and/or group.',
  event_user_is_created_by_self BOOLEAN COMMENT 'True if the user registered their own account.',
  event_user_is_anonymous BOOLEAN COMMENT 'True if the actor had no local user account at event time.',
  event_user_is_temporary BOOLEAN COMMENT 'True if the actor was a temporary (auto-created) account.',
  event_user_is_permanent BOOLEAN COMMENT 'True if the actor was a permanent registered account.',
  event_user_registration_timestamp TIMESTAMP COMMENT 'When the actor account was registered; NULL for anonymous users.',
  event_user_revision_count BIGINT COMMENT 'Edit count of the actor at event time.',
  page_id BIGINT COMMENT 'Page ID at event time.',
  page_title_historical STRING COMMENT 'Page title (without namespace prefix) at event time.',
  page_namespace_historical INT COMMENT 'Page namespace ID at event time.',
  page_namespace_is_content_historical BOOLEAN COMMENT 'True if page_namespace_historical is a content namespace.',
  revision_id BIGINT COMMENT 'Revision ID; NULL for non-revision events.',
  revision_parent_id BIGINT COMMENT 'Parent revision ID; NULL for page-creation revisions.',
  revision_minor_edit BOOLEAN COMMENT 'True if the editor flagged this as a minor edit.',
  revision_text_bytes BIGINT COMMENT 'Uncompressed byte size of the revision text.',
  revision_text_bytes_diff BIGINT COMMENT 'Byte delta vs. parent revision; NULL for page-creation revisions.',
  revision_text_sha1 STRING COMMENT 'SHA-1 of the concatenated slot content (all-slots hash).',
  revision_is_identity_reverted BOOLEAN COMMENT 'True if a later revision restored the page to the state before this one. For source=events rows, bounded to 90 days; may be patched when a late revert arrives.',
  revision_first_identity_reverting_revision_id BIGINT COMMENT 'ID of the first revision that identity-reverted this one; NULL if not reverted.',
  revision_seconds_to_identity_revert BIGINT COMMENT 'Seconds between this revision and the first identity revert; NULL if not reverted.',
  revision_is_identity_revert BOOLEAN COMMENT 'True if this revision itself is an identity revert (restores a prior page state).',
  revision_is_identity_reverted_within_90_days BOOLEAN COMMENT 'Bounded tier: true when revision_is_identity_reverted is true and the revert arrived within 90 days. Always true or false, never NULL.',
  revision_first_identity_reverting_revision_id_within_90_days BIGINT COMMENT 'ID of the first identity-reverting revision if the revert was within 90 days; NULL otherwise.',
  revision_seconds_to_identity_revert_within_90_days BIGINT COMMENT 'Seconds to first identity revert if within 90 days; NULL otherwise.',
  revision_is_identity_revert_within_90_days BOOLEAN COMMENT 'Bounded tier: true when this revision is an identity revert and the reverted revision was reverted within 90 days. Always true or false, never NULL.',
  revision_tags STRING COMMENT 'Change tags applied to this revision.')
USING iceberg
PARTITIONED BY (days(event_timestamp))
LOCATION 'xxx/your_db.db/mediawiki_history_incremental_v1'
TBLPROPERTIES (
  'current-snapshot-id' = '7655307092681748022',
  'format' = 'iceberg/parquet',
  'format-version' = '2',
  'write.parquet.compression-codec' = 'zstd',
  'write.target-file-size-bytes' = '134217728')

In my case your_db=apizzata. Note the absence of nested fields.

spark35-sql \
  --master yarn \
  --conf spark.executor.cores=2 \
  --conf spark.executor.memory=8g \
  --conf spark.driver.memory=8g \
  --conf spark.sql.shuffle.partitions=768 \
  --conf spark.dynamicAllocation.maxExecutors=128 \
  --conf spark.executor.memoryOverhead=6g \
  --conf spark.sql.legacy.timeParserPolicy=LEGACY \
  --conf spark.sql.iceberg.vectorization.enabled=false

insert into apizzata.mediawiki_history_incremental_v1
select
source,wiki_id,event_entity,event_type,event_timestamp,event_user_id,event_user_central_id,event_user_text_historical,event_user_is_bot_by_historical[0],event_user_is_created_by_self,event_user_is_anonymous,event_user_is_temporary,event_user_is_permanent,event_user_registration_timestamp,event_user_revision_count,page_id,page_title_historical,page_namespace_historical,page_namespace_is_content_historical,revision_id,revision_parent_id,revision_minor_edit,revision_text_bytes,revision_text_bytes_diff,revision_text_sha1,revision_is_identity_reverted,revision_first_identity_reverting_revision_id,revision_seconds_to_identity_revert,revision_is_identity_revert,revision_is_identity_reverted_within_90_days,revision_first_identity_reverting_revision_id_within_90_days,revision_seconds_to_identity_revert_within_90_days,revision_is_identity_revert_within_90_days,revision_tags[0]
from xcollazo.mediawiki_history_incremental_v1

Response code
Time taken: 577.826 seconds

select count(1), source from apizzata.mediawiki_history_incremental_v1 group by source;
count(1)	source
33406783	events
8347238933	snapshot
Time taken: 57.582 seconds, Fetched 2 row(s)


 select count(1), source from xcollazo.mediawiki_history_incremental_v1 group by source;
count(1)	source
33406783	events
8347238933	snapshot
Time taken: 83.16 seconds, Fetched 2 row(s)


exit;

As expected, it works with --conf spark.sql.iceberg.vectorization.enabled=false.

spark35-sql \
  --master yarn \
  --conf spark.executor.cores=2 \
  --conf spark.executor.memory=8g \
  --conf spark.driver.memory=8g \
  --conf spark.sql.shuffle.partitions=768 \
  --conf spark.dynamicAllocation.maxExecutors=128 \
  --conf spark.executor.memoryOverhead=6g \
  --conf spark.sql.legacy.timeParserPolicy=LEGACY 

select count(1), source from apizzata.mediawiki_history_incremental_v1 group by source;

26/05/21 14:44:17 ERROR TaskSetManager: Task 13 in stage 0.0 failed 4 times; aborting job
Job aborted due to stage failure: Task 13 in stage 0.0 failed 4 times,
most recent failure: Lost task 13.3 in stage 0.0 (TID 63)
(an-worker1195.eqiad.wmnet executor 4):
java.lang.NoClassDefFoundError:
    at org.apache.iceberg.shaded.io.netty.util.internal.shaded.org.jctools.queues.MessagePassingQueueUtil.drain(MessagePassingQueueUtil.java:39)
    at org.apache.iceberg.shaded.io.netty.util.internal.shaded.org.jctools.queues.BaseMpscLinkedArrayQueue.drain(BaseMpscLinkedArrayQueue.java:612)
    at org.apache.iceberg.shaded.io.netty.util.internal.shaded.org.jctools.queues.MpscChunkedArrayQueue.drain(MpscChunkedArrayQueue.java:43)
    at org.apache.iceberg.shaded.io.netty.util.Recycler$LocalPool.claim(Recycler.java:326)
    at org.apache.iceberg.shaded.io.netty.util.Recycler.get(Recycler.java:181)
    at org.apache.iceberg.shaded.io.netty.util.internal.ObjectPool$RecyclerObjectPool.get(ObjectPool.java:86)
    at org.apache.iceberg.shaded.io.netty.buffer.PooledUnsafeDirectByteBuf.newInstance(PooledUnsafeDirectByteBuf.java:39)
    at org.apache.iceberg.shaded.io.netty.buffer.PoolArena$DirectArena.newByteBuf(PoolArena.java:720)
    at org.apache.iceberg.shaded.io.netty.buffer.PoolArena.allocate(PoolArena.java:125)
    at org.apache.iceberg.shaded.io.netty.buffer.PooledByteBufAllocatorL$InnerAllocator.newDirectBufferL(PooledByteBufAllocatorL.java:178)
    at org.apache.iceberg.shaded.io.netty.buffer.PooledByteBufAllocatorL$InnerAllocator.directBuffer(PooledByteBufAllocatorL.java:211)
    at org.apache.iceberg.shaded.io.netty.buffer.PooledByteBufAllocatorL.allocate(PooledByteBufAllocatorL.java:58)
    at org.apache.iceberg.shaded.org.apache.arrow.memory.NettyAllocationManager.<init>(NettyAllocationManager.java:77)
    at org.apache.iceberg.shaded.org.apache.arrow.memory.NettyAllocationManager.<init>(NettyAllocationManager.java:84)
    at org.apache.iceberg.shaded.org.apache.arrow.memory.NettyAllocationManager$1.create(NettyAllocationManager.java:34)
    at org.apache.iceberg.shaded.org.apache.arrow.memory.BaseAllocator.newAllocationManager(BaseAllocator.java:355)
    at org.apache.iceberg.shaded.org.apache.arrow.memory.BaseAllocator.newAllocationManager(BaseAllocator.java:350)
    at org.apache.iceberg.shaded.org.apache.arrow.memory.BaseAllocator.bufferWithoutReservation(BaseAllocator.java:338)
    at org.apache.iceberg.shaded.org.apache.arrow.memory.BaseAllocator.buffer(BaseAllocator.java:316)
    at org.apache.iceberg.shaded.org.apache.arrow.memory.BaseAllocator.buffer(BaseAllocator.java:280)
    at org.apache.iceberg.shaded.org.apache.arrow.vector.BaseValueVector.allocFixedDataAndValidityBufs(BaseValueVector.java:224)
    at org.apache.iceberg.shaded.org.apache.arrow.vector.BaseFixedWidthVector.reAlloc(BaseFixedWidthVector.java:442)
    at org.apache.iceberg.shaded.org.apache.arrow.vector.BaseFixedWidthVector.setValueCount(BaseFixedWidthVector.java:758)
    at org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator$BatchReader.nextBatch(VectorizedColumnIterator.java:80)
    at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.read(VectorizedArrowReader.java:150)
    at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader$ColumnBatchLoader.readDataToColumnVectors(ColumnarBatchReader.java:123)
    at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader$ColumnBatchLoader.loadDataToColumnBatch(ColumnarBatchReader.java:98)
    at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:72)
    at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:44)
    at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:147)
    at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:138)
    at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:146)
    at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:184)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:71)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithKeys_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
    ...

Ideally we would try with latest Iceberg 1.11.0 to see if we can repro there, but 1.6.1 is the last version with Java 8 support.

Thus, longer term, we should look into running Java 11 on top of YARN.
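Since the usable Iceberg version depends on the JVM (per the comment above, 1.6.1 is the last release with Java 8 support), it can be handy to confirm which Java major version a host actually runs. A small sketch, with hypothetical helper names java_major_version and current_java_version:

```shell
# java_major_version VERSION
# Reduce a Java version string to its major version number.
# Handles both the legacy "1.8.0_392" form and the modern "11.0.31" form.
java_major_version() {
  case "$1" in
    1.*) v=${1#1.}; printf '%s\n' "${v%%.*}" ;;
    *)   printf '%s\n' "${1%%.*}" ;;
  esac
}

# current_java_version
# Extract the quoted version string from `java -version` (printed on stderr).
current_java_version() {
  java -version 2>&1 | awk -F'"' '/version/ { print $2; exit }'
}

# Example:
#   java_major_version "$(current_java_version)"
```

Note that this only reports the JVM found on the PATH of the host where it runs. The executors use whatever JAVA_HOME YARN gives them.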

I have tested with the following spark session:

# Annotations for the command below:
#   spark.sql.catalog.iceberg_check.warehouse, spark.driver.userClassPathFirst,
#   spark.executor.userClassPathFirst: amenities to check the iceberg version
#   spark.jars: providing my own iceberg runtime
spark35-sql \
  --master yarn \
  --conf spark.executor.cores=2 \
  --conf spark.executor.memory=8g \
  --conf spark.driver.memory=8g \
  --conf spark.sql.shuffle.partitions=768 \
  --conf spark.dynamicAllocation.maxExecutors=128 \
  --conf spark.executor.memoryOverhead=6g \
  --conf spark.sql.legacy.timeParserPolicy=LEGACY \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.iceberg_check=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.iceberg_check.type=hadoop \
  --conf spark.sql.catalog.iceberg_check.warehouse=hdfs:///tmp/iceberg-version-check \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --conf spark.executorEnv.SPARK_HOME=$SPARK_HOME \
  --conf spark.executorEnv.SPARK_CONF_DIR=/etc/spark3/conf \
  --conf spark.yarn.appMasterEnv.SPARK_HOME=$SPARK_HOME \
  --conf spark.yarn.appMasterEnv.SPARK_CONF_DIR=/etc/spark3/conf \
  --conf spark.yarn.archive=hdfs:///user/a-pizzata/artifacts/spark-3.5.8-jars.zip \
  --conf spark.jars=hdfs:///user/a-pizzata/artifacts/iceberg-spark-runtime-3.5_2.12-1.6.1.jar

SET spark.jars;
key	value
spark.jars	hdfs:///user/a-pizzata/artifacts/iceberg-spark-runtime-3.5_2.12-1.6.1.jar
Time taken: 0.085 seconds, Fetched 1 row(s)

SET spark.driver.userClassPathFirst;
key	value
spark.driver.userClassPathFirst	true
Time taken: 0.064 seconds, Fetched 1 row(s)

SET spark.executor.userClassPathFirst;
key	value
spark.executor.userClassPathFirst	true
Time taken: 0.049 seconds, Fetched 1 row(s)

SELECT iceberg_check.system.iceberg_version();
26/05/22 12:46:18 INFO CatalogUtil: Loading custom FileIO implementation: org.apache.iceberg.hadoop.HadoopFileIO
staticinvoke(class org.apache.iceberg.spark.functions.IcebergVersionFunction$IcebergVersionFunctionImpl, StringType, invoke, false, false, true)
1.6.1
Time taken: 0.321 seconds, Fetched 1 row(s)

select count(1) as count, source from xcollazo.mediawiki_history_incremental_v1 group by source;

AllocateBytes(BaseVariableWidthVector.java:462)
	at org.apache.iceberg.shaded.org.apache.arrow.vector.BaseVariableWidthVector.allocateNew(BaseVariableWidthVector.java:420)
	at org.apache.iceberg.shaded.org.apache.arrow.vector.BaseVariableWidthVector.allocateNewSafe(BaseVariableWidthVector.java:394)
	at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.allocateVectorBasedOnOriginalType(VectorizedArrowReader.java:271)
	at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.allocateFieldVector(VectorizedArrowReader.java:217)
	at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.read(VectorizedArrowReader.java:142)
	at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader$ColumnBatchLoader.readDataToColumnVectors(ColumnarBatchReader.java:123)
	at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader$ColumnBatchLoader.loadDataToColumnBatch(ColumnarBatchReader.java:98)
	at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:72)
	at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:44)
	at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:147)
	at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:138)
	at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:146)
	at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:184)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD

This reproduces the error we already know.

Ideally we would try with latest Iceberg 1.11.0 to see if we can repro there,

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64  
export PATH="$JAVA_HOME/bin:$PATH"
java -version   
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
openjdk version "11.0.31" 2026-04-21
OpenJDK Runtime Environment (build 11.0.31+11-post-1-deb11u1-Debian)
OpenJDK 64-Bit Server VM (build 11.0.31+11-post-1-deb11u1-Debian, mixed mode, sharing)
# Annotations for the command below:
#   spark.sql.catalog.iceberg_check.warehouse, spark.driver.userClassPathFirst,
#   spark.executor.userClassPathFirst: amenities to check the iceberg version
#   spark.yarn.appMasterEnv.JAVA_HOME, spark.executorEnv.JAVA_HOME: force Java 11 for iceberg 1.10.1
#   spark.jars: providing my own iceberg runtime
#   spark.sql.iceberg.vectorization.enabled: forcing it for logging purposes
spark35-sql \
  --master yarn \
  --conf spark.executor.cores=2 \
  --conf spark.executor.memory=8g \
  --conf spark.driver.memory=8g \
  --conf spark.sql.shuffle.partitions=768 \
  --conf spark.dynamicAllocation.maxExecutors=128 \
  --conf spark.executor.memoryOverhead=6g \
  --conf spark.sql.legacy.timeParserPolicy=LEGACY \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.iceberg_check=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.iceberg_check.type=hadoop \
  --conf spark.sql.catalog.iceberg_check.warehouse=hdfs:///tmp/iceberg-version-check \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --conf spark.executorEnv.SPARK_HOME=$SPARK_HOME \
  --conf spark.executorEnv.SPARK_CONF_DIR=/etc/spark3/conf \
  --conf spark.yarn.appMasterEnv.SPARK_HOME=$SPARK_HOME \
  --conf spark.eventLog.enabled=false \
  --conf spark.yarn.appMasterEnv.SPARK_CONF_DIR=/etc/spark3/conf \
  --conf spark.yarn.archive=hdfs:///user/a-pizzata/artifacts/spark-3.5.8-jars.zip \
  --conf spark.yarn.appMasterEnv.JAVA_HOME=$JAVA_HOME \
  --conf spark.executorEnv.JAVA_HOME=$JAVA_HOME \
  --conf spark.jars=hdfs:///user/a-pizzata/artifacts/iceberg-spark-runtime-3.5_2.12-1.10.1.jar \
  --conf spark.sql.iceberg.vectorization.enabled=true

SET spark.jars;
key	value
spark.jars	hdfs:///user/a-pizzata/artifacts/iceberg-spark-runtime-3.5_2.12-1.10.1.jar
Time taken: 1.183 seconds, Fetched 1 row(s)

SET spark.driver.userClassPathFirst;
key	value
spark.driver.userClassPathFirst	true
Time taken: 0.052 seconds, Fetched 1 row(s)

SET spark.executor.userClassPathFirst;
key	value
spark.executor.userClassPathFirst	true
Time taken: 0.048 seconds, Fetched 1 row(s)
spark-sql (default)> SELECT iceberg_check.system.iceberg_version();
26/05/22 13:05:28 INFO CatalogUtil: Loading custom FileIO implementation: org.apache.iceberg.hadoop.HadoopFileIO
staticinvoke(class org.apache.iceberg.spark.functions.IcebergVersionFunction$IcebergVersionFunctionImpl, StringType, invoke, false, false, true)
1.10.1
Time taken: 7.159 seconds, Fetched 1 row(s)

SET spark.sql.iceberg.vectorization.enabled;
key	value
spark.sql.iceberg.vectorization.enabled	true
Time taken: 0.043 seconds, Fetched 1 row(s)


select count(1) as count, source from xcollazo.mediawiki_history_incremental_v1 group by source;
26/05/22 13:06:11 INFO BaseMetastoreTableOperations: Refreshing table metadata from new version: hdfs://analytics-hadoop/user/hive/warehouse/xcollazo.db/mediawiki_history_incremental_v1/metadata/00058-2269b2cb-94a0-4129-8ff5-8a1ff105f5eb.metadata.json
26/05/22 13:06:11 INFO BaseMetastoreCatalog: Table loaded by catalog: spark_catalog.xcollazo.mediawiki_history_incremental_v1
26/05/22 13:06:11 INFO SparkScanBuilder: Skipping aggregate pushdown: group by aggregation push down is not supported
26/05/22 13:06:11 INFO SnapshotScan: Scanning table spark_catalog.xcollazo.mediawiki_history_incremental_v1 snapshot 7655307092681748022 created at 2026-05-20T03:36:15.149+00:00 with filter true
26/05/22 13:06:11 INFO BaseDistributedDataScan: Planning file tasks locally for table spark_catalog.xcollazo.mediawiki_history_incremental_v1
26/05/22 13:06:12 INFO SparkPartitioningAwareScan: Reporting UnknownPartitioning with 4620 partition(s) for table spark_catalog.xcollazo.mediawiki_history_incremental_v1
count	source
33406783	events
8347238933	snapshot
Time taken: 80.018 seconds, Fetched 2 row(s)


select count(1) as count, source from apizzata.mediawiki_history_incremental_v1 group by source;
26/05/22 13:07:41 INFO BaseMetastoreTableOperations: Refreshing table metadata from new version: hdfs://analytics-hadoop/user/hive/warehouse/apizzata.db/mediawiki_history_incremental_v1/metadata/00001-8771eac9-2a2d-4462-8970-7e660c55a19f.metadata.json
26/05/22 13:07:41 INFO BaseMetastoreCatalog: Table loaded by catalog: spark_catalog.apizzata.mediawiki_history_incremental_v1
26/05/22 13:07:41 INFO SparkScanBuilder: Skipping aggregate pushdown: group by aggregation push down is not supported
26/05/22 13:07:41 INFO SnapshotScan: Scanning table spark_catalog.apizzata.mediawiki_history_incremental_v1 snapshot 8076055057305837206 created at 2026-05-21T14:18:20.294+00:00 with filter true
26/05/22 13:07:41 INFO BaseDistributedDataScan: Planning file tasks locally for table spark_catalog.apizzata.mediawiki_history_incremental_v1
26/05/22 13:07:41 INFO SparkPartitioningAwareScan: Reporting UnknownPartitioning with 4242 partition(s) for table spark_catalog.apizzata.mediawiki_history_incremental_v1
count	source
33406783	events
8347238933	snapshot
Time taken: 52.466 seconds, Fetched 2 row(s)

With this config, both the table with nested fields (the one in xcollazo db) and the one without nested fields (the one in apizzata db) produce results without errors.

Just for kicks and giggles I tested the same iceberg-1.10.1 with --conf spark.sql.iceberg.vectorization.enabled=false to see the runtime:

select count(1) as count, source from xcollazo.mediawiki_history_incremental_v1 group by source;
26/05/22 13:20:23 INFO BaseMetastoreTableOperations: Refreshing table metadata from new version: hdfs://analytics-hadoop/user/hive/warehouse/xcollazo.db/mediawiki_history_incremental_v1/metadata/00058-2269b2cb-94a0-4129-8ff5-8a1ff105f5eb.metadata.json
26/05/22 13:20:24 INFO BaseMetastoreCatalog: Table loaded by catalog: spark_catalog.xcollazo.mediawiki_history_incremental_v1
26/05/22 13:20:24 INFO SparkScanBuilder: Skipping aggregate pushdown: group by aggregation push down is not supported
26/05/22 13:20:25 INFO SnapshotScan: Scanning table spark_catalog.xcollazo.mediawiki_history_incremental_v1 snapshot 7655307092681748022 created at 2026-05-20T03:36:15.149+00:00 with filter true
26/05/22 13:20:25 INFO BaseDistributedDataScan: Planning file tasks locally for table spark_catalog.xcollazo.mediawiki_history_incremental_v1
26/05/22 13:20:26 INFO SparkPartitioningAwareScan: Reporting UnknownPartitioning with 4618 partition(s) for table spark_catalog.xcollazo.mediawiki_history_incremental_v1
count	source
33406783	events
8347238933	snapshot
Time taken: 104.784 seconds, Fetched 2 row(s)

select count(1) as count, source from apizzata.mediawiki_history_incremental_v1 group by source;
26/05/22 13:22:44 INFO BaseMetastoreTableOperations: Refreshing table metadata from new version: hdfs://analytics-hadoop/user/hive/warehouse/apizzata.db/mediawiki_history_incremental_v1/metadata/00001-8771eac9-2a2d-4462-8970-7e660c55a19f.metadata.json
26/05/22 13:22:44 INFO BaseMetastoreCatalog: Table loaded by catalog: spark_catalog.apizzata.mediawiki_history_incremental_v1
26/05/22 13:22:44 INFO SparkScanBuilder: Skipping aggregate pushdown: group by aggregation push down is not supported
26/05/22 13:22:44 INFO SnapshotScan: Scanning table spark_catalog.apizzata.mediawiki_history_incremental_v1 snapshot 8076055057305837206 created at 2026-05-21T14:18:20.294+00:00 with filter true
26/05/22 13:22:44 INFO BaseDistributedDataScan: Planning file tasks locally for table spark_catalog.apizzata.mediawiki_history_incremental_v1
26/05/22 13:22:44 INFO SparkPartitioningAwareScan: Reporting UnknownPartitioning with 4242 partition(s) for table spark_catalog.apizzata.mediawiki_history_incremental_v1
count	source
33406783	events
8347238933	snapshot
Time taken: 66.715 seconds, Fetched 2 row(s)
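To put the Iceberg 1.10.1 timings from this thread side by side (vectorized: 80.018 s and 52.466 s; non-vectorized: 104.784 s and 66.715 s), a ratio can be computed as sketched below. The helper name slowdown is hypothetical, and these are single runs, so the figures are indicative only.

```shell
# slowdown SLOW_SECONDS FAST_SECONDS
# Print how many times longer the slow run took, rounded to two decimals.
slowdown() {
  awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", slow / fast }'
}

# Non-vectorized vs. vectorized, table with nested fields (xcollazo):
slowdown 104.784 80.018   # prints 1.31
# Non-vectorized vs. vectorized, table without nested fields (apizzata):
slowdown 66.715 52.466    # prints 1.27
```

On these numbers, disabling vectorization costs roughly 27 to 31 percent in wall-clock time for this query.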

Could we update the title and description of this ticket, now that we know a bit more about it, please?

Is it now 100% confirmed that it is a bug with Iceberg version 1.6.1 and nested fields?

xcollazo renamed this task from "Iceberg 1.2.1 JAR seems to clash with 1.6.1 on Spark 3.5 executor classpath" to "Iceberg 1.6.1 bug makes SELECTs fail due to vectorized read path being the default". Tue, Jun 2, 6:10 PM