We need to make sure that the Iceberg catalog allows querying non-Iceberg datasets, so that both Iceberg and non-Iceberg tables are accessible in the same catalog (otherwise it will be confusing for users).
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | None | | T333013 [Iceberg Migration] Apache Iceberg Migration |
| Resolved | | xcollazo | T335314 Make sure new iceberg data can be queried from Presto and Spark as well as non-Iceberg data |
Event Timeline
T335721 took care of this from the Spark side. We can now run a Spark session that includes both Iceberg and non-Iceberg tables declared on the Hive Metastore, without the need to add any extra configuration:

xcollazo@stat1007:~$ spark3-sql --master yarn --executor-memory 8G --executor-cores 4 --driver-memory 2G --conf spark.dynamicAllocation.maxExecutors=64 ...

spark-sql (default)> select * from xcollazo_iceberg.referrer_daily_iceberg_part_by_date limit 10;
country lang browser_family os_family search_engine num_referrals day
...
Time taken: 16.535 seconds, Fetched 10 row(s)

spark-sql (default)> select * from wmf.referrer_daily limit 10;
country lang browser_family os_family search_engine num_referrals year month day
...
Time taken: 18.836 seconds, Fetched 10 row(s)
Details on how this works at T335721.
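For reference, this is the behavior Iceberg's SparkSessionCatalog provides when it wraps the built-in session catalog: it delegates to the underlying Hive catalog for any table that is not an Iceberg table. A sketch of the relevant spark-defaults properties (the property names and class names are the ones documented by Iceberg; the exact values in our deployment may differ):

```
# Enable Iceberg's SQL extensions.
spark.sql.extensions                  org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
# Replace the default session catalog with Iceberg's wrapper, which falls
# back to the built-in Hive catalog for non-Iceberg tables.
spark.sql.catalog.spark_catalog       org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type  hive
```

This is why no per-session configuration is needed: the fallback happens transparently inside the one catalog.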
For Presto, the story is different.
In T311525, we took care of supporting Iceberg by making it available as its own Presto Catalog:
presto> show catalogs;
Catalog
-------------------
analytics_hive
analytics_iceberg
system
(3 rows)
Query 20230515_203952_00138_6vuze, FINISHED, 15 nodes
Splits: 257 total, 257 done (100.00%)
272ms [0 rows, 0B] [0 rows/s, 0B/s]

Unfortunately, Hive tables work on the analytics_hive catalog but Iceberg tables fail, and vice versa:
presto:analytics_iceberg> select * from analytics_iceberg.wmf.referrer_daily limit 10;
Query 20230516_020606_00017_6vuze failed: Not an Iceberg table: wmf.referrer_daily
presto:analytics_iceberg> select * from analytics_hive.wmf.referrer_daily limit 10;
country | lang | browser_family | os_family | search_engine | num_referrals | year | mo
----------------------+-------+-------------------+-----------+---------------+---------------+------+---
...
(10 rows)
Query 20230516_020622_00018_6vuze, FINISHED, 15 nodes
Splits: 542 total, 537 done (99.08%)
0:05 [7.8K rows, 11.7MB] [1.59K rows/s, 2.39MB/s]
presto:analytics_iceberg> select * from analytics_hive.xcollazo_iceberg.referrer_daily_iceberg_part_by_date limit 10;
Query 20230516_015231_00016_6vuze failed: Unable to create input format org.apache.hadoop.mapred.FileInputFormat
presto:analytics_iceberg> select * from analytics_iceberg.xcollazo_iceberg.referrer_daily_iceberg_part_by_date limit 10;
country | lang | browser_family | os_family | search_engine | num_referrals | day
----------------------+------+----------------+-----------+---------------+---------------+------------
...
(10 rows)
Query 20230516_015215_00015_6vuze, FINISHED, 4 nodes
Splits: 21 total, 18 done (85.71%)
0:01 [15 rows, 31.4KB] [19 rows/s, 40.4KB/s]

AFAICT, this is inherent behavior of the Iceberg implementation on Presto. Even in the latest Presto release, there does not seem to be an equivalent 'hybrid'/wrapper catalog that can handle both Hive and Iceberg tables in one (in contrast to Spark's org.apache.iceberg.spark.SparkSessionCatalog, which does support both at the same time).
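For context on why the two catalogs behave this way: in Presto, each catalog is backed by exactly one connector, so a given table name resolves through either the Hive connector or the Iceberg connector, never both. A sketch of the two catalog property files (connector names are the documented ones; the metastore URI is a placeholder, not our production endpoint):

```
# etc/catalog/analytics_hive.properties
# Hive connector: can read Hive tables, fails on Iceberg metadata.
connector.name=hive-hadoop2
hive.metastore.uri=thrift://example-metastore:9083

# etc/catalog/analytics_iceberg.properties
# Iceberg connector: can read Iceberg tables, rejects plain Hive tables.
connector.name=iceberg
hive.metastore.uri=thrift://example-metastore:9083
```

Both catalogs point at the same Hive Metastore, but the connector chosen by the catalog name decides how table metadata is interpreted.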
For completeness, I investigated whether Trino had any better support for this, and the answer seems to be no.
So it looks like we will either have to:
A) live with two separate catalogs in Presto until we migrate all tables to Iceberg, or
B) come up with our own Presto catalog implementation that supports both table types in one. Perhaps the Presto/Iceberg community would be interested in such a contribution.
CC @JAllemandou.
In our Iceberg Working Session we decided not to pursue either (A) or (B) above. Instead, we will keep Iceberg tables in separate Hive databases so as not to confuse users when they try to query them with Presto. This concludes the needed work for this ticket.
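To illustrate the resulting convention (the wmf_iceberg database name below is hypothetical, for illustration only): users query Hive-backed tables through the analytics_hive catalog, and Iceberg tables, kept in their own databases, through the analytics_iceberg catalog:

```
presto> select * from analytics_hive.wmf.referrer_daily limit 10;
presto> select * from analytics_iceberg.wmf_iceberg.referrer_daily limit 10;
```

Keeping the database split aligned with the catalog split means users never hit the "Not an Iceberg table" / "Unable to create input format" errors shown above.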
(We are also discussing a new functional decomposition of the 56 tables currently under wmf, but this is being done at T337562.)