We need to make sure that the Iceberg catalog allows querying non-Iceberg datasets, so that both Iceberg and non-Iceberg tables are accessible from the same catalog (otherwise it will be confusing for users).
Description
Status | Subtype | Assigned | Task
---|---|---|---
Open | None | | T333013 [Iceberg Migration] Apache Iceberg Migration
Resolved | None | xcollazo | T335314 Make sure new iceberg data can be queried from Presto and Spark as well as non-Iceberg data
Event Timeline
T335721 took care of this from the Spark side. We can now run a Spark session that includes both Iceberg and non-Iceberg tables declared in the Hive Metastore, without the need to add any extra configuration:
```
xcollazo@stat1007:~$ spark3-sql --master yarn --executor-memory 8G --executor-cores 4 --driver-memory 2G --conf spark.dynamicAllocation.maxExecutors=64 ...

spark-sql (default)> select * from xcollazo_iceberg.referrer_daily_iceberg_part_by_date limit 10;
country lang browser_family os_family search_engine num_referrals day
...
Time taken: 16.535 seconds, Fetched 10 row(s)

spark-sql (default)> select * from wmf.referrer_daily limit 10;
country lang browser_family os_family search_engine num_referrals year month day
...
Time taken: 18.836 seconds, Fetched 10 row(s)
```
Details on how this works at T335721.
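(For reference: the hybrid behavior comes from Iceberg's SparkSessionCatalog, which replaces Spark's built-in session catalog and falls back to Hive for non-Iceberg tables. Below is a minimal sketch of the kind of configuration involved; the values shown are illustrative assumptions, and the actual settings live in our Spark defaults per T335721.)

```
# Sketch only: point Spark's built-in session catalog at Iceberg's
# SparkSessionCatalog so that both Iceberg and plain Hive tables resolve
# within one session. Values are illustrative; see T335721 for our real config.
spark3-sql \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive
```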
For Presto, the story is different.
In T311525, we took care of supporting Iceberg by making it available as its own Presto Catalog:
```
presto> show catalogs;
      Catalog
-------------------
 analytics_hive
 analytics_iceberg
 system
(3 rows)

Query 20230515_203952_00138_6vuze, FINISHED, 15 nodes
Splits: 257 total, 257 done (100.00%)
272ms [0 rows, 0B] [0 rows/s, 0B/s]
```
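(For context: each Presto catalog is defined by a properties file under etc/catalog/ that names a connector. A hedged sketch of what the two catalogs above might look like follows; file names and the metastore URI are hypothetical, only the connector names are standard.)

```
# etc/catalog/analytics_hive.properties -- Hive connector (values hypothetical)
connector.name=hive-hadoop2
hive.metastore.uri=thrift://example-metastore:9083

# etc/catalog/analytics_iceberg.properties -- Iceberg connector (values hypothetical)
connector.name=iceberg
hive.metastore.uri=thrift://example-metastore:9083
```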
Unfortunately, Hive tables work in the analytics_hive catalog but Iceberg tables fail there, and vice versa:
```
presto:analytics_iceberg> select * from analytics_iceberg.wmf.referrer_daily limit 10;
Query 20230516_020606_00017_6vuze failed: Not an Iceberg table: wmf.referrer_daily

presto:analytics_iceberg> select * from analytics_hive.wmf.referrer_daily limit 10;
       country        | lang  |  browser_family   | os_family | search_engine | num_referrals | year | mo
----------------------+-------+-------------------+-----------+---------------+---------------+------+---
...
(10 rows)

Query 20230516_020622_00018_6vuze, FINISHED, 15 nodes
Splits: 542 total, 537 done (99.08%)
0:05 [7.8K rows, 11.7MB] [1.59K rows/s, 2.39MB/s]

presto:analytics_iceberg> select * from analytics_hive.xcollazo_iceberg.referrer_daily_iceberg_part_by_date limit 10;
Query 20230516_015231_00016_6vuze failed: Unable to create input format org.apache.hadoop.mapred.FileInputFormat

presto:analytics_iceberg> select * from analytics_iceberg.xcollazo_iceberg.referrer_daily_iceberg_part_by_date limit 10;
       country        | lang | browser_family | os_family | search_engine | num_referrals |    day
----------------------+------+----------------+-----------+---------------+---------------+------------
...
(10 rows)

Query 20230516_015215_00015_6vuze, FINISHED, 4 nodes
Splits: 21 total, 18 done (85.71%)
0:01 [15 rows, 31.4KB] [19 rows/s, 40.4KB/s]
```
AFAICT, this is inherent to the implementation of Iceberg on Presto. Even in the latest Presto release, there doesn't seem to be an equivalent 'hybrid'/wrapper catalog that can handle both Hive and Iceberg tables in one (in contrast to Spark's org.apache.iceberg.spark.SparkSessionCatalog, which does support both at the same time).
For completeness, I investigated whether Trino had any better support for this, and the answer seems to be no.
So it looks like we will either have to:
A) live with two separate catalogs in Presto until we migrate all tables to Iceberg, or
B) come up with our own Presto catalog implementation that supports both table types in one catalog. Perhaps the Presto/Iceberg community would be interested in such a contribution.
CC @JAllemandou.
In our Iceberg Working Session we decided not to pursue (A) or (B) above. Instead, we will keep Iceberg tables in separate Hive databases, so as not to confuse users when they query them with Presto. This concludes the work needed for this ticket.
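(Concretely, the convention means users pick the Presto catalog based on the database a table lives in. A sketch, with a hypothetical Iceberg-only database name:)

```
-- Iceberg tables live in Iceberg-only databases, queried via the Iceberg catalog
-- (database name below is hypothetical):
select * from analytics_iceberg.some_iceberg_db.referrer_daily limit 10;

-- Existing Hive tables stay in their current databases, queried via the Hive catalog:
select * from analytics_hive.wmf.referrer_daily limit 10;
```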
(We are also discussing a new functional decomposition of the 56 tables currently under wmf, but this is being done at T337562.)