We need to make sure that the Iceberg catalog allows querying non-Iceberg datasets, so that both Iceberg and non-Iceberg tables are accessible in the same catalog (otherwise it will be confusing for users).
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | None | | T333013 [Iceberg Migration] Apache Iceberg Migration |
| Resolved | | xcollazo | T335314 Make sure new iceberg data can be queried from Presto and Spark as well as non-Iceberg data |
Event Timeline
T335721 took care of this from the Spark side. We can now run a Spark session that includes both Iceberg and non-Iceberg tables declared on the Hive Metastore, without the need to add any extra configuration:

xcollazo@stat1007:~$ spark3-sql --master yarn --executor-memory 8G --executor-cores 4 --driver-memory 2G --conf spark.dynamicAllocation.maxExecutors=64 ...

spark-sql (default)> select * from xcollazo_iceberg.referrer_daily_iceberg_part_by_date limit 10;
country lang browser_family os_family search_engine num_referrals day
...
Time taken: 16.535 seconds, Fetched 10 row(s)

spark-sql (default)> select * from wmf.referrer_daily limit 10;
country lang browser_family os_family search_engine num_referrals year month day
...
Time taken: 18.836 seconds, Fetched 10 row(s)
Details on how this works at T335721.
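For reference, this is the behavior Iceberg's SparkSessionCatalog provides when it wraps the built-in session catalog: it delegates to the underlying Hive catalog for any table that is not an Iceberg table. A sketch of the relevant spark-defaults properties (the property names and class names are the ones documented by Iceberg; the exact values in our deployment may differ):

```
# Enable Iceberg's SQL extensions.
spark.sql.extensions                  org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
# Replace the default session catalog with Iceberg's wrapper, which falls
# back to the built-in Hive catalog for non-Iceberg tables.
spark.sql.catalog.spark_catalog       org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type  hive
```

This is why no per-session configuration is needed: the fallback happens transparently inside the one catalog.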
For Presto, the story is different.
In T311525, we took care of supporting Iceberg by making it available as its own Presto Catalog:
presto> show catalogs;
Catalog
-------------------
analytics_hive
analytics_iceberg
system
(3 rows)
Query 20230515_203952_00138_6vuze, FINISHED, 15 nodes
Splits: 257 total, 257 done (100.00%)
272ms [0 rows, 0B] [0 rows/s, 0B/s]

Unfortunately, Hive tables work on the analytics_hive catalog but Iceberg tables fail, and vice versa:
presto:analytics_iceberg> select * from analytics_iceberg.wmf.referrer_daily limit 10;
Query 20230516_020606_00017_6vuze failed: Not an Iceberg table: wmf.referrer_daily
presto:analytics_iceberg> select * from analytics_hive.wmf.referrer_daily limit 10;
country | lang | browser_family | os_family | search_engine | num_referrals | year | mo
----------------------+-------+-------------------+-----------+---------------+---------------+------+---
...
(10 rows)
Query 20230516_020622_00018_6vuze, FINISHED, 15 nodes
Splits: 542 total, 537 done (99.08%)
0:05 [7.8K rows, 11.7MB] [1.59K rows/s, 2.39MB/s]
presto:analytics_iceberg> select * from analytics_hive.xcollazo_iceberg.referrer_daily_iceberg_part_by_date limit 10;
Query 20230516_015231_00016_6vuze failed: Unable to create input format org.apache.hadoop.mapred.FileInputFormat
presto:analytics_iceberg> select * from analytics_iceberg.xcollazo_iceberg.referrer_daily_iceberg_part_by_date limit 10;
country | lang | browser_family | os_family | search_engine | num_referrals | day
----------------------+------+----------------+-----------+---------------+---------------+------------
...
(10 rows)
Query 20230516_015215_00015_6vuze, FINISHED, 4 nodes
Splits: 21 total, 18 done (85.71%)
0:01 [15 rows, 31.4KB] [19 rows/s, 40.4KB/s]

AFAICT, this is inherent behavior of the Iceberg implementation on Presto. Even in the latest Presto release, there does not seem to be an equivalent 'hybrid'/wrapper catalog that can handle both Hive and Iceberg tables in one (in contrast to Spark's org.apache.iceberg.spark.SparkSessionCatalog, which does support both at the same time).
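For context on why the two catalogs behave this way: in Presto, each catalog is backed by exactly one connector, so a given table name resolves through either the Hive connector or the Iceberg connector, never both. A sketch of the two catalog property files (connector names are the documented ones; the metastore URI is a placeholder, not our production endpoint):

```
# etc/catalog/analytics_hive.properties
# Hive connector: can read Hive tables, fails on Iceberg metadata.
connector.name=hive-hadoop2
hive.metastore.uri=thrift://example-metastore:9083

# etc/catalog/analytics_iceberg.properties
# Iceberg connector: can read Iceberg tables, rejects plain Hive tables.
connector.name=iceberg
hive.metastore.uri=thrift://example-metastore:9083
```

Both catalogs point at the same Hive Metastore, but the connector chosen by the catalog name decides how table metadata is interpreted.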
For completeness, I investigated whether Trino had any better support for this, and the answer seems to be no.
So it looks like we will either have to:
A) live with two separate catalogs in Presto until we migrate all tables to Iceberg, or
B) come up with our own Presto catalog implementation that supports both table types in one. Perhaps the Presto/Iceberg community would be interested in such a contribution.
CC @JAllemandou.
In our Iceberg Working Session we decided not to pursue either (A) or (B) above. Instead, we will keep Iceberg tables in separate Hive databases so as not to confuse users when they try to query them with Presto. This concludes the needed work for this ticket.
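To illustrate the resulting convention (the wmf_iceberg database name below is hypothetical, for illustration only): users query Hive-backed tables through the analytics_hive catalog, and Iceberg tables, kept in their own databases, through the analytics_iceberg catalog:

```
presto> select * from analytics_hive.wmf.referrer_daily limit 10;
presto> select * from analytics_iceberg.wmf_iceberg.referrer_daily limit 10;
```

Keeping the database split aligned with the catalog split means users never hit the "Not an Iceberg table" / "Unable to create input format" errors shown above.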
(We are also discussing a new functional decomposition of the 56 tables currently under wmf, but this is being done at T337562.)