Page MenuHomePhabricator

[Iceberg Migration] Migrate aqs hourly tables to Iceberg
Closed, ResolvedPublic5 Estimated Story Points

Description

Migrate aqs hourly tables to Iceberg format.

See this ticket for details of the referrer_daily migration https://phabricator.wikimedia.org/T335305

Success Criteria

  • aqs hourly tables are migrated to Iceberg format (Hive tables will remain for now - there will be a separate ticket to deprecate that will require coordination with consumers)

Details

Related Changes in Gerrit:
Related Changes in GitLab:
TitleReferenceAuthorSource BranchDest Branch
Add iceberg version of aqs_hourly dagrepos/data-engineering/airflow-dags!569tchinaqs-hourly-icebergmain
Customize query in GitLab

Event Timeline

tchin changed the task status from Open to In Progress.Dec 14 2023, 6:16 AM

Change 982869 had a related patch set uploaded (by TChin; author: TChin):

[analytics/refinery@master] [WIP] Add iceberg version of aqs_hourly table

https://gerrit.wikimedia.org/r/982869

Tested on a stat machine with

CREATE EXTERNAL TABLE IF NOT EXISTS `aqs_hourly`(  
    `cache_status`      string     COMMENT 'Cache status',  
    `http_status`       string     COMMENT 'HTTP status of response',  
    `http_method`       string     COMMENT 'HTTP method of request',  
    `response_size`     bigint     COMMENT 'Response size',  
    `uri_host`          string     COMMENT 'Host of request',  
    `uri_path`          string     COMMENT 'Path of request',  
    `request_count`     bigint     COMMENT 'Number of requests',  
    `hour`              timestamp  COMMENT 'The aggregated hour. Covers from minute 00 to 59'  
)  
USING ICEBERG
PARTITIONED BY (days(hour))
;

And

spark3-sql --master yarn --executor-memory 8G --executor-cores 4 --driver-memory 2G --conf spark.dynamicAllocation.maxExecutors=64 \
-f aqs_hourly_iceberg.hql  \
-d source_table=wmf.webrequest \
-d webrequest_source=text \
-d destination_table=tchin.aqs_hourly \
-d coalesce_partitions=1 \
-d year=2023 \
-d month=12 \
-d day=3 \
-d hour=0

Checked that row count is the same

spark-sql (default)> use tchin;
spark-sql (default)> select count(1) from aqs_hourly;
count(1)
523018

spark-sql (default)> select count(1) from wmf.aqs_hourly where year=2023 and month=12 and day=3 and hour=0;
count(1)
523018

Tested to see if the COALESCE hints still work in Iceberg by creating 2 tables and filling then with/without the hint. It still seems to work.

tchin@stat1004:~$ ls -a /mnt/hdfs/user/hive/warehouse/tchin.db/aqs_hourly_coalesce/data/hour_day\=2023-12-03/ | grep parquet | wc -l
1
tchin@stat1004:~$ ls -a /mnt/hdfs/user/hive/warehouse/tchin.db/aqs_hourly_no_coalesce/data/hour_day\=2023-12-03/ | grep parquet | wc -l
200

Change 982869 merged by Aqu:

[analytics/refinery@master] Add iceberg version of aqs_hourly table

https://gerrit.wikimedia.org/r/982869

Change 993478 had a related patch set uploaded (by TChin; author: TChin):

[analytics/refinery@master] Use zstd compression for aqs_hourly

https://gerrit.wikimedia.org/r/993478

Ahoelzl set the point value for this task to 5.Jan 30 2024, 7:14 PM

Change 993478 merged by Phuedx:

[analytics/refinery@master] Use zstd compression for aqs_hourly

https://gerrit.wikimedia.org/r/993478