If you are not familiar with the Apache Iceberg table format, please read the docs first (https://iceberg.apache.org/)
- Prepare the data model
- We wish to partition the data in Iceberg by time using hidden partitioning; the partition granularity will be chosen based on the data size.
- The dataset might be missing an explicit timestamp field, since until now the time data has been encoded in the partition values.
- We then need to add a timestamp field to the new table
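The data-model steps above could translate into DDL along these lines. This is a hedged sketch only: the table name, column names, and the `day_ts` field are placeholders, not the final model (the SQL is shown as a Python string so it can be fed to `spark.sql`):

```python
# Sketch of the Iceberg DDL for the new table. All names here
# (wmf_traffic.referrer_daily, day_ts, the other columns) are
# hypothetical placeholders, not the final data model.
CREATE_TABLE_SQL = """
CREATE TABLE wmf_traffic.referrer_daily (
    country         STRING,
    search_engine   STRING,
    num_referrals   BIGINT,
    day_ts          TIMESTAMP  -- new explicit timestamp field
)
USING iceberg
PARTITIONED BY (days(day_ts))  -- hidden partitioning: readers filter on
                               -- day_ts, Iceberg maps it to daily partitions
"""
# spark.sql(CREATE_TABLE_SQL)  # run once from a Spark session
```

The `days(day_ts)` transform is what gives us hidden partitioning: queries filter on the timestamp column directly, and Iceberg prunes partitions without a separate partition column.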
- Convert old data to the new data model in iceberg
- We will duplicate the dataset, creating a new table with the Iceberg format and the new model
- We will load the old data into the new table as a one-off Spark job.
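The one-off backfill could look like the following sketch. All names are placeholders, and it assumes the old table is Hive-partitioned by `year`/`month`/`day` integer columns (an assumption about the current layout, not a confirmed fact):

```python
# Hypothetical one-off backfill from the old Hive table into the new
# Iceberg table. Table and column names are placeholders; the old table
# is assumed to be partitioned by year/month/day integer columns.
BACKFILL_SQL = """
INSERT INTO wmf_traffic.referrer_daily
SELECT
    country,
    search_engine,
    num_referrals,
    -- Reconstruct the missing timestamp from the Hive partition values
    CAST(MAKE_DATE(year, month, day) AS TIMESTAMP) AS day_ts
FROM wmf.referrer_daily
"""
# spark.sql(BACKFILL_SQL)  # submitted once, e.g. via spark-submit
```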
- Add a new airflow job to automatically insert new data into the new table
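The Airflow job would then run something like this per execution date. Again a sketch: `{{ ds }}` is Airflow's templated run date (YYYY-MM-DD), and the table and column names remain placeholders:

```python
# Hypothetical incremental insert the new Airflow job would submit daily.
# {{ ds }} is Airflow's templated execution date; all names here are
# placeholders for illustration only.
DAILY_INSERT_SQL = """
INSERT INTO wmf_traffic.referrer_daily
SELECT
    country,
    search_engine,
    num_referrals,
    CAST('{{ ds }}' AS TIMESTAMP) AS day_ts
FROM wmf.referrer_daily
WHERE year  = YEAR('{{ ds }}')
  AND month = MONTH('{{ ds }}')
  AND day   = DAY('{{ ds }}')
"""
```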
After T335306 and T335314 are done and we know we can deprecate the old dataset:
- Communicate the deprecation
- Update airflow jobs dependent on the non-iceberg table to use the iceberg table
- Deprecate the old dataset
Edit: Let's do deprecation and removal in a separate task. This way we can compartmentalize the deprecations by table group (say, all tables migrated to wmf_traffic.).
For this first exercise, we have chosen a small dataset, referrer_daily: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/referrer_daily