The merge request at https://gitlab.wikimedia.org/repos/generated-data-platform/datapipelines/-/merge_requests/51 is the key step of the image suggestions data pipeline onboarding as per https://www.mediawiki.org/wiki/Platform_Engineering_Team./Data_Value_Stream/Data_Pipeline_Onboarding/#Onboarding.
The following list specifies pending tasks besides T307362 and T307371.
Tasks
- [ ] The cleanup script fails due to missing Spark session: run it as a non-Spark task
  - not needed anymore, superseded by T307983: Write search index data for image suggestions into a hive table rather than local hdfs files
- fix the schedule_interval cron expression
  - actually due to parameters wrongly passed to default_args, see https://gitlab.wikimedia.org/repos/generated-data-platform/datapipelines/-/commit/fd23d94a6979e36d5742c8f685c7de3dad3b462e
- add explicit descriptions in DataFrame.write calls for better monitoring on https://yarn.wikimedia.org/cluster/scheduler
  - see https://gitlab.wikimedia.org/repos/generated-data-platform/datapipelines/-/commit/4d0e458fa9030ebccb59a3c39e8c3ef13699fbdc
- fix Hive connection error, see https://gitlab.wikimedia.org/repos/generated-data-platform/datapipelines/-/merge_requests/55#note_6700
  - caused by https://stackoverflow.com/a/30707252/10719765, fix at https://gerrit.wikimedia.org/r/c/operations/puppet/+/791612
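
For context on the schedule_interval item: the task does not say which parameters were wrongly passed to default_args, so the following is only a sketch of one common variant of that mistake, assuming Airflow 2.x. In Airflow, default_args is forwarded to the operators of a DAG, so a DAG-level setting such as schedule_interval placed there does not set the DAG's schedule. The DAG id, owner, cron expression, and dates below are hypothetical and not taken from the linked commit.

```python
from datetime import datetime, timedelta

# Task-level defaults: Airflow forwards these to every operator in the DAG.
# DAG-level settings (e.g. schedule_interval) do not belong here.
default_args = {
    "owner": "example-owner",  # hypothetical value
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# DAG-level settings: these go to the DAG constructor itself.
dag_kwargs = {
    "dag_id": "image_suggestions_example",  # hypothetical DAG id
    "schedule_interval": "0 0 * * 1",       # hypothetical cron: Mondays at 00:00
    "start_date": datetime(2022, 5, 1),     # hypothetical start date
    "catchup": False,
    "default_args": default_args,
}


def build_dag():
    """Build the DAG; Airflow is imported lazily so the config can be inspected without it."""
    from airflow import DAG  # assumes Airflow 2.x is installed

    return DAG(**dag_kwargs)
```

Keeping the two dictionaries separate makes it easy to check in review that a schedule or other DAG-level parameter has not slipped into default_args.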