The Hive tables `analytics_platform_eng.image_suggestions_instanceof_cache`, `analytics_platform_eng.image_suggestions_title_cache` and `analytics_platform_eng.image_suggestions_suggestions` are made up of millions of small files on HDFS. This is a well-known HDFS anti-pattern: every file costs NameNode memory and adds per-split overhead to jobs that read the tables.
| **HDFS path** | **# Files** | **Data size in bytes (not replicated)** |
|---|---|---|
| `/user/hive/warehouse/analytics_platform_eng.db/image_suggestions_instanceof_cache` | 2,624,750 | 5,421,414,826 |
| `/user/hive/warehouse/analytics_platform_eng.db/image_suggestions_title_cache` | 2,525,100 | 7,129,736,283 |
| `/user/hive/warehouse/analytics_platform_eng.db/image_suggestions_suggestions` | 2,521,066 | 160,342,855,877 |
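A quick back-of-the-envelope calculation shows how extreme the skew is. Assuming the sizes above are in bytes and targeting roughly 256 MiB per output file (both assumptions, the target size is a conventional HDFS choice, not a project standard):

```python
import math

# Assumption: ~256 MiB per output file is a reasonable HDFS target.
TARGET_FILE_BYTES = 256 * 1024 * 1024

# (current file count, total size in bytes), from the table above.
tables = {
    "image_suggestions_instanceof_cache": (2_624_750, 5_421_414_826),
    "image_suggestions_title_cache": (2_525_100, 7_129_736_283),
    "image_suggestions_suggestions": (2_521_066, 160_342_855_877),
}

for name, (n_files, total_bytes) in tables.items():
    target_files = max(1, math.ceil(total_bytes / TARGET_FILE_BYTES))
    avg_now = total_bytes // n_files
    print(f"{name}: {n_files} files (~{avg_now} bytes/file) "
          f"-> ~{target_files} files")
```

The current average is on the order of a few kilobytes per file; the same data would comfortably fit in a few tens to a few hundred files per table.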
Two things I'd like done:
* Update the code so that new data gets generated over a smaller number of files.
* Devise a job to read / coalesce / replace old data to reduce the existing footprint.
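The second task amounts to a read / coalesce / replace pass over each table's directory. As a minimal local sketch of that workflow, using plain files in place of HDFS (all names here are hypothetical; a real job would more likely use Spark's `repartition()`/`coalesce()` plus an `INSERT OVERWRITE`, or a raw-file merge, depending on the table format):

```python
import os

def coalesce_dir(src_dir: str, dst_dir: str, target_bytes: int) -> int:
    """Concatenate many small files in src_dir into fewer files of
    roughly target_bytes each in dst_dir. Returns the output file count.

    Sketch only: assumes files can be naively concatenated (true for
    raw text/CSV parts, NOT for Parquet/ORC, where a framework-level
    rewrite is needed instead).
    """
    os.makedirs(dst_dir, exist_ok=True)
    out = None
    out_idx = 0
    out_size = 0
    for name in sorted(os.listdir(src_dir)):
        with open(os.path.join(src_dir, name), "rb") as f:
            data = f.read()
        # Start a new output file when the current one would overflow.
        if out is None or out_size + len(data) > target_bytes:
            if out is not None:
                out.close()
            out = open(os.path.join(dst_dir, f"part-{out_idx:05d}"), "wb")
            out_idx += 1
            out_size = 0
        out.write(data)
        out_size += len(data)
    if out is not None:
        out.close()
    return out_idx
```

The replace step (atomically swapping `dst_dir` in for `src_dir` and updating the Hive partition location) is deliberately left out here, since it depends on how the tables are partitioned.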
Related topic: Should we purge historical data?