The Hive tables `analytics_platform_eng.image_suggestions_instanceof_cache`, `analytics_platform_eng.image_suggestions_title_cache` and `analytics_platform_eng.image_suggestions_suggestions` are made up of millions of small files on HDFS. This is a well-known HDFS anti-pattern: every file costs NameNode memory and adds per-split overhead to jobs that read the tables.
| **HDFS path** | **# Files** | **Data size in bytes (not replicated)** |
|---|---|---|
| `/user/hive/warehouse/analytics_platform_eng.db/image_suggestions_instanceof_cache` | 2,624,750 | 5,421,414,826 |
| `/user/hive/warehouse/analytics_platform_eng.db/image_suggestions_title_cache` | 2,525,100 | 7,129,736,283 |
| `/user/hive/warehouse/analytics_platform_eng.db/image_suggestions_suggestions` | 2,521,066 | 160,342,855,877 |
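A quick back-of-the-envelope calculation shows how extreme the skew is. Assuming the sizes above are in bytes and targeting roughly 256 MiB per output file (both assumptions, the target size is a conventional HDFS choice, not a project standard):

```python
import math

# Assumption: ~256 MiB per output file is a reasonable HDFS target.
TARGET_FILE_BYTES = 256 * 1024 * 1024

# (current file count, total size in bytes), from the table above.
tables = {
    "image_suggestions_instanceof_cache": (2_624_750, 5_421_414_826),
    "image_suggestions_title_cache": (2_525_100, 7_129_736_283),
    "image_suggestions_suggestions": (2_521_066, 160_342_855_877),
}

for name, (n_files, total_bytes) in tables.items():
    target_files = max(1, math.ceil(total_bytes / TARGET_FILE_BYTES))
    avg_now = total_bytes // n_files
    print(f"{name}: {n_files} files (~{avg_now} bytes/file) "
          f"-> ~{target_files} files")
```

The current average is on the order of a few kilobytes per file; the same data would comfortably fit in a few tens to a few hundred files per table.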
Two things I'd like done:
* Update the code so that new data gets generated over a smaller number of files.
* Devise a job to read / coalesce / replace old data to reduce the existing footprint.
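The second task amounts to a read / coalesce / replace pass over each table's directory. As a minimal local sketch of that workflow, using plain files in place of HDFS (all names here are hypothetical; a real job would more likely use Spark's `repartition()`/`coalesce()` plus an `INSERT OVERWRITE`, or a raw-file merge, depending on the table format):

```python
import os

def coalesce_dir(src_dir: str, dst_dir: str, target_bytes: int) -> int:
    """Concatenate many small files in src_dir into fewer files of
    roughly target_bytes each in dst_dir. Returns the output file count.

    Sketch only: assumes files can be naively concatenated (true for
    raw text/CSV parts, NOT for Parquet/ORC, where a framework-level
    rewrite is needed instead).
    """
    os.makedirs(dst_dir, exist_ok=True)
    out = None
    out_idx = 0
    out_size = 0
    for name in sorted(os.listdir(src_dir)):
        with open(os.path.join(src_dir, name), "rb") as f:
            data = f.read()
        # Start a new output file when the current one would overflow.
        if out is None or out_size + len(data) > target_bytes:
            if out is not None:
                out.close()
            out = open(os.path.join(dst_dir, f"part-{out_idx:05d}"), "wb")
            out_idx += 1
            out_size = 0
        out.write(data)
        out_size += len(data)
    if out is not None:
        out.close()
    return out_idx
```

The replace step (atomically swapping `dst_dir` in for `src_dir` and updating the Hive partition location) is deliberately left out here, since it depends on how the tables are partitioned.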
Related topic: Should we purge historical data?