The Hive tables analytics_platform_eng.image_suggestions_instanceof_cache, analytics_platform_eng.image_suggestions_title_cache, and analytics_platform_eng.image_suggestions_suggestions each store their data across roughly 2.5 million small files on HDFS, which is an anti-pattern: the average file size works out to between about 2 KB and 64 KB, far below the default 128 MB HDFS block size, so the NameNode must track millions of near-empty files.
| HDFS path | # Files | Data size (bytes, not replicated) |
|-----------|---------|-----------------------------------|
| /user/hive/warehouse/analytics_platform_eng.db/image_suggestions_instanceof_cache | 2,624,750 | 5,421,414,826 |
| /user/hive/warehouse/analytics_platform_eng.db/image_suggestions_title_cache | 2,525,100 | 7,129,736,283 |
| /user/hive/warehouse/analytics_platform_eng.db/image_suggestions_suggestions | 2,521,066 | 160,342,855,877 |
Two things I'd like to see done, plus a related question:
- Update the code so that new data is generated across a smaller number of files (see the write-side sketch after this list). [DONE]
- Devise a job to read, coalesce, and replace old data to reduce the existing footprint. [Avoided by deleting the old data and setting up a systemd timer; see the purge sketch after this list]
- Related topic: Should we purge historical data? [DONE]
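
A minimal sketch of the write-side fix from the first item, assuming the pipeline writes with PySpark; the staging table name and the `snapshot` partition column are hypothetical placeholders, not the actual pipeline code:

```python
# Sketch only: table names and the `snapshot` partition column are assumed.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("image-suggestions-write")
    .enableHiveSupport()
    .getOrCreate()
)

source_df = spark.read.table("analytics_platform_eng.image_suggestions_staging")

# Repartitioning by the Hive partition column before writing sends all rows
# of a given partition to the same task, so each Hive partition is written
# as one large file instead of one tiny file per task per partition.
(
    source_df
    .repartition("snapshot")
    .write
    .mode("overwrite")
    .insertInto("analytics_platform_eng.image_suggestions_suggestions")
)
```

Repartitioning by the partition column (rather than a blanket `coalesce(1)`) keeps the write parallel across partitions while still collapsing the per-partition file count.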
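And a minimal sketch of the periodic purge a systemd timer could trigger (covering the second and third items), assuming the tables are partitioned by a `snapshot=YYYY-MM-DD` column and a 90-day retention window; both are assumptions, not the real schema or policy:

```python
# Sketch only: partition layout and retention window are assumed.
from datetime import datetime, timedelta

from pyspark.sql import SparkSession

RETENTION_DAYS = 90  # assumed retention window

TABLES = (
    "analytics_platform_eng.image_suggestions_instanceof_cache",
    "analytics_platform_eng.image_suggestions_title_cache",
    "analytics_platform_eng.image_suggestions_suggestions",
)

spark = (
    SparkSession.builder
    .appName("image-suggestions-purge")
    .enableHiveSupport()
    .getOrCreate()
)

cutoff = (datetime.utcnow() - timedelta(days=RETENTION_DAYS)).strftime("%Y-%m-%d")

for table in TABLES:
    # SHOW PARTITIONS returns strings like 'snapshot=2024-01-01'; ISO dates
    # compare correctly as plain strings.
    for row in spark.sql(f"SHOW PARTITIONS {table}").collect():
        value = row[0].split("=", 1)[1]
        if value < cutoff:
            # Dropping the partition removes the metastore entry and, for
            # managed tables, the underlying HDFS files.
            spark.sql(
                f"ALTER TABLE {table} DROP IF EXISTS PARTITION (snapshot='{value}')"
            )
```

The timer's service unit would just `spark-submit` this script on whatever cadence the retention policy requires.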