
Modify Readers table partitions
Closed, Declined · Public

Description

Is there any negative impact on Hive performance and HDFS if you have a table that's partitioned by (year, month, day) but each day only has 1 or 3 rows of data?
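To make the scenario concrete, here is a minimal PySpark sketch (the path and schema are hypothetical) of what such a table looks like on disk: with (year, month, day) partitioning, every day becomes its own HDFS directory, each holding a Parquet file of only a few hundred bytes.

```python
# Minimal sketch (hypothetical path and schema): a (year, month, day)-
# partitioned table where each day holds only a row or two.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tiny-partitions-demo").getOrCreate()

rows = [
    (2024, 1, 1, "a"),  # one row for Jan 1
    (2024, 1, 2, "b"),  # one row for Jan 2
    (2024, 1, 3, "c"),  # one row for Jan 3
]
df = spark.createDataFrame(rows, ["year", "month", "day", "value"])

# Each (year, month, day) combination becomes its own directory, e.g.
# .../year=2024/month=1/day=1/, each containing a tiny Parquet file.
df.write.partitionBy("year", "month", "day").parquet("/tmp/tiny_partitions_demo")
```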

From jallemandou

  • TL;DR: the best file size for our compute tools is 120 MB and up
  • Now some details: small files mean more disk seeks, as opposed to the sequential reads you get with big files; the latter is far more performant hardware-wise (less true for SSDs, but we don't have them)
  • Also, Parquet optimizes data storage and querying through various tricks, the main ones being 1) per-file metadata in the footer, which lets some queries be answered without reading the raw data at all, 2) columnar data organization, so columns you don't need are never read, and 3) columnar compression, which makes it easier to group by or filter values. All of these rely on files not being small, so that each optimization covers a useful amount of data rather than the almost-no-data of small files
  • More partitions also mean more metadata for the Hive metastore to handle, and therefore more pre-computation to determine what to read
  • And finally, small files mean more work for HDFS, as there is far more metadata to store for a file than for a block (merging many small files into a bigger one might mean more blocks, but blocks are cheaper, since they need no path/permissions/ACL management) (a sketch of the usual mitigation, compaction, follows this list)
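The usual mitigation is compaction: read a coarser partition (say, a whole month) and rewrite it as a small number of larger files. Below is a hedged PySpark sketch of that idea; the paths are hypothetical, and this is not necessarily what was done for the Readers tables.

```python
# Sketch of small-file compaction (hypothetical paths): merge a month of
# tiny daily Parquet files into a single larger file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

# Read one month's worth of daily partitions in one go.
month_path = "/wmf/data/readers/year=2024/month=1"  # hypothetical location
df = spark.read.parquet(month_path)

# coalesce(1) funnels all rows into a single output file, so the Parquet
# footer metadata, columnar layout, and compression each cover a useful
# amount of data instead of one to three rows.
df.coalesce(1).write.mode("overwrite").parquet("/tmp/readers_compacted/2024-01")
```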

Refer to the Slack chat for the original discussion.

Event Timeline

kzimmerman moved this task from Triage to Backlog on the Product-Analytics board.
nshahquinn-wmf subscribed.

As part of T359207, I will be converting these intermediate tables to use Iceberg, which removes the need for partitioning.
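For reference, a conversion like that can be expressed with Iceberg's Spark `migrate` procedure, as in the hedged sketch below; the catalog configuration and table name are illustrative assumptions, not details taken from T359207.

```python
# Hedged sketch of a Hive-to-Iceberg conversion via Iceberg's `migrate`
# Spark procedure; the catalog settings and table name are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-migration-sketch")
    # Iceberg's SQL extensions are needed for CALL procedures.
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.iceberg.spark.SparkSessionCatalog",
    )
    .config("spark.sql.catalog.spark_catalog.type", "hive")
    .getOrCreate()
)

# Convert the existing Hive table in place. Iceberg tracks data files in
# its own metadata, so the metastore no longer needs one entry per
# (year, month, day) partition, and small files can later be merged with
# Iceberg's rewrite_data_files maintenance procedure.
spark.sql("CALL spark_catalog.system.migrate('analytics.readers_intermediate')")
```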