
Modify Readers table partitions
Closed, Declined · Public

Description

Is there any negative impact on Hive performance and HDFS if you have a table that's partitioned by (year, month, day) but each day only has 1 or 3 rows of data?
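To make the scenario concrete, here is a minimal PySpark sketch (the path and schema are hypothetical) of what such a table looks like on disk: with (year, month, day) partitioning, every day becomes its own HDFS directory, each holding a Parquet file of only a few hundred bytes.

```python
# Minimal sketch (hypothetical path and schema): a (year, month, day)-
# partitioned table where each day holds only a row or two.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tiny-partitions-demo").getOrCreate()

rows = [
    (2024, 1, 1, "a"),  # one row for Jan 1
    (2024, 1, 2, "b"),  # one row for Jan 2
    (2024, 1, 3, "c"),  # one row for Jan 3
]
df = spark.createDataFrame(rows, ["year", "month", "day", "value"])

# Each (year, month, day) combination becomes its own directory, e.g.
# .../year=2024/month=1/day=1/, each containing a tiny Parquet file.
df.write.partitionBy("year", "month", "day").parquet("/tmp/tiny_partitions_demo")
```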

From jallemandou

  • TL;DR: the best file size for our compute tools is 120 MB and up
  • Now some details: small files mean more disk seeks, as opposed to the sequential reads you get with big files; the latter is far more performant hardware-wise (less true for SSDs, but we don't have them)
  • Also, Parquet optimizes data storage and querying through various tricks, the main ones being 1) per-file metadata in the footer, which lets some queries be answered without reading the raw data at all, 2) columnar data organization, so columns you don't need are never read, and 3) columnar compression, which makes it easier to group by or filter values. All of these rely on files not being small, so that each optimization covers a useful amount of data rather than the almost-no-data of small files
  • More partitions also mean more metadata for the Hive metastore to handle, and therefore more pre-computation to determine what to read
  • And finally, small files mean more work for HDFS, as there is far more metadata to store for a file than for a block (merging many small files into a bigger one might mean more blocks, but blocks are cheaper, since they need no path/permissions/ACL management) (a sketch of the usual mitigation, compaction, follows this list)
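The usual mitigation is compaction: read a coarser partition (say, a whole month) and rewrite it as a small number of larger files. Below is a hedged PySpark sketch of that idea; the paths are hypothetical, and this is not necessarily what was done for the Readers tables.

```python
# Sketch of small-file compaction (hypothetical paths): merge a month of
# tiny daily Parquet files into a single larger file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

# Read one month's worth of daily partitions in one go.
month_path = "/wmf/data/readers/year=2024/month=1"  # hypothetical location
df = spark.read.parquet(month_path)

# coalesce(1) funnels all rows into a single output file, so the Parquet
# footer metadata, columnar layout, and compression each cover a useful
# amount of data instead of one to three rows.
df.coalesce(1).write.mode("overwrite").parquet("/tmp/readers_compacted/2024-01")
```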

Refer to the Slack chat for the original discussion.

Event Timeline

kzimmerman moved this task from Triage to Backlog on the Product-Analytics board.
nshahquinn-wmf subscribed.

As part of T359207, I will be converting these intermediate tables to use Iceberg, which removes the need for partitioning.
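For reference, a conversion like that can be expressed with Iceberg's Spark `migrate` procedure, as in the hedged sketch below; the catalog configuration and table name are illustrative assumptions, not details taken from T359207.

```python
# Hedged sketch of a Hive-to-Iceberg conversion via Iceberg's `migrate`
# Spark procedure; the catalog settings and table name are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-migration-sketch")
    # Iceberg's SQL extensions are needed for CALL procedures.
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.iceberg.spark.SparkSessionCatalog",
    )
    .config("spark.sql.catalog.spark_catalog.type", "hive")
    .getOrCreate()
)

# Convert the existing Hive table in place. Iceberg tracks data files in
# its own metadata, so the metastore no longer needs one entry per
# (year, month, day) partition, and small files can later be merged with
# Iceberg's rewrite_data_files maintenance procedure.
spark.sql("CALL spark_catalog.system.migrate('analytics.readers_intermediate')")
```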