Alert: Total files on the analytics-hadoop HDFS cluster are more than the heap can support.
Closed, Resolved · Public

Assigned To
Authored By
BTullis
Jul 24 2023, 10:15 PM

Description

We're seeing an alert at the moment, related to the total number of files on the HDFS cluster.

image.png (281×480 px, 32 KB)

The total number of files has jumped in the last three hours from 90.5 million to around 95 million.

image.png (940×1 px, 69 KB)

https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&from=now-3h&to=now&viewPanel=28

The number that we set in the alert is arbitrary, but it indicates that we may need to increase the amount of Java heap allocated to the HDFS namenode process.
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size
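To check the live value by hand, FilesTotal can be read straight from the NameNode's FSNamesystem JMX bean (presumably the same figure the Grafana panel tracks); a minimal sketch, assuming the active master's hostname and the Hadoop 3 default HTTP port of 9870 (older releases use 50070):

# Hostname and port are assumptions; adjust for the active analytics NameNode.
curl -s 'http://an-master1001.eqiad.wmnet:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' \
  | jq '.beans[0].FilesTotal'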

Event Timeline

BTullis triaged this task as Medium priority. Jul 24 2023, 10:31 PM
BTullis moved this task from Incoming to In Progress on the Data-Platform-SRE board.

This is not super-urgent to fix. It might be related to some work on the Iceberg migration by @xcollazo.
I know that recently @JAllemandou and @Antoine_Quhen have done some work to allow us to find out where the large numbers of files are located, via analysis of our fsimage.

We may be able to clear the alert by removing some files, or we might want to bump the heap and adjust the thresholds.
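For reference, the general fsimage-based approach looks something like the sketch below (this is not necessarily the pipeline that @JAllemandou and @Antoine_Quhen built, and the fsimage path is illustrative): dump the image with the Offline Image Viewer in delimited form, then aggregate entries per top-level directory.

# Illustrative fsimage path; in practice you would work on a recent checkpoint copy.
hdfs oiv -p Delimited -i /srv/backup/fsimage_0000000000123456789 -o fsimage.tsv
# Count namespace entries per top-level directory and show the biggest offenders.
awk -F'\t' 'NR > 1 { split($1, p, "/"); c["/" p[2]]++ } END { for (d in c) print c[d], d }' fsimage.tsv | sort -rn | head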

This is definitely me:

xcollazo@stat1007:~$ hdfs dfs -count -v /user/hive/warehouse/xcollazo_iceberg.db/*
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
   DIR_COUNT   FILE_COUNT       CONTENT_SIZE PATHNAME
           1            0                  0 /user/hive/warehouse/xcollazo_iceberg.db/dumps_test_1
       54989      6325689       501194748894 /user/hive/warehouse/xcollazo_iceberg.db/dumps_test_2
           2            2              37695 /user/hive/warehouse/xcollazo_iceberg.db/dumps_test_3
       19199        57378         6176138585 /user/hive/warehouse/xcollazo_iceberg.db/dumps_test_4
       24194       153989        19410637819 /user/hive/warehouse/xcollazo_iceberg.db/dumps_test_5
          25          157            2915217 /user/hive/warehouse/xcollazo_iceberg.db/referrer_daily_iceberg_part_by_date
           4           72            2522734 /user/hive/warehouse/xcollazo_iceberg.db/referrer_daily_iceberg_part_by_month
        7427      7233158        80241016347 /user/hive/warehouse/xcollazo_iceberg.db/wikitext_raw_rc1

I'll clean up my older experiments, and thank you for the ping; the number of files being generated explains some of the performance issues I have seen for wikitext_raw_rc1. Will investigate!

Ran a couple of hdfs dfs -rm -r -skipTrash. Things look better now:

xcollazo@stat1007:~$ hdfs dfs -count -v /user/hive/warehouse/xcollazo_iceberg.db/*
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
   DIR_COUNT   FILE_COUNT       CONTENT_SIZE PATHNAME
       24194       153989        19410637819 /user/hive/warehouse/xcollazo_iceberg.db/dumps_test_5

I will keep experimenting on wikitext_raw_rc1, which was a big culprit here, but I will keep an eye on it!
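The exact cleanup commands are not recorded in the ticket, but judging from the before-and-after listings they would have been along these lines (paths reconstructed from the earlier count output; -skipTrash releases the namespace entries immediately instead of moving everything to .Trash):

# Reconstructed for illustration only; dumps_test_5 was the directory kept.
hdfs dfs -rm -r -skipTrash /user/hive/warehouse/xcollazo_iceberg.db/dumps_test_2
hdfs dfs -rm -r -skipTrash /user/hive/warehouse/xcollazo_iceberg.db/wikitext_raw_rc1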

I know that recently @JAllemandou and @Antoine_Quhen have done some work to allow us to find out where the large numbers of files are located, via analysis of our fsimage.

For reference, that work can be seen at https://superset.wikimedia.org/superset/dashboard/409/

Thanks @xcollazo. That's interesting; I get an error when trying to access that dashboard.

image.png (360×1 px, 76 KB)

Maybe it's because I'm not in the analytics-admins POSIX group, although I am in ops.
I'll probably add myself to the group and try again.

BTullis moved this task from In Progress to Done on the Data-Platform-SRE board.

Ran a couple of hdfs dfs -rm -r -skipTrash. Things look better now:

xcollazo@stat1007:~$ hdfs dfs -count -v /user/hive/warehouse/xcollazo_iceberg.db/*
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
   DIR_COUNT   FILE_COUNT       CONTENT_SIZE PATHNAME
       24194       153989        19410637819 /user/hive/warehouse/xcollazo_iceberg.db/dumps_test_5

I will keep experimenting on wikitext_raw_rc1, which was a big culprit here, but I will keep an eye on it!

Brilliant! Thanks again. I'll resolve this ticket.

image.png (886×1 px, 56 KB)

Change 961698 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Bump the namenode heap by 4GB on the Hadoop masters

https://gerrit.wikimedia.org/r/961698

Change 961698 merged by Btullis:

[operations/puppet@production] Bump the namenode heap by 4GB on the Hadoop masters

https://gerrit.wikimedia.org/r/961698
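Once the change has been applied and the NameNodes restarted, the effective heap can be confirmed on each master; a minimal sketch, assuming the heap is passed to the JVM via the usual -Xmx flag:

# Show the -Xmx setting of the running NameNode process (run on a Hadoop master).
ps -o args= -p "$(pgrep -f -o org.apache.hadoop.hdfs.server.namenode.NameNode)" | tr ' ' '\n' | grep -- '-Xmx'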

Change 963327 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/alerts@master] Bump the maximum number of HDFS files allowed before triggering an alert

https://gerrit.wikimedia.org/r/963327

Change 963327 merged by Btullis:

[operations/alerts@master] Bump the maximum number of HDFS files before triggering an alert

https://gerrit.wikimedia.org/r/963327
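As a rough sanity check on the new threshold, the rule of thumb often quoted for HDFS is on the order of 150 bytes of NameNode heap per namespace object (file, directory or block), i.e. roughly 1 GB per million objects, so current heap usage can be compared against the configured maximum via JMX; hostname and port are assumptions, as in the earlier example:

# Compare used heap against the configured maximum on the active NameNode.
curl -s 'http://an-master1001.eqiad.wmnet:9870/jmx?qry=java.lang:type=Memory' \
  | jq '.beans[0].HeapMemoryUsage | {used, max}'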