Alert: Total files on the analytics-hadoop HDFS cluster are more than the heap can support.
Closed, Resolved · Public

Assigned To
Authored By
BTullis
Jul 24 2023, 10:15 PM

Description

We're seeing an alert at the moment, related to the total number of files on the HDFS cluster.

image.png (281×480 px, 32 KB)

The total number of files has jumped in the last three hours from 90.5 million to around 95 million.

image.png (940×1 px, 69 KB)

https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&from=now-3h&to=now&viewPanel=28

The number that we set in the alert is arbitrary, but it indicates that we may need to increase the amount of Java heap allocated to the HDFS namenode process.
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size
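To check the live value by hand, FilesTotal can be read straight from the NameNode's FSNamesystem JMX bean (presumably the same figure the Grafana panel tracks); a minimal sketch, assuming the active master's hostname and the Hadoop 3 default HTTP port of 9870 (older releases use 50070):

# Hostname and port are assumptions; adjust for the active analytics NameNode.
curl -s 'http://an-master1001.eqiad.wmnet:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' \
  | jq '.beans[0].FilesTotal'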

Event Timeline

BTullis triaged this task as Medium priority. Jul 24 2023, 10:31 PM
BTullis moved this task from Incoming to In Progress on the Data-Platform-SRE board.

This is not super-urgent to fix. It might be related to some work on the Iceberg migration by @xcollazo.
I know that recently @JAllemandou and @Antoine_Quhen have done some work to allow us to find out where the large numbers of files are located, via analysis of our fsimage.

We may be able to clear the alert by removing some files, or we might want to bump the heap and adjust the thresholds.
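For reference, the general fsimage-based approach looks something like the sketch below (this is not necessarily the pipeline that @JAllemandou and @Antoine_Quhen built, and the fsimage path is illustrative): dump the image with the Offline Image Viewer in delimited form, then aggregate entries per top-level directory.

# Illustrative fsimage path; in practice you would work on a recent checkpoint copy.
hdfs oiv -p Delimited -i /srv/backup/fsimage_0000000000123456789 -o fsimage.tsv
# Count namespace entries per top-level directory and show the biggest offenders.
awk -F'\t' 'NR > 1 { split($1, p, "/"); c["/" p[2]]++ } END { for (d in c) print c[d], d }' fsimage.tsv | sort -rn | head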

This is definitely me:

xcollazo@stat1007:~$ hdfs dfs -count -v /user/hive/warehouse/xcollazo_iceberg.db/*
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
   DIR_COUNT   FILE_COUNT       CONTENT_SIZE PATHNAME
           1            0                  0 /user/hive/warehouse/xcollazo_iceberg.db/dumps_test_1
       54989      6325689       501194748894 /user/hive/warehouse/xcollazo_iceberg.db/dumps_test_2
           2            2              37695 /user/hive/warehouse/xcollazo_iceberg.db/dumps_test_3
       19199        57378         6176138585 /user/hive/warehouse/xcollazo_iceberg.db/dumps_test_4
       24194       153989        19410637819 /user/hive/warehouse/xcollazo_iceberg.db/dumps_test_5
          25          157            2915217 /user/hive/warehouse/xcollazo_iceberg.db/referrer_daily_iceberg_part_by_date
           4           72            2522734 /user/hive/warehouse/xcollazo_iceberg.db/referrer_daily_iceberg_part_by_month
        7427      7233158        80241016347 /user/hive/warehouse/xcollazo_iceberg.db/wikitext_raw_rc1

I'll clean up my older experiments, and thank you for the ping; the number of files being generated explains some of the performance issues I have seen for wikitext_raw_rc1. Will investigate!

Ran a couple of hdfs dfs -rm -r -skipTrash. Things look better now:

xcollazo@stat1007:~$ hdfs dfs -count -v /user/hive/warehouse/xcollazo_iceberg.db/*
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
   DIR_COUNT   FILE_COUNT       CONTENT_SIZE PATHNAME
       24194       153989        19410637819 /user/hive/warehouse/xcollazo_iceberg.db/dumps_test_5

I will keep experimenting on wikitext_raw_rc1, which was a big culprit here, but I will keep an eye on it!
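The exact cleanup commands are not recorded in the ticket, but judging from the before-and-after listings they would have been along these lines (paths reconstructed from the earlier count output; -skipTrash releases the namespace entries immediately instead of moving everything to .Trash):

# Reconstructed for illustration only; dumps_test_5 was the directory kept.
hdfs dfs -rm -r -skipTrash /user/hive/warehouse/xcollazo_iceberg.db/dumps_test_2
hdfs dfs -rm -r -skipTrash /user/hive/warehouse/xcollazo_iceberg.db/wikitext_raw_rc1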

I know that recently @JAllemandou and @Antoine_Quhen have done some work to allow us to find out where the large numbers of files are located, via analysis of our fsimage.

For reference, that work can be seen at https://superset.wikimedia.org/superset/dashboard/409/

Thanks @xcollazo. That's interesting; I get an error when trying to access that dashboard.

image.png (360×1 px, 76 KB)

Maybe it's because I'm not in the analytics-admins POSIX group, although I am in ops.
I'll probably add myself to the group and try again.

BTullis moved this task from In Progress to Done on the Data-Platform-SRE board.

Ran a couple of hdfs dfs -rm -r -skipTrash. Things look better now:

xcollazo@stat1007:~$ hdfs dfs -count -v /user/hive/warehouse/xcollazo_iceberg.db/*
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
   DIR_COUNT   FILE_COUNT       CONTENT_SIZE PATHNAME
       24194       153989        19410637819 /user/hive/warehouse/xcollazo_iceberg.db/dumps_test_5

I will keep experimenting on wikitext_raw_rc1, which was a big culprit here, but I will keep an eye on it!

Brilliant! Thanks again. I'll resolve this ticket.

image.png (886×1 px, 56 KB)

Change 961698 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Bump the namenode heap by 4GB on the Hadoop masters

https://gerrit.wikimedia.org/r/961698

Change 961698 merged by Btullis:

[operations/puppet@production] Bump the namenode heap by 4GB on the Hadoop masters

https://gerrit.wikimedia.org/r/961698
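Once the change has been applied and the NameNodes restarted, the effective heap can be confirmed on each master; a minimal sketch, assuming the heap is passed to the JVM via the usual -Xmx flag:

# Show the -Xmx setting of the running NameNode process (run on a Hadoop master).
ps -o args= -p "$(pgrep -f -o org.apache.hadoop.hdfs.server.namenode.NameNode)" | tr ' ' '\n' | grep -- '-Xmx'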

Change 963327 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/alerts@master] Bump the maximum number of HDFS files allowed before triggering an alert

https://gerrit.wikimedia.org/r/963327

Change 963327 merged by Btullis:

[operations/alerts@master] Bump the maximum number of HDFS files before triggering an alert

https://gerrit.wikimedia.org/r/963327
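As a rough sanity check on the new threshold, the rule of thumb often quoted for HDFS is on the order of 150 bytes of NameNode heap per namespace object (file, directory or block), i.e. roughly 1 GB per million objects, so current heap usage can be compared against the configured maximum via JMX; hostname and port are assumptions, as in the earlier example:

# Compare used heap against the configured maximum on the active NameNode.
curl -s 'http://an-master1001.eqiad.wmnet:9870/jmx?qry=java.lang:type=Memory' \
  | jq '.beans[0].HeapMemoryUsage | {used, max}'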