HDFS space usage steadily increased over the past month
Closed, Resolved · Public

Description

Today I noticed a warning in Icinga about HDFS space used. We have crossed the 2 PB mark:

Screen Shot 2019-11-19 at 12.22.17 PM.png (1×2 px, 144 KB)

https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&panelId=25&fullscreen&from=now-90d&to=now

The last 90d view shows that something changed during the past month, more or less from the last week of October onwards.

elukey@stat1004:~$ sudo -u hdfs hdfs dfs -du -h /
24       768 M    /system
1.2 T    3.7 T    /tmp
66.0 T   198.6 T  /user
73.3 T   219.9 T  /var
547.2 T  1.6 P    /wmf
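
To compare the two columns more easily, here is a minimal Python sketch that parses the listing above. It assumes the first column is the logical size and the second is the disk space consumed including replicas, and that units are printed as a number, a space, and a binary suffix, as in this output. Other Hadoop versions may format this differently. `parse_du` and `DU_OUTPUT` are illustrative names, not part of any Hadoop tooling.

```python
import re

# Multipliers for the binary unit suffixes seen in the `hdfs dfs -du -h` listing.
UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4, "P": 1024**5}

# Sample output copied verbatim from the task description.
DU_OUTPUT = """\
24       768 M    /system
1.2 T    3.7 T    /tmp
66.0 T   198.6 T  /user
73.3 T   219.9 T  /var
547.2 T  1.6 P    /wmf
"""

# Layout assumed: <size> [unit] <space consumed> [unit] <path>
LINE_RE = re.compile(r"^([\d.]+)\s*([KMGTP]?)\s+([\d.]+)\s*([KMGTP]?)\s+(/\S*)$")


def parse_du(text):
    """Return a list of (path, logical_bytes, consumed_bytes) tuples."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not look like du output
        size, size_unit, raw, raw_unit, path = match.groups()
        rows.append(
            (path, float(size) * UNITS[size_unit], float(raw) * UNITS[raw_unit])
        )
    return rows


if __name__ == "__main__":
    # Largest consumers first, with the consumed/logical ratio per directory.
    for path, logical, consumed in sorted(parse_du(DU_OUTPUT), key=lambda r: -r[2]):
        print(f"{path:10s} {consumed / 1024**4:8.1f} T consumed, ratio {consumed / logical:.2f}")
```

For the large directories the ratio comes out close to 3, which is consistent with a replication factor of 3. The ratio for a tiny directory such as /system is not meaningful.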

The /user dir contains some big home dirs:

1.3 T /user/mforns
1.4 T /user/otto
2.5 T /user/west1
2.9 T /user/nuria
4.4 T /user/ebernhardson
5.7 T /user/ezachte
6.0 T /user/nathante
7.6 T /user/dsaez
9.5 T /user/milimetric
9.8 T /user/halfak
14.7 T /user/piccardi
16.4 T /user/joal
23.6 T /user/leila
26.7 T /user/druid
61.4 T /user/hive

And the /var dir contains /var/log/hadoop-yarn/apps logs.

There seems to be ~300T of replicated data added over the past month (~1.7PB to ~2PB), so ~100T unreplicated, assuming the replication factor of 3 that the du output above suggests. Since the space used keeps trending upward, let's figure out what is causing it.
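
The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the replication factor of 3 is inferred from the du output rather than read from the cluster configuration, 1 PB is rounded to 1000 T as the task does, and the 30-day window is an assumption standing in for "the past month". `monthly_growth` is a hypothetical helper name.

```python
def monthly_growth(before_pb, now_pb, replication=3, tb_per_pb=1000, days=30):
    """Estimate replicated and logical growth between two HDFS usage readings.

    replication, tb_per_pb and days are assumptions, not values read from
    the cluster.
    """
    replicated_tb = (now_pb - before_pb) * tb_per_pb
    logical_tb = replicated_tb / replication
    return {
        "replicated_tb": replicated_tb,
        "logical_tb": logical_tb,
        "logical_tb_per_day": logical_tb / days,
    }


if __name__ == "__main__":
    # Readings taken from the task description: ~1.7 PB a month ago, ~2 PB now.
    for name, value in monthly_growth(1.7, 2.0).items():
        print(f"{name}: {value:.1f}")
```

With the readings from the description this gives roughly 300T replicated and 100T unreplicated, or a bit over 3T of new unreplicated data per day.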

Event Timeline

elukey triaged this task as High priority. Nov 19 2019, 11:31 AM
elukey created this task.

Mentioned in SAL (#wikimedia-analytics) [2019-11-19T13:46:20Z] <joal> Deleting 100 heavier log-folders from analytics user (cassandra backfilling logs) -- T238648

Mentioned in SAL (#wikimedia-analytics) [2019-11-19T13:46:51Z] <joal> Deleting old parquet wikitext data (new data is stored in Avro) -- T238648

Mentioned in SAL (#wikimedia-analytics) [2019-11-19T13:54:45Z] <joal> Deleting 600 more log-folders from analytics user (cassandra backfilling logs) -- T238648

Joseph found the root cause, namely mediarequests backfilling creating huge files due to cassandra debug logging (T236698).

fdans claimed this task.