Available space on Hadoop / HDFS is currently low, and the trend shows it still decreasing. If we don't act now, we will get into trouble. Long-term capacity planning and the overall strategy for our analytics storage system will be addressed next calendar year; this task is about the short-term actions needed to avoid a storage failure.
The shortage is due to a number of causes occurring at the same time, including at least:
- temporarily increased retention of webrequest data to address a bug in the Unique Devices calculation logic - T375943
- duplication of some webrequest data to support the switch from varnishkafka to haproxy
- duplication of events related to the migration of Refine jobs to Airflow - T356762
- Dumps 2 work requiring additional storage
Short-term actions
- validate whether we can reduce webrequest retention without compromising the Unique Devices metrics (see the first sketch after this list for a rough estimate of the space this would free)
- validate whether we can reduce the storage used by Dumps 2
- ask individual users to clean up their HDFS home directories (unlikely that we can recover much; individual users seem to have < 7T each)
- review the largest HDFS directories (see the second sketch after this list):
- /user/analytics-search
- /user/analytics
- /wmf/data/research
- /wmf/data/discovery
- keep the decommissioned Presto servers racked, so that we can reuse them in case of emergency (240T of raw disk = 80T of usable HDFS space at the default 3x replication factor)
- ...
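
To put rough numbers on the retention question, here is a minimal sketch (Python) for estimating how much raw disk each day of webrequest retention costs. The base path, the current retention figure, and the replication factor are assumptions to verify against the cluster; the script only shells out to the stock `hdfs dfs -du -s` command.

```python
#!/usr/bin/env python3
"""Back-of-the-envelope: HDFS space freed per day of webrequest retention cut."""
import subprocess

# Assumptions to verify: where raw webrequest lives, how many days we
# currently keep, and the replication factor (HDFS default is 3, which
# matches the 240T disk = 80T HDFS arithmetic in the last bullet above).
BASE_PATH = "/wmf/data/raw/webrequest"  # placeholder path
CURRENT_RETENTION_DAYS = 90             # placeholder for the temporarily increased retention
REPLICATION_FACTOR = 3

def logical_bytes(path: str) -> int:
    """Logical (pre-replication) size of `path` via `hdfs dfs -du -s`."""
    out = subprocess.run(
        ["hdfs", "dfs", "-du", "-s", path],
        capture_output=True, text=True, check=True,
    ).stdout
    # First whitespace-separated field is the logical size in bytes.
    return int(out.split()[0])

if __name__ == "__main__":
    per_day = logical_bytes(BASE_PATH) / CURRENT_RETENTION_DAYS
    for cut_days in (7, 14, 30):
        freed_t = per_day * cut_days * REPLICATION_FACTOR / 1024**4
        print(f"cutting retention by {cut_days:>2} days frees ~{freed_t:.1f}T of raw disk")
```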
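
For the directory review, a companion sketch that lists the largest children of a given HDFS directory, biggest first; pointing it at /user gives the per-home-directory sizes mentioned above. It again assumes only the stock `hdfs dfs -du` CLI; because older and newer Hadoop releases print two or three columns respectively, the parser takes the first and last fields of each line.

```python
#!/usr/bin/env python3
"""List the largest children of an HDFS directory, biggest first."""
import subprocess
import sys

def du_children(path: str) -> list:
    """Return (logical_size_bytes, child_path) pairs, largest first."""
    out = subprocess.run(
        ["hdfs", "dfs", "-du", path],
        capture_output=True, text=True, check=True,
    ).stdout
    rows = []
    for line in out.splitlines():
        fields = line.split()
        # `hdfs dfs -du` prints "<size> <path>" or "<size> <space_consumed> <path>"
        # depending on the Hadoop version; size is always first, path last.
        rows.append((int(fields[0]), fields[-1]))
    return sorted(rows, reverse=True)

def human(n: float) -> str:
    """Render a byte count with binary-unit suffixes."""
    for unit in ("B", "K", "M", "G", "T"):
        if n < 1024:
            return f"{n:.1f}{unit}"
        n /= 1024
    return f"{n:.1f}P"

if __name__ == "__main__":
    # Default targets mirror the directories listed in this task.
    for target in sys.argv[1:] or ["/user", "/wmf/data"]:
        print(f"== {target} ==")
        for size, child in du_children(target)[:20]:
            print(f"{human(size):>10}  {child}")
```

Saved as e.g. `hdfs_top_dirs.py` and run against /user, this prints the twenty largest home directories, which should quickly confirm (or refute) the "< 7T each" estimate.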



