
Low available space on Hadoop / HDFS
Closed, ResolvedPublic

Description

We are currently low on available space on Hadoop / HDFS, and the trend shows available space still decreasing. If we don't act now, we will get into trouble. We will address the long term capacity planning and overall strategy on our analytics storage system next calendar year. This task is about the short term actions that need to be taken to avoid storage failure.

This is due to a number of causes happening at the same time. At least:

  • temporary increased retention of webrequest to address a bug in Unique Devices calculation logic - T375943
  • duplication of some webrequest to support the switch from varnishkafka to haproxy
  • duplication of events related to the migration of Refine jobs to Airflow - T356762
  • Dumps 2 work requiring additional storage

Short term actions

  • validate if we can reduce the webrequest retention without compromising the Unique Device metrics
  • validate if we can reduce storage from Dumps 2
  • ask individual users to clean up their HDFS home directories (unlikely that we can recover much, individual users seem to have < 7T per home directory)
  • review the largest HDFS directories
    • /user/analytics-search
    • /user/analytics
    • /wmf/data/research
    • /wmf/data/discovery
  • Keep decommissioned presto servers racked, so that we can reuse them in case of emergency (240T of disk = 80T of HDFS space)
  • ...
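The presto-server figure above follows from HDFS block replication: 240 TB of raw disk yields only ~80 TB of usable HDFS space. A quick sanity check, assuming the HDFS default replication factor of 3 (the factor is not stated explicitly in the task):

```python
# Usable HDFS capacity from raw disk, assuming the default HDFS
# replication factor of 3 (each block is stored three times).
def usable_hdfs_tb(raw_disk_tb: float, replication: int = 3) -> float:
    return raw_disk_tb / replication

# 240 TB of raw disk across the decommissioned presto servers:
print(usable_hdfs_tb(240))  # -> 80.0
```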

Details

Related Changes in GitLab:
  • Temporarily delete snapshots at 7 day mark for wikitext_raw table. (repos/data-engineering/airflow-dags!980, by xcollazo, temp-snapshot-delete-at-7-days → main)

Event Timeline

Gehel updated the task description.

For search-platform, I reviewed the directories; we essentially have the following data. Data sizes are per hdfs dfs -du and should include replication:

  • discovery.cirrus_index is 15TB. This keeps 4 historical imports, it could be reduced to 3 or maybe even 2.
  • discovery.cirrus_index_without_content is also 15TB, apparently drop_mediawiki_snapshots.py (from discolytics) is not properly dropping old data. We should be able to fix this and save 14TB.
  • 25T of wikidata dumps.
  • 25T in /user/analytics-search/.Trash. I suspect that the way we clean up historical data, maybe via drop_mediawiki_snapshots.py or refinery-drop-older-than, is moving data into the trash rather than actually deleting it. 30 days of deleted data apparently adds up to 25TB.
  • 7TB in everything else.

Not sure what to do with these; looking into the snapshots not being dropped and the excessive Trash seems potentially worthwhile.

The dumps 2.0 intermediate table is currently holding this many bytes:

xcollazo@stat1011:~$ hdfs dfs -count /wmf/data/wmf_dumps/wikitext_raw_rc2/*
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
      194131      2303606    151151869470330 /wmf/data/wmf_dumps/wikitext_raw_rc2/data
           1        10604        44165993108 /wmf/data/wmf_dumps/wikitext_raw_rc2/metadata

So ~137 TB of data.
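The ~137 TB figure comes straight from the third column of the hdfs dfs -count output (the byte count for the data directory), interpreted as binary terabytes (TiB):

```python
# Convert the raw byte count reported by `hdfs dfs -count`
# (third column) into binary terabytes (TiB = 2**40 bytes).
def bytes_to_tib(n_bytes: int) -> float:
    return n_bytes / 2**40

data_bytes = 151_151_869_470_330  # /wmf/data/wmf_dumps/wikitext_raw_rc2/data
print(round(bytes_to_tib(data_bytes), 1))  # -> 137.5
```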

Considering we need the space, let's delete all snapshots older than 7 days, even though we typically want to have the last 90 days here.

ssh an-launcher1002.eqiad.wmnet
sudo -u analytics bash

kerberos-run-command analytics spark3-sql \
--driver-cores 8 \
--master yarn \
--conf spark.dynamicAllocation.maxExecutors=128 \
--conf spark.executor.memoryOverhead=3072  \
--executor-memory 8G \
--executor-cores 2 \
--driver-memory 32G


spark-sql (default)> SELECT NOW() - INTERVAL 7 DAYS;
CAST(now() - INTERVAL '7 days' AS TIMESTAMP)
2024-12-02 20:35:30.395
Time taken: 0.974 seconds, Fetched 1 row(s)



spark-sql (default)> CALL spark_catalog.system.expire_snapshots(
                   >   table => 'wmf_dumps.wikitext_raw_rc2',
                   >   older_than => TIMESTAMP '2024-12-02 20:35:30.395',
                   >   max_concurrent_deletes => 50,
                   >   stream_results => true
                   > );
24/12/09 20:38:05 WARN NioEventLoop: Selector.select() returned prematurely 512 times in a row; rebuilding Selector io.netty.channel.nio.SelectedSelectionKeySetSelector@106e6234.
deleted_data_files_count	deleted_manifest_files_count	deleted_manifest_lists_count
2046893	8860	1319
Time taken: 737.888 seconds, Fetched 1 row(s)

And we now have:

xcollazo@stat1011:~$ hdfs dfs -count /wmf/data/wmf_dumps/wikitext_raw_rc2/*
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
      194131       256713     29068054308835 /wmf/data/wmf_dumps/wikitext_raw_rc2/data
           1          425         1673207239 /wmf/data/wmf_dumps/wikitext_raw_rc2/metadata

That's ~26 TB, so we saved a nice ~111 TB. Happy holidays! :D
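The before/after byte counts from hdfs dfs -count confirm the savings (data directory only):

```python
# Space reclaimed by expiring snapshots, from the `hdfs dfs -count`
# byte counts of the data directory before and after the CALL.
TIB = 2**40
before = 151_151_869_470_330
after = 29_068_054_308_835
print(round(before / TIB))            # -> 137
print(round(after / TIB))             # -> 26
print(round((before - after) / TIB))  # -> 111
```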

Screenshot 2024-12-10 at 09.15.28.png (1×2 px, 350 KB)
Beautiful! We reclaimed about 8% of free space!

Gehel triaged this task as High priority. Dec 10 2024, 8:07 PM

The current rate of capacity consumption has been around 10% per month since October 1. If this stays stable, we'll be below 10% free capacity before we are fully back from the end-of-year holiday. That is too close to the limit to be comfortable.
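As a rough runway projection (the 30% free starting point below is a hypothetical figure for illustration; the comment only states the ~10%/month burn rate and the 10% floor):

```python
# Months until free capacity drops below a floor, assuming a constant
# burn rate. The 30% starting free space is a hypothetical example;
# the task only states ~10% consumed per month and a 10% comfort floor.
def months_until(free_pct: float, burn_pct_per_month: float, floor_pct: float) -> float:
    return (free_pct - floor_pct) / burn_pct_per_month

print(months_until(30.0, 10.0, 10.0))  # -> 2.0
```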

Note: While the unique devices backfill and the Dumps 2 renaming are in progress before the holiday break, expect a higher capacity consumption rate.
Both efforts will most likely release the extra consumed space by 12/18, 12/20 at the latest.

As part of the DPE SRE / DE sync-up meeting, the following mid-term solutions could be considered:

  • Dumps 2
    • Short term option: retain only ~30 days for the initial adaptation phase (although 120 days is specced / communicated)
  • Data storage efficiencies
    • Change snappy to gzip or another compression codec for less frequently accessed data (re-computation effort still unclear)
    • Reduce replication factor on older data
  • Leverage Ceph capacity with Iceberg tables
    • Requires performance testing
    • Rack proximity might be a problem … data locality / network topology
    • Ceph space available: 400-500TB (non replicated)
  • Park cold storage data on Ceph
    • To be investigated … no dedicated cold storage space
    • Note: Newer versions of Hadoop support a hot / cold storage concept (within HDFS)
  • Repurpose retired nodes as Hadoop workers
    • 1.5 TB x 80 nodes ≈ 7% additional capacity
    • Next Hadoop workers will have double the space (rolling refresh cycle)
  • Make a tactical investment in hard drives
    • 100 Hadoop workers: swap out 1,000 8 TB drives over the course of several months
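The "1.5 TB x 80 = 7%" line above implies a total raw capacity in the low-petabyte range; working backwards (this total is an inference for illustration, not a figure stated in the task):

```python
# Working backwards from "1.5 TB x 80 = 7%": if 80 repurposed nodes
# at 1.5 TB each add ~7% capacity, the implied total is ~1.7 PB raw.
# This total is an inference, not a figure stated in the task.
added_tb = 1.5 * 80               # 120 TB added by repurposed nodes
implied_total_tb = added_tb / 0.07
print(round(implied_total_tb))    # -> 1714
```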

We're still using up the available space much too quickly to survive the holiday: currently at 12% free, and we're consuming space at about 1.5% per day.
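At the stated numbers the runway is short:

```python
# Days of runway at the stated burn rate: 12% free space remaining,
# consumed at ~1.5% per day.
free_pct = 12.0
burn_pct_per_day = 1.5
print(free_pct / burn_pct_per_day)  # -> 8.0 days
```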

image.png (913×1 px, 96 KB)

We have now added about 4% to the total capacity by adding the old an-presto100[1-5] servers to the cluster.

image.png (910×1 px, 68 KB)

BTullis added a subtask: Unknown Object (Task).Dec 19 2024, 12:16 PM

See {T382372} for the discussion with DC-Ops about a mass hard drive upgrade for 70 active Hadoop worker nodes.
We are considering the relative costs of upgrading 840 drives to either 8TB or 16TB.
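The 840-drive figure works out to 12 drive bays per worker across the 70 active nodes; the raw (pre-replication) capacity of the two upgrade options, assuming all 840 bays are swapped:

```python
# 70 active Hadoop workers x 12 drive bays each = 840 drives
# (the per-worker bay count is inferred from 840 / 70).
workers, bays = 70, 12
drives = workers * bays
print(drives)              # -> 840
print(drives * 8 / 1000)   # raw PB with 8 TB drives  -> 6.72
print(drives * 16 / 1000)  # raw PB with 16 TB drives -> 13.44
```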

Reclaiming additionally retained webrequest data got us back to 28% free disk space.

Screenshot 2024-12-20 at 10.20.49 AM.png (2×3 px, 995 KB)

Closing this task as we've been able to reduce usage sufficiently (by reducing webrequest retention to the normal 90 days). We still need a longer-term strategy for HDFS storage.

Gehel claimed this task.
Jclark-ctr closed subtask Unknown Object (Task) as Resolved.Mar 3 2025, 7:21 PM