Coalesce SEAL output
Closed, Resolved (Public)

Description

Add a coalesce command line option to all SEAL outputs.
This will control the number of output files and avoid putting pressure on the Hadoop cluster.

Implemented as dataframe.coalesce(integer), see image suggestions.

The feature extraction task should be updated first, as it's the one that currently outputs the most files.
The prediction task already has the expected behavior.
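
As a rough illustration of the option described above, here is a minimal sketch of a command line flag wired to dataframe.coalesce(integer). The flag names, the default value and the write_output helper are assumptions made for this example, not the actual SEAL code; df stands for any PySpark DataFrame.

```python
import argparse


def parse_args(argv=None):
    # --output and --coalesce are hypothetical flag names; the real SEAL scripts may differ.
    parser = argparse.ArgumentParser(description="Write a SEAL output dataset")
    parser.add_argument("--output", required=True, help="HDFS path of the output parquet")
    parser.add_argument(
        "--coalesce",
        type=int,
        default=8,  # illustrative default only
        help="Maximum number of output partitions, i.e. parquet part files",
    )
    return parser.parse_args(argv)


def write_output(df, output, coalesce):
    # DataFrame.coalesce(n) merges existing partitions without a shuffle,
    # so the writer emits at most n part files plus the _SUCCESS marker.
    if coalesce < 1:
        raise ValueError("coalesce must be a positive integer")
    df.coalesce(coalesce).write.parquet(output, mode="overwrite")
```

A script would then call write_output(df, args.output, args.coalesce) right before exiting, leaving the rest of the job untouched.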

Details

Title: coalesce outputs with default workable values
Reference: repos/structured-data/seal!2
Author: mfossati
Source Branch: T350009
Dest Branch: main
Event Timeline

mfossati changed the task status from Open to In Progress. (Nov 27 2023, 9:56 AM)
mfossati claimed this task.

Skipping estimation: this ticket can be tackled together with T347558: [S] Coalesce section alignment image suggestions output.

Report

script         coalesce  files before  files after
sections.py    8         2049          9
embeddings.py  100       1025          101
features.py    400       807k          79k

CC @JAllemandou.

Thanks for the ping @mfossati - 79k files are still quite a lot - would you mind telling me more about the data (size, partition scheme, etc.)?

  • The typical size is roughly 60 GB, with 207 folders holding 401 files (coalesce = 400 + _SUCCESS) each.
  • No explicit partitioning.
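
The figures in the list above can be checked with a quick back-of-the-envelope computation. The helper names are made up for this sketch, and the result is an upper bound that assumes every folder really holds the full 400 part files plus _SUCCESS.

```python
def expected_files(folders, partitions_per_folder):
    # Each folder holds one part file per partition plus a _SUCCESS marker.
    return folders * (partitions_per_folder + 1)


def mean_part_file_mb(total_gb, folders, partitions_per_folder):
    # Mean size of a data file in MB (decimal units), ignoring the empty _SUCCESS markers.
    data_files = folders * partitions_per_folder
    return total_gb * 1000 / data_files


print(expected_files(207, 400))                   # 83007
print(round(mean_part_file_mb(60, 207, 400), 2))  # 0.72
```

With roughly 60 GB spread over that many files, each part file is well under 1 MB on average.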

Note that the Spark job responsible for this output is the largest and most complex we have among our team's data pipelines, and usually takes 1 day of computation to complete.
The current implementation writes one parquet per wiki, which results in those 207 folders. Modifying this behavior is out of scope, as it would require a lot of work: I tried a simple solution that writes to a single parquet, and it caused all sorts of trouble for the Spark executors.
Also, further reducing the coalesce value might increase the execution time, which isn't viable either.

I'm happy to hear any quick suggestions you may have in mind.

As discussed on Slack, I would suggest using dataframe.repartition(X) for the features datasets: the data is relatively small, and using coalesce impairs job scalability (and the number of files is far too big in comparison to the data size :).
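
To make the difference concrete, here is a minimal sketch of switching between the two calls. The limit_output_files helper and its shuffle parameter are hypothetical names for this example; DataFrame.repartition and DataFrame.coalesce are the standard PySpark methods.

```python
def limit_output_files(df, num_files, shuffle=True):
    """Return df with num_files partitions (repartition) or at most num_files (coalesce)."""
    if num_files < 1:
        raise ValueError("num_files must be a positive integer")
    if shuffle:
        # repartition() adds a full shuffle: upstream stages keep their
        # parallelism and rows are spread evenly across num_files partitions.
        return df.repartition(num_files)
    # coalesce() avoids the shuffle by merging partitions, which can also
    # reduce the upstream computation to num_files parallel tasks.
    return df.coalesce(num_files)
```

The extra shuffle costs some network traffic, which is usually acceptable when the dataset is small relative to the cluster, as is the case here.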

Many thanks for the valuable suggestion! Implemented in this commit with a default value of 4.
Output files have now dropped to 1k! 🎉

Awesome @mfossati - HDFS and I thank you very much!

Actually deployed to hotfix an active alert.

mfossati@stat1008:~$ hdfs dfs -count -v /user/analytics-platform-eng/structured-data/seal/sections/*
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
   DIR_COUNT   FILE_COUNT       CONTENT_SIZE PATHNAME
...
           1         2049         7313884081 /user/analytics-platform-eng/structured-data/seal/sections/2023-11-20
           1         2049         7344931425 /user/analytics-platform-eng/structured-data/seal/sections/2023-11-27
           1         2049         7344969009 /user/analytics-platform-eng/structured-data/seal/sections/2023-12-04
           1            9         7528323403 /user/analytics-platform-eng/structured-data/seal/sections/2023-12-11
           1            9         7520450888 /user/analytics-platform-eng/structured-data/seal/sections/2023-12-18
mfossati@stat1008:~$ hdfs dfs -count -v /user/analytics-platform-eng/structured-data/seal/embeddings/*
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
   DIR_COUNT   FILE_COUNT       CONTENT_SIZE PATHNAME
...
           1         1025        53708244003 /user/analytics-platform-eng/structured-data/seal/embeddings/2023-11-20
           1         1025        53845093411 /user/analytics-platform-eng/structured-data/seal/embeddings/2023-11-27
           1          101        54047251760 /user/analytics-platform-eng/structured-data/seal/embeddings/2023-12-04
           1          101        54047398565 /user/analytics-platform-eng/structured-data/seal/embeddings/2023-12-11
           1          101        54047703019 /user/analytics-platform-eng/structured-data/seal/embeddings/2023-12-18
mfossati@stat1008:~$ hdfs dfs -count -v /user/analytics-platform-eng/structured-data/seal/features/*
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
   DIR_COUNT   FILE_COUNT       CONTENT_SIZE PATHNAME
...
         207       807127        71551760658 /user/analytics-platform-eng/structured-data/seal/features/2023-11-20
         207       807127        71828400874 /user/analytics-platform-eng/structured-data/seal/features/2023-11-27
         207         1003        91087869243 /user/analytics-platform-eng/structured-data/seal/features/2023-12-04
         207         1003        91098395970 /user/analytics-platform-eng/structured-data/seal/features/2023-12-11
         207         1003        91076634616 /user/analytics-platform-eng/structured-data/seal/features/2023-12-18
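
For repeated checks, the listings above can be turned into a rough mean file size per snapshot. This is a small sketch, not part of the SEAL code base: parse_count_output and mean_file_mb are made-up helper names, and the sample simply reuses two rows from the features listing.

```python
def parse_count_output(text):
    """Parse `hdfs dfs -count -v` rows into dicts, skipping headers and elisions."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows have four fields, the first three being integers.
        if len(parts) != 4 or not all(p.isdigit() for p in parts[:3]):
            continue
        dirs, files, size = (int(p) for p in parts[:3])
        rows.append({"path": parts[3], "dirs": dirs, "files": files, "bytes": size})
    return rows


def mean_file_mb(row):
    # FILE_COUNT includes the empty _SUCCESS markers, so this slightly
    # underestimates the size of a typical part file.
    return row["bytes"] / row["files"] / 1e6


sample = """\
   DIR_COUNT   FILE_COUNT       CONTENT_SIZE PATHNAME
         207       807127        71551760658 /user/analytics-platform-eng/structured-data/seal/features/2023-11-20
         207         1003        91087869243 /user/analytics-platform-eng/structured-data/seal/features/2023-12-04
"""
for row in parse_count_output(sample):
    snapshot = row["path"].rsplit("/", 1)[-1]
    print(snapshot, row["files"], f"{mean_file_mb(row):.2f} MB")
```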

Everything looks good, closing.