Coalesce SEAL output
Closed, Resolved, Public

Assigned To

Authored By

	• mfossati
	Oct 30 2023, 10:42 AM

Description

Add a coalesce command line option to all SEAL outputs.
This will control the number of output files and avoid pressure on the Hadoop cluster.

Implemented as dataframe.coalesce(integer), see image suggestions.

The feature extraction task should be updated first, as it's the one that currently outputs the most files.
The prediction task already has the expected behavior.
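
A minimal sketch of what such an option could look like in a PySpark script. The flag name `--coalesce`, the helper names, the default of 8, and the overwrite mode are assumptions made for illustration, not the actual SEAL code; only `DataFrame.coalesce` and `DataFrameWriter.parquet` come from the PySpark API.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Command line parser with a hypothetical --coalesce option."""
    parser = argparse.ArgumentParser(description="Write a Spark dataframe to parquet")
    parser.add_argument("output", help="HDFS path of the output parquet")
    parser.add_argument(
        "--coalesce",
        type=int,
        default=8,  # illustrative default, not necessarily the one used in SEAL
        help="maximum number of output files",
    )
    return parser


def write_output(df, path: str, coalesce: int) -> None:
    """Merge `df` into at most `coalesce` partitions, then write it as parquet.

    `df` is expected to be a pyspark.sql.DataFrame. Spark writes at most one
    data file per partition, plus a _SUCCESS marker.
    """
    if coalesce < 1:
        raise ValueError("coalesce must be a positive integer")
    # mode="overwrite" is an assumption; the ticket does not state the write mode.
    df.coalesce(coalesce).write.parquet(path, mode="overwrite")
```

A script would then call `args = build_parser().parse_args()` followed by `write_output(df, args.output, args.coalesce)`.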

Details

	Title	Reference	Author	Source Branch	Dest Branch
	coalesce outputs with default workable values	repos/structured-data/seal!2	mfossati	T350009	main


Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T340437 [EPIC] Data pipelines maintenance
		Resolved		• mfossati	T350009 Coalesce SEAL output

Event Timeline

• mfossati created this task. Oct 30 2023, 10:42 AM

Restricted Application added a subscriber: Aklapper. Oct 30 2023, 10:42 AM

CC @JAllemandou .

AUgolnikova-WMF moved this task from Triage to Current Work on the Structured-Data-Backlog board. Oct 30 2023, 4:56 PM

AUgolnikova-WMF edited projects, added Structured-Data-Backlog (Current Work); removed Structured-Data-Backlog.

AUgolnikova-WMF moved this task from Incoming to Ready for Estimation on the Structured-Data-Backlog (Current Work) board.

Skipping estimation: this ticket can be tackled together with T347558: [S] Coalesce section alignment image suggestions output.

mfossati opened https://gitlab.wikimedia.org/repos/structured-data/seal/-/merge_requests/2

coalesce outputs with default workable values

Report

script	coalesce	files before	files after
`sections.py`	8	2049	9
`embeddings.py`	100	1025	101
`features.py`	400	807k	79k

CC @JAllemandou .

Thanks for the ping @mfossati - 79k files are still quite a lot - would you mind telling me more about the data (size, partition scheme, etc.)?

In T350009#9374856, @JAllemandou wrote:

Thanks for the ping @mfossati - 79k files are still quite a lot - would you mind telling me more about the data (size, partition scheme, etc.)?

The typical size is roughly 60 GB, spread over 207 folders holding 401 files each (coalesce = 400, plus _SUCCESS).
There is no explicit partitioning.

Note that the Spark job responsible for this output is the largest and most complex among our team's data pipelines, and usually takes 1 day of computation to complete.
The current implementation writes one parquet per wiki, hence the 207 folders. Changing this behavior is out of scope, as it would require a lot of work: I tried a simple solution that writes to a single parquet, and it caused all sorts of trouble for the Spark executors.
Also, further reducing the coalesce value might increase the execution time, which isn't viable either.

I'm happy to hear any quick suggestions you may have in mind.
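
The file counts in the report follow from "one file per partition, plus _SUCCESS, per output folder". A small worked sketch of that arithmetic; the helper name is hypothetical. The result is an upper bound, because coalesce never increases the number of partitions: the reported 79k for `features.py` is somewhat lower than the bound, presumably because some wikis end up with fewer than 400 files (an assumption, the ticket does not say).

```python
def max_output_files(partitions: int, folders: int = 1) -> int:
    """Upper bound on output files: one per partition plus _SUCCESS, per folder."""
    return folders * (partitions + 1)


print(max_output_files(8))         # sections.py: 9
print(max_output_files(100))       # embeddings.py: 101
print(max_output_files(400, 207))  # features.py: 83007 (upper bound)
```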

As discussed on Slack, I would suggest using dataframe.repartition(X) for the features datasets: the data is relatively small, and using coalesce impairs job scalability (also, the number of files is far too big in comparison to the data size :).

In T350009#9379753, @JAllemandou wrote:

As discussed on Slack, I would suggest using dataframe.repartition(X) for the features datasets: the data is relatively small, and using coalesce impairs job scalability (also, the number of files is far too big in comparison to the data size :).

Many thanks for the valuable suggestion! Implemented in this commit with a default value of 4.
Output files have now dropped to 1k! 🎉
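
For context: `coalesce` merges existing partitions without a shuffle, so a low value can also reduce the parallelism of the stage that computes the data, whereas `repartition` performs a full shuffle into the requested number of partitions and leaves the upstream parallelism untouched. A minimal sketch of the difference, assuming a PySpark dataframe; the helper name, the `shuffle` switch, and the overwrite mode are hypothetical, and only the default of 4 comes from the comment above.

```python
def write_parquet(df, path: str, num_files: int = 4, shuffle: bool = True) -> None:
    """Write `df` (a pyspark.sql.DataFrame) as at most `num_files` parquet files.

    shuffle=True  -> repartition: full shuffle, upstream stages keep their parallelism.
    shuffle=False -> coalesce: no shuffle, but upstream work may run on fewer tasks.
    """
    if num_files < 1:
        raise ValueError("num_files must be a positive integer")
    resized = df.repartition(num_files) if shuffle else df.coalesce(num_files)
    # mode="overwrite" is an assumption; the ticket does not state the write mode.
    resized.write.parquet(path, mode="overwrite")
```

With one parquet per wiki, a call such as `write_parquet(wiki_df, wiki_path)` yields up to 4 data files plus _SUCCESS per folder.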

Output files have now dropped to 1k! 🎉

Awesome @mfossati - HDFS and I thank you very much!

mfossati merged https://gitlab.wikimedia.org/repos/structured-data/seal/-/merge_requests/2

coalesce outputs with default workable values

Maintenance_bot removed a project: Patch-For-Review. Dec 13 2023, 4:21 PM

Pending DAG update & deployment: will do after T347558: [S] Coalesce section alignment image suggestions output is merged, too.

Actually deployed to hotfix an active alert.

mfossati@stat1008:~$ hdfs dfs -count -v /user/analytics-platform-eng/structured-data/seal/sections/*
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
   DIR_COUNT   FILE_COUNT       CONTENT_SIZE PATHNAME
...
           1         2049         7313884081 /user/analytics-platform-eng/structured-data/seal/sections/2023-11-20
           1         2049         7344931425 /user/analytics-platform-eng/structured-data/seal/sections/2023-11-27
           1         2049         7344969009 /user/analytics-platform-eng/structured-data/seal/sections/2023-12-04
           1            9         7528323403 /user/analytics-platform-eng/structured-data/seal/sections/2023-12-11
           1            9         7520450888 /user/analytics-platform-eng/structured-data/seal/sections/2023-12-18

mfossati@stat1008:~$ hdfs dfs -count -v /user/analytics-platform-eng/structured-data/seal/embeddings/*
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
   DIR_COUNT   FILE_COUNT       CONTENT_SIZE PATHNAME
...
           1         1025        53708244003 /user/analytics-platform-eng/structured-data/seal/embeddings/2023-11-20
           1         1025        53845093411 /user/analytics-platform-eng/structured-data/seal/embeddings/2023-11-27
           1          101        54047251760 /user/analytics-platform-eng/structured-data/seal/embeddings/2023-12-04
           1          101        54047398565 /user/analytics-platform-eng/structured-data/seal/embeddings/2023-12-11
           1          101        54047703019 /user/analytics-platform-eng/structured-data/seal/embeddings/2023-12-18

mfossati@stat1008:~$ hdfs dfs -count -v /user/analytics-platform-eng/structured-data/seal/features/*
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
   DIR_COUNT   FILE_COUNT       CONTENT_SIZE PATHNAME
...
         207       807127        71551760658 /user/analytics-platform-eng/structured-data/seal/features/2023-11-20
         207       807127        71828400874 /user/analytics-platform-eng/structured-data/seal/features/2023-11-27
         207         1003        91087869243 /user/analytics-platform-eng/structured-data/seal/features/2023-12-04
         207         1003        91098395970 /user/analytics-platform-eng/structured-data/seal/features/2023-12-11
         207         1003        91076634616 /user/analytics-platform-eng/structured-data/seal/features/2023-12-18
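
The listings above can also be read in terms of mean file size, which is what matters for HDFS small-file pressure. A minimal sketch that parses `hdfs dfs -count -v` output as pasted here; the class and function names are hypothetical, and the two sample rows are copied from the features listing. By this measure the features output goes from roughly 89 kB per file to roughly 91 MB per file.

```python
from typing import List, NamedTuple


class CountRow(NamedTuple):
    dir_count: int
    file_count: int
    content_size: int  # bytes
    pathname: str

    @property
    def mean_file_size(self) -> float:
        """Mean bytes per file, or 0.0 for an empty directory."""
        return self.content_size / self.file_count if self.file_count else 0.0


def parse_count_output(text: str) -> List[CountRow]:
    """Parse the data rows of `hdfs dfs -count -v` output."""
    rows = []
    for line in text.splitlines():
        fields = line.split(maxsplit=3)
        # Skip the header, the JAVA_TOOL_OPTIONS notice, and elided ("...") lines.
        if len(fields) != 4 or not all(f.isdigit() for f in fields[:3]):
            continue
        rows.append(CountRow(int(fields[0]), int(fields[1]), int(fields[2]), fields[3]))
    return rows


sample = """\
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
   DIR_COUNT   FILE_COUNT       CONTENT_SIZE PATHNAME
...
         207       807127        71551760658 /user/analytics-platform-eng/structured-data/seal/features/2023-11-20
         207         1003        91087869243 /user/analytics-platform-eng/structured-data/seal/features/2023-12-04
"""

for row in parse_count_output(sample):
    print(f"{row.pathname}: {row.file_count} files, {row.mean_file_size / 1e6:.2f} MB per file")
```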

Everything looks good, closing.
