
Optimize data processing & collection by piping spark query to hdfs
Closed, Resolved · Public

Description

We observe an OutOfMemoryError when executing Spark queries to retrieve:

  • all the revision_ids with a template X
  • revision_id where the template was removed for each page_id

At the moment, the current script has the Spark query return a pandas dataframe directly.

To do:
We will try writing the Spark query result to the Hadoop file system, then loading it back as a pandas dataframe.
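A minimal sketch of the proposed approach, assuming a PySpark setup: instead of calling `toPandas()` on the query result (which collects the full result into driver memory), the Spark DataFrame is written to HDFS as CSV part files, and those parts are later loaded with pandas. The function and path names below are illustrative, not from the actual script.

```python
import glob
import os

import pandas as pd


def write_result_to_hdfs(spark_df, out_path):
    """Write a Spark DataFrame to HDFS as CSV part files.

    The write happens on the executors, so the driver never has to
    materialize the full result in memory (unlike toPandas()).
    `spark_df` is the result of the template query; `out_path` is an
    HDFS directory, e.g. "hdfs:///user/.../template_revisions".
    """
    (spark_df
        .write
        .mode("overwrite")
        .option("header", "true")
        .csv(out_path))


def load_csv_parts(local_dir):
    """Load Spark's part-*.csv files into a single pandas DataFrame.

    Assumes the HDFS output has been fetched to a local directory
    (e.g. via `hdfs dfs -get`). Parts are concatenated one at a time,
    so only the combined table needs to fit in memory, not an extra
    serialized copy as with toPandas().
    """
    parts = sorted(glob.glob(os.path.join(local_dir, "part-*.csv")))
    return pd.concat((pd.read_csv(p) for p in parts), ignore_index=True)
```

If the combined table is still too large for pandas, the per-part loop also allows aggregating each part separately before concatenating.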

Acceptance criteria:
We are able to generate CSVs for all the templates of interest, plus template-removal duration stats, without memory issues.