
Optimize data processing & collection by piping spark query to hdfs
Closed, Resolved · Public

Description

We observe an OutOfMemoryError when executing Spark queries to retrieve:

  • all the revision_ids with a template X
  • revision_id where the template was removed for each page_id

At the moment, the current script has the Spark query return a pandas dataframe directly.

To do:
We will try writing the Spark query result to the Hadoop file system, then loading it back as a pandas dataframe.
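A minimal sketch of the proposed approach, assuming a PySpark setup: instead of calling `toPandas()` on the query result (which collects the full result into driver memory), the Spark DataFrame is written to HDFS as CSV part files, and those parts are later loaded with pandas. The function and path names below are illustrative, not from the actual script.

```python
import glob
import os

import pandas as pd


def write_result_to_hdfs(spark_df, out_path):
    """Write a Spark DataFrame to HDFS as CSV part files.

    The write happens on the executors, so the driver never has to
    materialize the full result in memory (unlike toPandas()).
    `spark_df` is the result of the template query; `out_path` is an
    HDFS directory, e.g. "hdfs:///user/.../template_revisions".
    """
    (spark_df
        .write
        .mode("overwrite")
        .option("header", "true")
        .csv(out_path))


def load_csv_parts(local_dir):
    """Load Spark's part-*.csv files into a single pandas DataFrame.

    Assumes the HDFS output has been fetched to a local directory
    (e.g. via `hdfs dfs -get`). Parts are concatenated one at a time,
    so only the combined table needs to fit in memory, not an extra
    serialized copy as with toPandas().
    """
    parts = sorted(glob.glob(os.path.join(local_dir, "part-*.csv")))
    return pd.concat((pd.read_csv(p) for p in parts), ignore_index=True)
```

If the combined table is still too large for pandas, the per-part loop also allows aggregating each part separately before concatenating.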

Acceptance criteria:
We are able to generate CSVs for all the templates of interest, plus template-removal duration stats, without memory issues.