We observe an OutOfMemoryError when executing the Spark queries that retrieve:
- all the revision_ids containing a template X
- for each page_id, the revision_id where the template was removed
At the moment, the Spark query returns a pandas DataFrame directly in the current script.
To do:
We will try writing the Spark query result to the Hadoop file system (HDFS) first, then loading it from there as a pandas DataFrame.
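The loading step could look like the sketch below: Spark writes its result as part files under an output directory, and pandas reads those parts back in bounded-size chunks instead of materializing the whole result through the driver at once. All names here (the output path, column names, `load_spark_csv_output`) are hypothetical, not from the current script.

```python
import glob
import os

import pandas as pd

# On the Spark side (hypothetical DataFrame name), instead of calling
# .toPandas() on the full result, write it out as CSV part files:
#   revisions_df.write.option("header", True).csv("hdfs:///tmp/template_revisions")


def load_spark_csv_output(output_dir, chunksize=100_000):
    """Load the part-*.csv files Spark wrote under output_dir into one
    pandas DataFrame, reading each file in fixed-size chunks so memory
    use is bounded by chunksize rather than by the full result."""
    parts = sorted(glob.glob(os.path.join(output_dir, "part-*.csv")))
    chunks = []
    for path in parts:
        for chunk in pd.read_csv(path, chunksize=chunksize):
            chunks.append(chunk)
    return pd.concat(chunks, ignore_index=True)
```

The final `pd.concat` still needs room for the full table in the Python process; if that is also too large, the same per-chunk loop can compute the duration stats incrementally instead of concatenating.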
Acceptance criteria:
We are able to generate CSV files for all the templates of interest, along with template-removal duration stats, without memory issues.