Hi,
Over the past 2-3 days, I, and some of my colleagues at EPFL, have observed weird behavior in performance of spark jobs instantiated from jupyter notebooks vs pyspark2 shell on the stat machines. Apart from this difference, everything else is the same: PySpark Yarn (normal) kernel thereby same configs, same driver node, etc. The following query was executed:
query = ("""
select mp.page_id pid, mp.page_title ptitle, mp.page_is_redirect
from wmf_raw.mediawiki_page mp
where mp.wiki_db = '{}'
and mp.snapshot = '{}'
and mp.page_namespace = 0
""")
page = spark.sql(query.format(wiki_project,snapshot_ts))Performing very simple operations such as page.count() or saving this dataframe to a parquet file takes ~30 minutes when triggered from a Jupyter notebook, while less than 1 minute (which is expected) when triggered from pyspark2 shell.
It would be great if someone from analytics can help in identifying the root-cause behind this issue, and also help in getting the same fixed. More details can be provided if needed.
Looking forward, and thanks!
Akhil