
Performance Issues when running Spark/Hive jobs via Jupyter Notebooks
Open, Needs Triage, Public



Over the past 2-3 days, some of my colleagues at EPFL and I have observed a large performance difference between Spark jobs launched from Jupyter notebooks and the same jobs launched from the pyspark2 shell on the stat machines. Apart from this difference, everything else is the same: the PySpark Yarn (normal) kernel, and thereby the same configs, the same driver node, etc. The following query was executed:

query = ("""
select mp.page_id pid, mp.page_title ptitle, mp.page_is_redirect
from wmf_raw.mediawiki_page mp
where mp.wiki_db = '{}'
and mp.snapshot = '{}'
and mp.page_namespace = 0
""")

page = spark.sql(query.format(wiki_project, snapshot_ts))

Performing very simple operations, such as page.count() or saving this dataframe to a Parquet file, takes ~30 minutes when triggered from a Jupyter notebook, but less than 1 minute (as expected) when triggered from the pyspark2 shell.
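To make the notebook-versus-shell comparison reproducible, the same action can be timed in both environments. The following is only a minimal sketch: the helper name `timed` is made up for illustration, and it assumes `page` is the dataframe created by the query above.

```python
import time


def timed(label, action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.1f} s")
    return result, elapsed


# In the notebook and in the pyspark2 shell, with `page` defined as above:
# n_rows, seconds = timed("page.count()", page.count)
```

Running the same line in both places gives two numbers that can be attached to the ticket alongside the Yarn application ids.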

It would be great if someone from Analytics could help identify the root cause of this issue and get it fixed. I can provide more details if needed.

Looking forward, and thanks!

Event Timeline

Can you try the Spark Yarn large kernel?

elukey added a subscriber: elukey. Aug 3 2020, 4:29 PM

Quick follow-up: the main difference between the "regular" shell and the notebooks is that the former has no max Spark workers setting, while the latter are limited (we have normal/large/etc. notebooks). When you get a one-minute result, I am pretty sure it is thanks to the use of a ton of Spark workers (very easy to check from the Yarn application id, we can do a quick test), which may saturate the cluster if other big jobs are running. Please read :)
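One way to check this from the user side is to print the executor-related settings and the Yarn application id in both environments and compare them. This is a rough sketch, not the Analytics team's procedure: the helper names are hypothetical, the list of keys is just a selection of standard Spark properties, and the values in the usage example are invented.

```python
# Standard Spark properties that bound how many executors a session can use.
EXECUTOR_KEYS = (
    "spark.dynamicAllocation.enabled",
    "spark.dynamicAllocation.maxExecutors",
    "spark.executor.instances",
    "spark.executor.cores",
    "spark.executor.memory",
)


def executor_settings(conf_pairs):
    """Pick executor-related settings out of (key, value) pairs,
    the shape returned by SparkConf.getAll()."""
    conf = dict(conf_pairs)
    return {key: conf.get(key, "<not set>") for key in EXECUTOR_KEYS}


def diff_settings(first, second):
    """Return only the keys whose values differ between two settings dicts."""
    return {k: (first[k], second[k]) for k in first if first[k] != second[k]}


# In each environment (notebook and pyspark2 shell):
# sc = spark.sparkContext
# print(sc.applicationId)  # the Yarn application id
# print(executor_settings(sc.getConf().getAll()))
```

For example, with invented values, `diff_settings(executor_settings([("spark.dynamicAllocation.maxExecutors", "64")]), executor_settings([]))` would report only the `maxExecutors` key, which is the kind of difference described above.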

I tried using both the regular and large kernels, and the issue persisted. In fact, I did this debugging with @JAllemandou and the issue persisted, after which he asked me to raise a ticket!

elukey added a comment. Aug 3 2020, 4:33 PM

@Aroraakhil do you have an example of Yarn application id related to a query that takes a minute? (from the shell)

I am curious to see how many workers it uses..

Milimetric added a subscriber: Milimetric. Edited Aug 3 2020, 4:38 PM

Another important detail is that the wmf_raw database usually stores files in Avro format, because it's faster to import that way. As data is processed, we store it in Parquet format (a far more efficient columnar format, with faster queries when using a subset of columns). So perhaps a better table to query would be wmf.mediawiki_page_history. Documented here:

As an example, your query would be:

(the only slight complication is that each page has multiple rows)

select page_id pid,
       -- also take a look at page_title_historical, might be what you want
       page_title ptitle
  from wmf.mediawiki_page_history
 where wiki_db = '{}'
   and snapshot = '{}'
   -- better than page_namespace = 0 because, for example, commonswiki considers ns 6 as content (the File: namespace)
   and page_namespace_is_content
   -- this is how you identify the last "state" of each page in the page_history table (there is only one per page)
   and end_timestamp is null
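
For use from PySpark, the suggested query could be wrapped roughly as follows. This is a sketch under assumptions: the helper `build_page_query` and the named placeholders are introduced here for illustration, `wiki_project` and `snapshot_ts` are the variables from the original report, and the example values in the comments are made up.

```python
PAGE_HISTORY_QUERY = """
select page_id pid,
       page_title ptitle
  from wmf.mediawiki_page_history
 where wiki_db = '{wiki_db}'
   and snapshot = '{snapshot}'
   and page_namespace_is_content
   and end_timestamp is null
"""


def build_page_query(wiki_db, snapshot):
    """Fill the wiki and snapshot placeholders in the query text."""
    return PAGE_HISTORY_QUERY.format(wiki_db=wiki_db, snapshot=snapshot)


# With an active Spark session (values such as 'enwiki' / '2020-07' are
# only illustrative):
# page = spark.sql(build_page_query(wiki_project, snapshot_ts))
# page.count()
```

Named placeholders are used instead of positional `{}` so that the wiki and snapshot arguments cannot be swapped by accident.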