
Create user-focused Spark SQL documentation
Open, Medium, Public

Description

Documentation at https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Spark is primarily written for system admins and power users. This Slack thread (WMF-internal; key points summarized below) surfaced the need for more user-centric documentation of how to use WMF's implementation of Spark.

User story:
As a user of the data platform, I need to know what my options are for running Spark SQL queries, and how to pick the right option for use cases that involve large datasets and long-running queries.

Notes/quotes from the original Slack thread by @CDanis and others:
cdanis: "I'd love a user-centric intro to the various options available. Linking out to other docs for details is fine but, as someone whose only encounter with a lot of these tools has been their installations at WMF, it's hard to know what we do and don't have here and what the "usual" patterns should be for usage"
milimetric: "There are some mentions on wikitech but not in a user-centric way, I think we have to refactor https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Spark#How_do_I_... with an understanding that more folks are actually finally using this stuff :)"

Options and info mentioned in the thread:

  • There's no easy way to run Spark SQL queries via Hue
  • It is possible to run them via Superset, but it would take a bit of manual work to set up.
  • If the dataset you're querying isn't too big, you can use Presto via Superset (the webrequest data, for example, is too big for this)
  • You can run Spark SQL queries in Jupyter notebooks, but there are some issues with long-running queries in JupyterHub failing to continuously output results: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Jupyter#Browser_disconnects
  • For long-running Spark jobs/queries, other options are:
    • write the output to a file instead of printing to screen
    • dump the output dataset as Parquet, i.e. use df.write.parquet(...) and write it to your user directory on HDFS (see the PySpark sketch after this list)
    • if it's a small dataset, use DataFrame.toPandas, which materializes the DataFrame into a local, Python-native pandas DataFrame (it's different to work with than Spark because it isn't SQL-ish, and some datasets may still be too large for that; also shown in the PySpark sketch after this list)
    • use wmfdata (it outputs a pandas DataFrame, which makes writing to a file easier) in a Python script or a notebook, have it notify you by email at the end, and run the whole thing inside a screen session (see the wmfdata sketch after this list). See https://nbviewer.org/github/wikimedia/wmfdata-python/blob/main/docs/quickstart.ipynb#Spark for using wmfdata's Spark module. Here's the command if you use a notebook:
jupyter nbconvert --ExecutePreprocessor.timeout=None --to notebook --execute $notebook
  • the bare-metal way: start a spark3-sql --master yarn shell with the appropriate resourcing and run the query in there, usually writing results out to a new table in your own database (see the spark3-sql sketch after this list). Short of not having enough resources or the server getting rebooted, this basically can't fail!
  • Some users prefer the CLI / spark3-shell (the Scala shell), or just pyspark3 for Python
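
For the notebook/Parquet/toPandas options above, here's a minimal PySpark sketch. The query, table names, HDFS path, and app name are placeholders, and on the analytics cluster the SparkSession is often created for you (e.g. by wmfdata or the notebook kernel), so treat this as illustrative rather than as the recommended setup:

  from pyspark.sql import SparkSession

  # Generic fallback; on the cluster a session may already exist or be created by wmfdata.
  spark = SparkSession.builder.appName("example-spark-sql").getOrCreate()

  # Placeholder query: the database, table, and filters are illustrative only.
  df = spark.sql("""
      SELECT page_id, COUNT(*) AS views
      FROM some_database.some_table
      WHERE year = 2024 AND month = 6
      GROUP BY page_id
  """)

  # Large result: write Parquet to your HDFS user directory instead of printing it.
  df.write.mode("overwrite").parquet("hdfs:///user/your_username/example_output")

  # Small result only: materialize into a local, Python-native pandas DataFrame.
  pandas_df = df.toPandas()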
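
For the wmfdata option, a sketch under the assumption that wmfdata.spark.run() behaves as described in the quickstart notebook linked above (runs SQL via Spark and returns a pandas DataFrame); the query and output file name are placeholders:

  import wmfdata

  # Assumed per the wmfdata quickstart: run SQL via Spark, get a pandas DataFrame back.
  result = wmfdata.spark.run("""
      SELECT page_id, COUNT(*) AS views
      FROM some_database.some_table
      GROUP BY page_id
  """)

  # A pandas DataFrame makes writing results to a file straightforward.
  result.to_csv("example_output.csv", index=False)

To survive browser disconnects, put this in a notebook, run it non-interactively with the jupyter nbconvert command above, and keep that command running inside a screen session.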
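
For the bare-metal spark3-sql route, a sketch of what "appropriate resourcing" might look like; the flags shown are standard spark-submit resource options, and the values, database, and table names are illustrative, not recommendations:

  spark3-sql --master yarn \
    --executor-memory 8G \
    --executor-cores 4 \
    --conf spark.dynamicAllocation.maxExecutors=64

  -- Then, inside the SQL shell, write results to a table in your own database
  -- (all names below are placeholders):
  CREATE TABLE your_database.example_output AS
  SELECT page_id, COUNT(*) AS views
  FROM some_database.some_table
  GROUP BY page_id;

Running this inside a screen or tmux session also protects against SSH disconnects.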

Related docs and references:

Event Timeline

TBurmeister triaged this task as Medium priority.
TBurmeister updated the task description.
TBurmeister renamed this task from "User-centric documentation links" to "Create user-focused Spark SQL documentation". Fri, Jun 14, 3:17 PM
TBurmeister removed TBurmeister as the assignee of this task.
TBurmeister updated the task description.

I added a new section with a couple links at https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Spark#Use_PySpark_to_run_SQL_on_Hive_tables. I also updated the task description to capture the key info from the original Slack thread and make it easier for someone else to pick up and work on this task (it's out of scope for my work on Data Platform docs this quarter, so I'm unassigning myself).