
Create user-focused Spark SQL documentation
Open, Medium, Public

Description

Documentation at https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Spark is primarily written for system admins and power users. This Slack thread (WMF-internal; key points summarized below) surfaced the need for more user-centric documentation of how to use WMF's implementation of Spark.

User story:
As a user of the data platform, I need to know what my options are for running Spark SQL queries, and how to pick the right option for use cases that involve large datasets and long-running queries.

Notes/quotes from the original Slack thread by @CDanis and others:
cdanis: "I'd love a user-centric intro to the various options available. Linking out to other docs for details is fine but, as someone whose only encounter with a lot of these tools has been their installations at WMF, it's hard to know what we do and don't have here and what the "usual" patterns should be for usage"
milimetric: "There are some mentions on wikitech but not in a user-centric way, I think we have to refactor https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Spark#How_do_I_... with an understanding that more folks are actually finally using this stuff :)"

Options and info mentioned in the thread:

  • There's no easy way to run Spark SQL queries via Hue
  • It is possible to run them via Superset, but it would take a bit of manual work to set up.
  • If the dataset you're querying isn't too big, you can use Presto via Superset (the webrequest data, for example, is too big for this)
  • You can run Spark SQL queries in Jupyter notebooks, but there are some issues with long-running queries in JupyterHub failing to continuously output results: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Jupyter#Browser_disconnects
  • For long-running Spark jobs/queries, other options are:
    • write the output to a file instead of printing to screen
    • dump the output dataset as Parquet, i.e. use df.write.parquet(...) and write it to your user directory on HDFS (see the PySpark sketch after this list)
    • if it's a small dataset, use DataFrame.toPandas, which materializes the DataFrame into a local, Python-native pandas DataFrame (it's different to work with than Spark because it isn't SQL-ish, and some datasets may still be too large for that; also shown in the PySpark sketch after this list)
    • use wmfdata (it outputs a pandas DataFrame, which makes writing to a file easier) in a Python script or a notebook, have it notify you by email at the end, and run the whole thing inside a screen session (see the wmfdata sketch after this list). See https://nbviewer.org/github/wikimedia/wmfdata-python/blob/main/docs/quickstart.ipynb#Spark for using wmfdata's Spark module. Here's the command if you use a notebook:
jupyter nbconvert --ExecutePreprocessor.timeout=None --to notebook --execute $notebook
  • the bare-metal way: start a spark3-sql --master yarn shell with the appropriate resourcing and run the query in there, usually writing results out to a new table in your own database (see the spark3-sql sketch after this list). Short of not having enough resources or the server getting rebooted, this basically can't fail!
  • Some users prefer the CLI / spark3-shell (the Scala shell), or just pyspark3 for Python
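
For the notebook/Parquet/toPandas options above, here's a minimal PySpark sketch. The query, table names, HDFS path, and app name are placeholders, and on the analytics cluster the SparkSession is often created for you (e.g. by wmfdata or the notebook kernel), so treat this as illustrative rather than as the recommended setup:

  from pyspark.sql import SparkSession

  # Generic fallback; on the cluster a session may already exist or be created by wmfdata.
  spark = SparkSession.builder.appName("example-spark-sql").getOrCreate()

  # Placeholder query: the database, table, and filters are illustrative only.
  df = spark.sql("""
      SELECT page_id, COUNT(*) AS views
      FROM some_database.some_table
      WHERE year = 2024 AND month = 6
      GROUP BY page_id
  """)

  # Large result: write Parquet to your HDFS user directory instead of printing it.
  df.write.mode("overwrite").parquet("hdfs:///user/your_username/example_output")

  # Small result only: materialize into a local, Python-native pandas DataFrame.
  pandas_df = df.toPandas()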
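
For the wmfdata option, a sketch under the assumption that wmfdata.spark.run() behaves as described in the quickstart notebook linked above (runs SQL via Spark and returns a pandas DataFrame); the query and output file name are placeholders:

  import wmfdata

  # Assumed per the wmfdata quickstart: run SQL via Spark, get a pandas DataFrame back.
  result = wmfdata.spark.run("""
      SELECT page_id, COUNT(*) AS views
      FROM some_database.some_table
      GROUP BY page_id
  """)

  # A pandas DataFrame makes writing results to a file straightforward.
  result.to_csv("example_output.csv", index=False)

To survive browser disconnects, put this in a notebook, run it non-interactively with the jupyter nbconvert command above, and keep that command running inside a screen session.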
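
For the bare-metal spark3-sql route, a sketch of what "appropriate resourcing" might look like; the flags shown are standard spark-submit resource options, and the values, database, and table names are illustrative, not recommendations:

  spark3-sql --master yarn \
    --executor-memory 8G \
    --executor-cores 4 \
    --conf spark.dynamicAllocation.maxExecutors=64

  -- Then, inside the SQL shell, write results to a table in your own database
  -- (all names below are placeholders):
  CREATE TABLE your_database.example_output AS
  SELECT page_id, COUNT(*) AS views
  FROM some_database.some_table
  GROUP BY page_id;

Running this inside a screen or tmux session also protects against SSH disconnects.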

Related docs and references:

Event Timeline

TBurmeister triaged this task as Medium priority.
TBurmeister updated the task description.
TBurmeister renamed this task from "User-centric documentation links" to "Create user-focused Spark SQL documentation". Fri, Jun 14, 3:17 PM
TBurmeister removed TBurmeister as the assignee of this task.
TBurmeister updated the task description.

I added a new section with a couple links at https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Spark#Use_PySpark_to_run_SQL_on_Hive_tables. I also updated the task description to capture the key info from the original Slack thread and make it easier for someone else to pick up and work on this task (it's out of scope for my work on Data Platform docs this quarter, so I'm unassigning myself).