Analysts cannot reliably use wmfdata to run SQL queries against Hive databases
Closed, Resolved (Public)
Assigned To

Authored By

	nshahquinn-wmf
	Feb 21 2020, 11:51 PM

Description

Due to the introduction of Kerberos authentication, wmfdata switched to using PySpark to run SQL queries against Hive databases.

However, we are encountering some significant issues with Spark as a backend; this task tracks those issues.

Related Objects

Status	Assigned	Task
Resolved	nshahquinn-wmf	T245891 Analysts cannot reliably use wmfdata to run SQL queries against Hive databases
Resolved	nshahquinn-wmf	T245097 Update wmfdata to use sensible Spark settings
Resolved	nshahquinn-wmf	T245713 wmfdata cannot recover from a crashed Spark session
Declined	None	T245892 Spark application UI shows data for different application
Declined	nshahquinn-wmf	T245896 Spark applications crash when running large queries
Resolved	JAllemandou	T245897 Give clear recommendations for Spark settings
Resolved	nshahquinn-wmf	T246060 Update wmfdata to support multiple SQL engines for Hive databases
Resolved	elukey	T246132 Spark sessions can provision kerberos tickets in a more predictable manner
Declined	nshahquinn-wmf	T247103 wmfdata's Kerberos check should require at least 8 hours of validity

Event Timeline

nshahquinn-wmf created this task. Feb 21 2020, 11:51 PM

nshahquinn-wmf updated the task description.

nshahquinn-wmf added subtasks: T245097: Update wmfdata to use sensible Spark settings, T245713: wmfdata cannot recover from a crashed Spark session.

nshahquinn-wmf renamed this task from Spark query issues to Analysts cannot reliably use Spark to run SQL queries against Hive databases. Feb 22 2020, 12:05 AM

nshahquinn-wmf claimed this task.

nshahquinn-wmf updated the task description.

nshahquinn-wmf added a subscriber: Ottomata.

nshahquinn-wmf added a subscriber: elukey.

nshahquinn-wmf edited projects, added Epic; removed Tracking-Neverending. Feb 22 2020, 12:14 AM

nshahquinn-wmf updated the task description.

@kzimmerman I'm making Phab reflect reality as this is having significant impacts on our work and I've been investigating it extensively for the past couple of weeks. Please let me know if we should discuss deprioritizing this.

I'm also trying to get a handle on the various issues that are happening because they don't manifest predictably, so I'll be refining the breakdown here as I gain a better understanding.

nshahquinn-wmf moved this task from Next 2 weeks to Doing on the Product-Analytics (Kanban) board. Feb 22 2020, 12:29 AM

Ottomata renamed this task from Analysts cannot reliably use Spark to run SQL queries against Hive databases to Analysts cannot reliably use Sparker in Jupyter to run SQL queries against Hive databases. Feb 24 2020, 4:50 PM

@Ottomata, is there a reason you think this is specific to Jupyter? Based on what I've seen, it probably has nothing to do with Jupyter and would manifest the same using PySpark in some other environment.

I suspect T245892 (Spark application UI shows data for different application) and T245713 (wmfdata cannot recover from a crashed Spark session) are related to Jupyter. You are right, though: the others are more general.

LGoto reassigned this task from nshahquinn-wmf to kzimmerman. Feb 24 2020, 8:53 PM

nshahquinn-wmf renamed this task from Analysts cannot reliably use Sparker in Jupyter to run SQL queries against Hive databases to Analysts cannot reliably use Spark to run SQL queries against Hive databases. Feb 25 2020, 1:35 AM

nshahquinn-wmf renamed this task from Analysts cannot reliably use Spark to run SQL queries against Hive databases to Analysts cannot reliably use wmfdata to run SQL queries against Hive databases. Feb 25 2020, 5:42 PM

• Nuria closed subtask T245897: Give clear recommendations for Spark settings as Resolved. Feb 28 2020, 12:17 AM

Milimetric closed subtask T245892: Spark application UI shows data for different application as Declined. Mar 2 2020, 5:07 PM

Monitoring this for any additional subtasks.

Thanks @Milimetric!

Reassigning to @nshahquinn-wmf, who's continuing work on this. I believe he's also going to reach out to Joseph to review wmfdata together, to identify or resolve other issues as needed.

It would be worth considering whether the goal of having one library (wmfdata) that fits all use cases is a suitable one. Given the disparity of data sizes and complexities (for example, traffic data is large but simple, while edit data is small but complex), I am not sure the library path is the most efficient one. A cookbook approach may be better: a set of preexisting Jupyter notebooks, each using a different technology, that serve as recipes somebody can extend, copy, or modify.

A "catalog" of Jupyter notebook recipes, if you will.

nshahquinn-wmf closed subtask T245097: Update wmfdata to use sensible Spark settings as Resolved. Mar 4 2020, 5:20 PM

nshahquinn-wmf added a project: Wmfdata-Python. Mar 11 2020, 7:28 AM

nshahquinn-wmf closed subtask T246060: Update wmfdata to support multiple SQL engines for Hive databases as Resolved. Mar 13 2020, 6:09 PM

Now that wmfdata 1.0 has been released with support for sensible Spark settings (T245097) and running SQL using the Hive CLI (T246060), I think the overall problem has been solved. Thanks to everyone for pitching in to help!
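
For readers wondering what "multiple SQL engines" (T246060) can look like in practice, here is a rough sketch of one way to dispatch a query to a named backend. It is not wmfdata's actual code or API: `register_engine`, `run_sql`, the engine names, and the stand-in backends are all hypothetical, and real backends would call PySpark or the Hive CLI instead of echoing the query.

```python
from typing import Callable, Dict, List

# Hypothetical backend signature: take a SQL string, return rows as dicts.
Engine = Callable[[str], List[dict]]

# Registry mapping an engine name to the callable that runs the query.
_ENGINES: Dict[str, Engine] = {}


def register_engine(name: str, engine: Engine) -> None:
    """Make a backend available under the given name."""
    _ENGINES[name] = engine


def run_sql(query: str, engine: str = "hive-cli") -> List[dict]:
    """Run the query on the chosen backend, failing clearly on unknown names."""
    try:
        backend = _ENGINES[engine]
    except KeyError:
        raise ValueError(
            f"Unknown engine {engine!r}; choose from {sorted(_ENGINES)}"
        ) from None
    return backend(query)


# Stand-in backends (assumptions for illustration only). They just echo
# the query so the dispatch logic can be exercised without a cluster.
def _fake_spark(query: str) -> List[dict]:
    return [{"engine": "spark", "query": query}]


def _fake_hive_cli(query: str) -> List[dict]:
    return [{"engine": "hive-cli", "query": query}]


register_engine("spark", _fake_spark)
register_engine("hive-cli", _fake_hive_cli)
```

With this shape, a user whose Spark session keeps crashing can rerun the same query with `run_sql(query, engine="hive-cli")` without changing anything else in the notebook.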