Analysts cannot reliably use wmfdata to run SQL queries against Hive databases
Closed, Resolved · Public

Description

Due to the introduction of Kerberos authentication, wmfdata switched to using PySpark to run SQL queries against Hive databases.

However, we are encountering some significant issues with Spark as a backend; this task tracks those issues.

Event Timeline

nshahquinn-wmf renamed this task from Spark query issues to Analysts cannot reliably use Spark to run SQL queries against Hive databases. Feb 22 2020, 12:05 AM
nshahquinn-wmf claimed this task.
nshahquinn-wmf updated the task description.
nshahquinn-wmf added a subscriber: Ottomata.
nshahquinn-wmf added a subscriber: elukey.
nshahquinn-wmf added a subscriber: kzimmerman.

@kzimmerman I'm making Phab reflect reality as this is having significant impacts on our work and I've been investigating it extensively for the past couple of weeks. Please let me know if we should discuss deprioritizing this.

I'm also trying to get a handle on the various issues that are happening because they don't manifest predictably, so I'll be refining the breakdown here as I gain a better understanding.

Ottomata renamed this task from Analysts cannot reliably use Spark to run SQL queries against Hive databases to Analysts cannot reliably use Sparker in Jupyter to run SQL queries against Hive databases. Feb 24 2020, 4:50 PM

@Ottomata, is there a reason you think this is specific to Jupyter? Based on what I've seen, it probably has nothing to do with Jupyter and would manifest the same using PySpark in some other environment.

nshahquinn-wmf renamed this task from Analysts cannot reliably use Sparker in Jupyter to run SQL queries against Hive databases to Analysts cannot reliably use Spark to run SQL queries against Hive databases. Feb 25 2020, 1:35 AM
nshahquinn-wmf renamed this task from Analysts cannot reliably use Spark to run SQL queries against Hive databases to Analysts cannot reliably use wmfdata to run SQL queries against Hive databases. Feb 25 2020, 5:42 PM
Milimetric moved this task from Incoming to Radar on the Analytics board.
Milimetric subscribed.

Monitoring this for any additional subtasks.

Thanks @Milimetric!

Reassigning to @nshahquinn-wmf, who's continuing work on this. I believe he's also going to reach out to Joseph to review wmfdata together, to identify or resolve other issues as needed.

It is worth considering whether the goal of having one library (wmfdata) that fits all use cases is a suitable one. Given the disparity of data sizes and complexities (for example, traffic data is large but simple, while edit data is small but complex), I am not sure the library path is the most efficient one. A cookbook approach using different Jupyter notebooks with different technologies may be better: recipes, in the form of preexisting notebooks, that somebody can extend, copy, or modify.

A "catalog" of Jupyter notebook recipes, if you will.

Now that wmfdata 1.0 has been released with support for sensible Spark settings (T245097) and running SQL using the Hive CLI (T246060), I think the overall problem has been solved. Thanks to everyone for pitching in to help!
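
For readers unfamiliar with the Hive CLI route mentioned above, here is a minimal sketch of the general idea: shell out to `hive`, then parse its tab-separated output. This is not wmfdata's actual implementation or API. The function names (`build_hive_command`, `parse_hive_output`, `run_hive`) are hypothetical, and the sketch assumes the `hive` binary is on the PATH and that the user already holds a valid Kerberos ticket.

```python
import csv
import io
import subprocess


def build_hive_command(query):
    """Build the argument list for a non-interactive Hive CLI call."""
    return [
        "hive",
        "-S",  # silent mode: keep progress logging out of the results
        "--hiveconf", "hive.cli.print.header=true",  # print column names first
        "-e", query,  # run this query string and exit
    ]


def parse_hive_output(text):
    """Parse tab-separated Hive CLI output into a list of dicts.

    All values come back as strings; type conversion is left to the caller.
    """
    reader = csv.DictReader(
        io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [dict(row) for row in reader]


def run_hive(query):
    """Run a query through the Hive CLI (assumes a valid Kerberos ticket)."""
    result = subprocess.run(
        build_hive_command(query),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if hive exits non-zero
    )
    return parse_hive_output(result.stdout)
```

Calling `run_hive("SELECT ...")` on a host with Hive access would return one dict per result row. Keeping command construction and output parsing in separate functions means both can be checked without a cluster.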