Due to the introduction of Kerberos authentication, wmfdata switched to using PySpark to run SQL queries against Hive databases.
However, we are encountering some significant issues with Spark as a backend; this task tracks those issues.
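For context, the PySpark-backed query path described above can be sketched roughly as follows. This is a minimal illustration, not wmfdata's actual API; the function name and settings here are assumptions for the sake of the example.

```python
# Hypothetical sketch of running a Hive SQL query through PySpark,
# the approach the task description says wmfdata switched to.
# Not wmfdata's real interface; names are illustrative only.
def run_hive_query(query: str):
    """Run a SQL query against Hive and return a pandas DataFrame."""
    # Import lazily so this module loads even where PySpark is absent
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("wmfdata-sketch")   # illustrative app name
        .enableHiveSupport()         # read tables from the Hive metastore
        .getOrCreate()               # reuse a live session if one exists
    )
    # Collect results locally for analysis in the notebook
    return spark.sql(query).toPandas()
```

Because `getOrCreate()` reuses any existing session, a crashed session persists until it is explicitly stopped, which is the behavior at issue in T245713.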
| Status | Assigned | Task |
|---|---|---|
| Resolved | nshahquinn-wmf | T245891: Analysts cannot reliably use wmfdata to run SQL queries against Hive databases |
| Resolved | nshahquinn-wmf | T245097: Update wmfdata to use sensible Spark settings |
| Resolved | nshahquinn-wmf | T245713: wmfdata cannot recover from a crashed Spark session |
| Declined | None | T245892: Spark application UI shows data for different application |
| Declined | nshahquinn-wmf | T245896: Spark applications crash when running large queries |
| Resolved | JAllemandou | T245897: Give clear recommendations for Spark settings |
| Resolved | nshahquinn-wmf | T246060: Update wmfdata to support multiple SQL engines for Hive databases |
| Resolved | elukey | T246132: Spark sessions can provision kerberos tickets in a more predictable manner |
| Declined | nshahquinn-wmf | T247103: wmfdata's Kerberos check should require at least 8 hours of validity |
@kzimmerman I'm making Phab reflect reality as this is having significant impacts on our work and I've been investigating it extensively for the past couple of weeks. Please let me know if we should discuss deprioritizing this.
I'm also trying to get a handle on the various issues that are happening because they don't manifest predictably, so I'll be refining the breakdown here as I gain a better understanding.
@Ottomata, is there a reason you think this is specific to Jupyter? Based on what I've seen, it probably has nothing to do with Jupyter and would manifest the same using PySpark in some other environment.
I suspect T245892: Spark application UI shows data for different application and T245713: wmfdata cannot recover from a crashed Spark session are related to Jupyter. You're right, though, that the others are more general.
Thanks @Milimetric!
Reassigning to @nshahquinn-wmf, who's continuing work on this. I believe he's also going to reach out to Joseph to review wmfdata together, to identify or resolve other issues as needed.
It's worth thinking about whether the goal of having one library (wmfdata) that fits all use cases is a suitable one. Given the disparity of data sizes and complexities (for example, traffic data is large but simple, while edit data is small but complex), I'm not sure the library path is the most efficient one. Maybe a cookbook approach using different Jupyter notebooks with different technologies would be better, so you have cookbook recipes (in the form of preexisting notebooks) that somebody can extend, copy, or modify.
A "catalog" of recipes of jupyter notebooks if you may.