Querying using Hive has been deprecated for almost 10 years. Within the Foundation, Spark and Presto fully cover the use case, and folks generally use those instead.
It's time to deprecate and, eventually, remove this functionality from Wmfdata.
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | nshahquinn-wmf | T384541 Deprecate Wmfdata's Hive module |
| Resolved | | nshahquinn-wmf | T273196 Support importing Pandas datasets into the Data Lake using Wmfdata |
@fkaelin has put up an MR that simply switches run and load_csv to use Spark under the hood.
The main question I have is whether all or virtually all existing queries built for Hive will run without changes on Spark. If that's the case, I think @fkaelin's approach is correct. If instead users should expect to have to tweak some queries for Spark, we should go the more annoying but more proper route:
Two additional things we should do to load_csv while we're touching this code:
I tried several Hive queries from Wikitech and my old notebooks in both Hive and Spark and found two with minor differences in the output. It's really not much, but it's enough to make me think we should take the longer but more careful route:
- leave the existing functionality in place
- deprecate it, with a warning telling users to migrate their queries manually
- eventually remove the Hive module in a major version
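As a rough illustration of the "leave in place, then warn" steps above, here is a minimal sketch using only the Python standard library. The `deprecated` decorator, the warning text, and the stand-in `run` body are all hypothetical and are not taken from the Wmfdata codebase or from the merge request; the real change may look quite different.

```python
import functools
import warnings


def deprecated(message):
    """Return a decorator that emits a FutureWarning on every call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # stacklevel=2 attributes the warning to the caller's line,
            # not to this wrapper.
            warnings.warn(message, category=FutureWarning, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecated(
    "Querying with Hive is deprecated and will be removed in a future "
    "major version. Please migrate your query to Spark or Presto."
)
def run(query):
    # Hypothetical stand-in for the existing Hive code path, which would
    # be left unchanged during the deprecation period.
    return f"ran on Hive: {query}"
```

`FutureWarning` is used here rather than `DeprecationWarning` because Python shows it to end users by default, which suits a library called mostly from notebooks. The existing behavior is untouched, so queries keep returning the same results until the module is actually removed.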
I've decided that we should deprecate, warn, and then remove rather than just silently replacing the Hive functionality with Spark. This won't add much work, although it will add a lot more waiting.
The main benefits are:
nshahquinn-wmf opened https://gitlab.wikimedia.org/repos/data-engineering/wmfdata-python/-/merge_requests/70
hive: Deprecate the hive module
nshahquinn-wmf merged https://gitlab.wikimedia.org/repos/data-engineering/wmfdata-python/-/merge_requests/70
hive: Deprecate the hive module