
Deprecate Wmfdata's Hive module
Closed, Resolved · Public

Description

Querying using Hive has been deprecated for almost 10 years. Within the Foundation, Spark and Presto fully cover the use case, and folks generally use those instead.

It's time to deprecate and, eventually, remove this functionality from Wmfdata.

Event Timeline

nshahquinn-wmf triaged this task as Low priority.
Restricted Application added a subscriber: Aklapper.
nshahquinn-wmf added a subscriber: fkaelin.

@fkaelin has put up an MR that simply switches run and load_csv to use Spark under the hood.

The main question I have is whether all or virtually all existing queries built for Hive will run without changes on Spark. If that's the case, I think @fkaelin's approach is correct. If instead users should expect to have to tweak some queries for Spark, we should go the more annoying but more proper route:

  1. leave the existing functionality in place
  2. deprecate it, with a warning telling users to move their queries manually
  3. eventually remove the Hive module in a major version
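
For what it's worth, the "deprecate it with a warning" step could look roughly like the sketch below. This is only an illustration, not Wmfdata's actual code: the `deprecated` decorator, the stand-in `run` function, and the message text are all hypothetical. It uses `FutureWarning` rather than `DeprecationWarning` because Python hides `DeprecationWarning` by default unless it is triggered from `__main__`, so users calling a library function might never see it.

```python
import functools
import warnings


def deprecated(message):
    """Wrap a function so that each call emits a FutureWarning first."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # stacklevel=2 attributes the warning to the caller's line,
            # not to this wrapper.
            warnings.warn(message, category=FutureWarning, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical stand-in for the Hive module's `run`; the real function's
# signature and behavior are not reproduced here.
@deprecated(
    "The hive module is deprecated and will be removed in a future major "
    "version. Please move your queries to Spark or Presto."
)
def run(query):
    return f"ran: {query}"
```

With this in place the existing behavior is untouched; callers just get a visible warning on every call until the module is removed.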

nshahquinn-wmf raised the priority of this task from Low to Medium. Jan 24 2025, 10:53 PM

Two additional things we should do to load_csv while we're touching this code:

  • Remove the ability to create a database in the process (instead, users can create the database manually if it doesn't already exist)
  • Handle the case where the user is uploading a file that already exists (possible solution: upload to a wmfdata-uploads folder in the user's home folder, and overwrite when necessary).
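
A rough sketch of the second idea, purely as an illustration: it only builds the `hdfs dfs` commands and does not run them. The helper name `build_upload_commands` is hypothetical, and it assumes the user's HDFS home folder is `/user/<username>`, which may not match the actual setup.

```python
from pathlib import PurePosixPath


def build_upload_commands(local_path, username, folder="wmfdata-uploads"):
    """Return the `hdfs dfs` commands for an upload that overwrites any earlier copy."""
    # Assumption: the user's HDFS home directory is /user/<username>.
    upload_dir = PurePosixPath("/user") / username / folder
    destination = upload_dir / PurePosixPath(local_path).name
    return [
        # -p: create missing parent directories and do not fail if the
        # folder already exists.
        ["hdfs", "dfs", "-mkdir", "-p", str(upload_dir)],
        # -f: overwrite the destination if a file with that name is
        # already there.
        ["hdfs", "dfs", "-put", "-f", str(local_path), str(destination)],
    ]
```

Each returned list could be passed to `subprocess.run(cmd, check=True)`. Because the destination is always the same per-user folder, re-uploading a file with the same name replaces the old copy instead of failing.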

"The main question I have is whether all or virtually all existing queries built for Hive will run without changes on Spark."

I tried several Hive queries from Wikitech and my old notebooks in both Hive and Spark and found two with minor differences in the output. It's really not much, but it's enough to make me think we should take the longer but more careful route:

  1. leave the existing functionality in place
  2. deprecate it, with a warning telling users to move their queries manually
  3. eventually remove the Hive module in a major version
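
For anyone who wants to repeat the Hive-versus-Spark comparison on their own queries, here is a minimal sketch of one way to diff two result sets once each has been fetched as a list of rows. The `diff_rows` helper is hypothetical and not part of Wmfdata. It ignores row order, since the two engines may return rows in a different order, and it will not flag type-only differences such as `3` versus `3.0`, because Python treats those values as equal.

```python
from collections import Counter


def diff_rows(hive_rows, spark_rows):
    """Compare two query results as unordered multisets of rows."""
    # Rows are converted to tuples so they can be counted; duplicates matter,
    # so a row appearing twice in one result and once in the other is reported.
    hive_counts = Counter(map(tuple, hive_rows))
    spark_counts = Counter(map(tuple, spark_rows))
    return {
        "only_in_hive": list((hive_counts - spark_counts).elements()),
        "only_in_spark": list((spark_counts - hive_counts).elements()),
    }
```

If both lists in the returned dict are empty, the two engines produced the same rows for that query.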

I've decided that we should deprecate, warn, and then remove rather than just silently replacing the Hive functionality with Spark. This won't add much work, although it will add a lot more waiting.

The main benefits are:

  • Avoid potential user confusion (why are my Hive queries suddenly outputting Spark messages? why is this query output slightly different than it used to be?)
  • Reduce the API surface area, which will make the library simpler and easier to learn

nshahquinn-wmf renamed this task from Deprecate the Hive module to Deprecate Wmfdata's Hive module. Aug 16 2025, 12:18 AM
nshahquinn-wmf changed the task status from Open to In Progress. Aug 19 2025, 6:55 AM
nshahquinn-wmf moved this task from Ready for development to Code review on the Wmfdata-Python board.