
Update wmfdata to support multiple SQL engines for Hive databases
Closed, Resolved, Public

Description

Right now, wmfdata.hive.run runs queries in only one way: through a Spark session with the default settings. Based on a meeting with Analytics, it seems it's not possible to recommend a single optimal engine for all use cases, so the function should support several (Presto, the Hive CLI, and a couple of different bundles of Spark settings).

In my opinion, this should be done by updating wmfdata.hive.run to take an engine parameter, with values like presto, hive-cli, spark, and spark-large, to make it as easy as possible to swap engines. However, I could be convinced that it's better to have separate functions entirely, particularly because Presto may require different functions or syntax in some circumstances.

It's also important to provide a good default for users who don't have the time or expertise to tune; it sounds like that should be the Hive CLI, which provides strong reliability at the cost of some speed and resource use.
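
To make the proposal concrete, here is a rough sketch of what an engine parameter with a Hive CLI default could look like. It is not the wmfdata implementation: the runner functions are hypothetical stubs that only label the query instead of executing it, and the Spark setting values are placeholders rather than recommended tuning.

```python
from functools import partial
from typing import Callable, Dict

# Hypothetical stand-ins for the real backends. Each one just tags the
# query so the dispatch logic can be exercised without a cluster.
def _run_presto(query: str) -> str:
    return f"[presto] {query}"

def _run_hive_cli(query: str) -> str:
    return f"[hive-cli] {query}"

def _run_spark(query: str, settings: Dict[str, str]) -> str:
    return f"[spark {settings}] {query}"

# Placeholder bundle of Spark settings; the values are illustrative only.
SPARK_LARGE_SETTINGS = {
    "spark.executor.memory": "8g",
    "spark.sql.shuffle.partitions": "512",
}

# Map each engine name from the proposal to a callable taking the query.
ENGINES: Dict[str, Callable[[str], str]] = {
    "presto": _run_presto,
    "hive-cli": _run_hive_cli,
    "spark": partial(_run_spark, settings={}),
    "spark-large": partial(_run_spark, settings=SPARK_LARGE_SETTINGS),
}

def run(query: str, engine: str = "hive-cli") -> str:
    """Run `query` on the named engine, defaulting to the Hive CLI."""
    try:
        runner = ENGINES[engine]
    except KeyError:
        valid = ", ".join(sorted(ENGINES))
        raise ValueError(f"Unknown engine {engine!r}; expected one of: {valid}") from None
    return runner(query)
```

With this shape, swapping engines is a one-argument change, for example run(query, engine="spark-large"), and callers who pass nothing get the Hive CLI. If Presto turns out to need different syntax, its runner could be split into a separate function without changing the other entries.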

Event Timeline

Leaving this unprioritized for now, pending more thinking and discussion about the best way to deal with these issues.

The big pull request is complete: https://github.com/neilpquinn/wmfdata/pull/8

Assigning to Mikhail since he is reviewing.

Mikhail reviewed and merged the pull request! I'm going to do a little extra cleanup while we're at it, and then we'll make this our first properly versioned release.

We've released version 1.0, with support for running SQL through both Spark (with sensible settings) and the Hive CLI! Presto support is planned for later (T247062).