
Refactor the HivePartition logic from mjolnir to discolytics.
Closed, Resolved | Public | 5 Estimated Story Points

Description

Mjolnir CLI helpers provide the HivePartition class for managing datasets. Additionally, they include helper methods for filtering input data based on date and time partitions.

This logic should be refactored into discolytics.
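
As a rough illustration of the kind of helper being moved, here is a minimal sketch of a partition reference that parses a spec string and builds a partition filter. The class name `PartitionRef`, the `db.table/key=value` spec format, and the example table name are assumptions made for this sketch; they are not the actual mjolnir or discolytics API.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class PartitionRef:
    """Hypothetical stand-in for a HivePartition-style reference."""

    table: str
    spec: Mapping[str, str]

    @classmethod
    def parse(cls, text: str) -> "PartitionRef":
        # Assumed format: "db.table/key=value/key=value"
        table, *parts = text.split("/")
        spec = {}
        for part in parts:
            key, sep, value = part.partition("=")
            if not sep or not key or not value:
                raise ValueError(f"Malformed partition component: {part!r}")
            spec[key] = value
        return cls(table=table, spec=spec)

    def where_clause(self) -> str:
        # Values are not escaped here; a real helper would need to do that.
        if not self.spec:
            return "TRUE"
        return " AND ".join(f"`{k}` = '{v}'" for k, v in self.spec.items())

    def select_sql(self) -> str:
        # Query restricted to the rows of this one partition.
        return f"SELECT * FROM {self.table} WHERE {self.where_clause()}"


# Example with a made-up table name.
ref = PartitionRef.parse("example_db.example_table/date=20250410/wiki=enwiki")
print(ref.select_sql())
```

Keeping the table name and partition spec together in one value means the date and time filtering can be derived from the same object that identifies the dataset.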

Event Timeline

Gehel set the point value for this task to 5. (Apr 14 2025, 3:47 PM)

A few extra complications:

  • mjolnir has code that takes both the loaded DataFrame and a schema and makes sure they are compatible. That might be useful to bring into discolytics as an optional extension, since it helps document the shape of the data the script expects.
  • mjolnir reads a table that uses a User Defined Type (VectorUDT); these cannot be read from Hive. Instead we use DataFrame.inputFiles to re-read the parquet files directly. That likely needs to stay on the mjolnir side, as we've never seen another use case for UDTs.
  • mjolnir has some code in mjolnir.cli.helpers.Cli._resolve_partition_spec that magics up the partitioning information used in the output tables. We will need to decide what to do with this.
  • mjolnir doesn't write through Hive directly; rather, it writes parquet files and then adds partitions in Hive that point at those parquet files. Not sure if this functionality is required.
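
To make two of these points concrete, here is a minimal sketch of a schema compatibility check and of the Hive DDL that registers already-written parquet files as a partition. Both function names, the column-name-to-type mapping, and the example table and path are hypothetical; this is not how mjolnir implements either step.

```python
from typing import List, Mapping


def schema_mismatches(actual: Mapping[str, str], expected: Mapping[str, str]) -> List[str]:
    """List the ways `actual` fails to satisfy `expected`.

    Both arguments map column name to type string. With Spark, `actual`
    could plausibly be built from dict(df.dtypes); that is an assumption.
    Extra columns in `actual` are tolerated in this sketch.
    """
    problems = []
    for name, dtype in expected.items():
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name] != dtype:
            problems.append(f"type mismatch for {name}: {actual[name]} != {dtype}")
    return problems


def add_partition_sql(table: str, spec: Mapping[str, str], location: str) -> str:
    """Build the Hive statement that points a partition at existing files."""
    # Values and the location are not escaped in this sketch.
    parts = ", ".join(f"`{k}`='{v}'" for k, v in spec.items())
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({parts}) LOCATION '{location}'"
    )


# Example with made-up names.
print(schema_mismatches({"wiki": "string"}, {"wiki": "string", "hits": "bigint"}))
print(add_partition_sql("example_db.out", {"date": "20250410"}, "hdfs:///tmp/example/out"))
```

Returning a list of problems rather than raising on the first one lets a script report every schema difference at once.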

This has now been fully deployed, but we will want to make sure it runs to completion this week. It typically starts Thursday at 00:00 and runs for 20-30 hours.