Page MenuHomePhabricator

Create Spark code to compare DateTimes with partition columns
Closed, ResolvedPublic

Description

It would be cool to have an easy way to slice a Hive table, which is partitioned by year, month, day and hour, by using DateTime values.
Right now, the only way (that I know of) to do that is to have a complex condition with lots of ANDs and ORs, that depends on the number of existing partitionKeys:

def getTimePartitionPruningCondition(
    since: DateTime,
    until: DateTime,
    partitionKeys: Seq[String]
): String = {
    if (partitionKeys.contains("hour")) { s"""(
        year > ${since.year.get} AND year < ${until.year.get} OR
        year = ${since.year.get} AND (
            month > ${since.monthOfYear.get} OR
            month = ${since.monthOfYear.get} AND (
                day > ${since.dayOfMonth.get} OR
                day = ${since.dayOfMonth.get} AND hour >= ${since.hourOfDay.get}
            )
        ) OR
        year = ${until.year.get} AND (
            month < ${until.monthOfYear.get} OR
            month = ${until.monthOfYear.get} AND (
                day < ${until.dayOfMonth.get} OR
                day = ${until.dayOfMonth.get} AND hour <= ${until.hourOfDay.get}
            )
        )
    )""" }
    else if (partitionKeys.contains("day")) { s"""(
        year > ${since.year.get} AND year < ${until.year.get} OR
        year = ${since.year.get} AND (
            month > ${since.monthOfYear.get} OR
            month = ${since.monthOfYear.get} AND day >= ${since.dayOfMonth.get}) OR
        year = ${until.year.get} AND (
            month < ${until.monthOfYear.get} OR
            month = ${until.monthOfYear.get} AND day <= ${until.dayOfMonth.get})
    )""" }
    else if (partitionKeys.contains("month")) { s"""(
        year > ${since.year.get} AND year < ${until.year.get} OR
        year = ${since.year.get} AND month >= ${since.monthOfYear.get} OR
        year = ${until.year.get} AND month <= ${until.monthOfYear.get}
    )""" }
    else { s"year >= ${since.year.get} AND year <= ${until.year.get}" }
}

May also be able to make a helper for Hive to create a string you inject into your WHERE clause: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VariableSubstitution

Event Timeline

Note that the example code considers both since and until DateTimes inclusive, we might want to consider whether until should be exclusive?

Milimetric renamed this task from Create Hive UDF to compare DateTimes with partition columns to Create Spark code to compare DateTimes with partition columns.Jan 3 2019, 6:26 PM
Milimetric triaged this task as Medium priority.Jan 3 2019, 6:33 PM
Milimetric moved this task from Incoming to Data Quality on the Analytics board.

Joseph was right, you can't substitute a string in as a WHERE clause condition, only as a value to compare to within the WHERE clause. The parser gets an error:

hive (wmf)> select * from wmf.pageview_hourly where ${hiveconf:y} and year=2018 limit 1;
FAILED: ClassCastException org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableConstantStringObjectInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.primitive.BooleanObjectInspector

Change 673604 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery/source@master] [WIP] Add support for finding RefineTarget inputs from Hive

https://gerrit.wikimedia.org/r/673604

Change 673604 merged by Joal:

[analytics/refinery/source@master] Add support for finding RefineTarget inputs from Hive

https://gerrit.wikimedia.org/r/673604