
Use native timestamp types in Data Lake edit data (needs Hive 1.2)
Closed, Resolved · Public · 5 Story Points

Description

Hive has native timestamp types, which work seamlessly with date and time functions. These functions remove the need for a lot of (CPU-intensive) manual casting and math, so it would be nice to use them in place of or in addition to the MediaWiki-style string timestamps.
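As a rough illustration of the manual casting mentioned above, here is a minimal Python sketch. It is an analogy rather than Hive code, and the function names are hypothetical. It contrasts computing an interval from MediaWiki-style string timestamps, which must be parsed on every call, with the same calculation on values that are already native timestamps.

```python
from datetime import datetime

# MediaWiki-style string timestamp layout, e.g. 20180514215953 (assumed from the examples in this task).
MW_FORMAT = "%Y%m%d%H%M%S"


def seconds_between_mw(start: str, end: str) -> int:
    """Elapsed seconds between two MediaWiki-style string timestamps.

    Both strings have to be parsed before any arithmetic is possible;
    this is the repeated casting a native timestamp type would avoid.
    """
    delta = datetime.strptime(end, MW_FORMAT) - datetime.strptime(start, MW_FORMAT)
    return int(delta.total_seconds())


def seconds_between_native(start: datetime, end: datetime) -> int:
    """The same calculation when the values are already native timestamps."""
    return int((end - start).total_seconds())
```

With native values the subtraction is direct, which is the convenience the description is asking for on the Hive side.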

Event Timeline

Restricted Application added a subscriber: Aklapper. · Mar 22 2017, 7:41 PM
Nuria triaged this task as Normal priority. · Mar 27 2017, 3:45 PM
Nuria added a subscriber: Nuria.

Putting this on Q4

Nuria raised the priority of this task from Normal to High. · May 29 2017, 3:51 PM
JAllemandou edited projects, added Analytics-Kanban; removed Analytics.
JAllemandou set the point value for this task to 5.
JAllemandou moved this task from Next Up to In Progress on the Analytics-Kanban board.

Code ready on the Spark side, but timestamps in Parquet were added to Hive in version 1.2 and we have 1.1 :( (see https://issues.apache.org/jira/browse/HIVE-6384).

Nuria closed this task as Resolved. · Jul 12 2017, 7:21 PM

> Code ready on the Spark side, but timestamps in Parquet were added to Hive in version 1.2 and we have 1.1 :( (see https://issues.apache.org/jira/browse/HIVE-6384).

That's a shame! Do you have a general sense of when we might upgrade to 1.2? 3 months? A year? Never?

Nuria added a subscriber: Ottomata. · Aug 7 2017, 3:57 AM

> That's a shame! Do you have a general sense of when we might upgrade to 1.2? 3 months? A year? Never?

Cloudera needs to provide a distro that includes this version of Hive. I think that release is several months late, as I remember @Ottomata saying it was scheduled for early 2017.
The most current distro is on Hive 1.1: https://www.cloudera.com/documentation/enterprise/release-notes/topics/cm_vd_cdh_package_tarball_512.html#cm_vd_cdh_package_tarball_512

Thank you, good to know! Would it make sense to keep this open and stalled while we're waiting? It would put my mind at ease although of course it's not a big deal.

Neil_P._Quinn_WMF reopened this task as Stalled. · Nov 6 2017, 8:29 PM
Neil_P._Quinn_WMF lowered the priority of this task from High to Low.

> Thank you, good to know! Would it make sense to keep this open and stalled while we're waiting? It would put my mind at ease although of course it's not a big deal.

Since it seems like no one will mind, I'll keep this open, as a reminder to reconsider it whenever the software support becomes available.

Nuria closed this task as Resolved. · Nov 6 2017, 8:40 PM
Nuria reopened this task as Open. · Nov 6 2017, 8:47 PM
Neil_P._Quinn_WMF renamed this task from Use native timestamp types in Data Lake edit data to Use native timestamp types in Data Lake edit data (needs Hive 1.2). · Nov 9 2017, 11:53 PM
Neil_P._Quinn_WMF removed JAllemandou as the assignee of this task.
Neil_P._Quinn_WMF edited projects, added Analytics; removed Analytics-Kanban.
Neil_P._Quinn_WMF moved this task from Wikistats Production to Blocked on the Analytics board.
Neil_P._Quinn_WMF added a subscriber: JAllemandou.

The latest CDH version is 5.14, which sadly still has Hive 1.1.

Neil_P._Quinn_WMF closed this task as Resolved. · May 14 2018, 10:20 PM
Neil_P._Quinn_WMF claimed this task.

This is actually resolved! When I originally filed this, the timestamps were stored in MediaWiki's string format (e.g. 20180514215953), but in June 2017 we started storing them in JDBC format (e.g. 2018-05-14 21:59:53.0). This means Hive's date and time functions can now operate on them.

Of course, this is now inconsistent with EventLogging timestamps in the Data Lake, which are stored in ISO 8601 format (e.g. 2018-05-14T21:59:53), but that's another issue 😁
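For anyone who has to work across these formats, here is a minimal Python sketch of normalizing all three to a single native value. The format strings are assumptions inferred only from the examples in this thread, and `parse_timestamp` is a hypothetical helper, not part of any Data Lake tooling.

```python
from datetime import datetime

# Layouts assumed from the examples quoted in this task.
FORMATS = {
    "mediawiki": "%Y%m%d%H%M%S",       # 20180514215953
    "jdbc": "%Y-%m-%d %H:%M:%S.%f",    # 2018-05-14 21:59:53.0
    "iso8601": "%Y-%m-%dT%H:%M:%S",    # 2018-05-14T21:59:53
}


def parse_timestamp(value: str) -> datetime:
    """Try each known layout in turn and return the first successful parse."""
    for fmt in FORMATS.values():
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp format: {value!r}")


if __name__ == "__main__":
    for sample in ("20180514215953", "2018-05-14 21:59:53.0", "2018-05-14T21:59:53"):
        print(sample, "->", parse_timestamp(sample).isoformat())
```

The sketch treats every value as a naive timestamp and does not handle time zone suffixes or other ISO 8601 variants; those would need extra layouts.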

Neil_P._Quinn_WMF removed Neil_P._Quinn_WMF as the assignee of this task. · May 14 2018, 10:21 PM