
Use Hive/Spark timestamps in Refined event data
Open, LowPublic

Description

Now that we've upgraded Hive, we can use actual timestamp types!

To do this, I think we need:

  • ALTER TABLE timestamp_formats SET SERDEPROPERTIES ("timestamp.formats"="yyyy-MM-dd'T'HH:mm:ss,millis")

and/or

  • Changes to JsonSchemaConverter to use TimestampType for date-time formatted JSON string fields.

I'm not yet sure exactly how this would work for existing tables. Needs some testing to find out.
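The second change above can be sketched in plain Python (the real JsonSchemaConverter lives in Scala in analytics/refinery/source; this is only an illustration of the proposed mapping, with Spark-style type names as strings):

```python
# Hypothetical sketch of the JsonSchemaConverter change: JSON Schema string
# fields with "format": "date-time" would map to Spark's TimestampType
# instead of StringType. Everything here is illustrative, not the real code.

def json_type_to_spark_type(field_schema: dict) -> str:
    """Return a Spark-like type name for a single JSON Schema field definition."""
    json_type = field_schema.get("type")
    if json_type == "string":
        # The proposed change: date-time formatted strings become
        # timestamps rather than plain strings.
        if field_schema.get("format") == "date-time":
            return "TimestampType"
        return "StringType"
    if json_type == "integer":
        return "LongType"
    if json_type == "number":
        return "DoubleType"
    if json_type == "boolean":
        return "BooleanType"
    # Fall back to string for anything unrecognized.
    return "StringType"

print(json_type_to_spark_type({"type": "string", "format": "date-time"}))
# TimestampType
print(json_type_to_spark_type({"type": "string"}))
# StringType
```

The open question in the description (what happens to existing tables whose columns are already STRING) is exactly where this mapping would change query semantics.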

Event Timeline

fdans moved this task from Incoming to Datasets on the Analytics board.

@Ottomata do you need something from Research for this task? (@fkaelin cc) I'm asking as we're reviewing tasks in our backlog for prioritization and I'm not sure what the status of this one is.

Nope! I think this was tagged with y'all long ago because it would change some basic query semantics of timestamps in Hive tables. We should do it, but I guess it's low enough priority?

It'll become higher priority when we need timestamp typed fields for Iceberg :)

OO, maybe we should do this as part of the Iceberg migration then, since we will be creating new tables for that.

Agreed that the Hive tables can stay as they are, and the new Iceberg tables can do proper DATEs and TIMESTAMPs. When inserting into Iceberg, we can cast accordingly. See T335305 for an example conversion from year, month, day INTs to a DATE.
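The casts described above can be sketched in plain Python (the real conversion would be Spark SQL, roughly `make_date(year, month, day)` and a timestamp cast; field names here like `dt` are illustrative):

```python
# Hypothetical sketch of the Hive-to-Iceberg casts: legacy partitions carry
# year/month/day as INTs and event time as an ISO-8601 string; Iceberg
# tables would store a proper DATE and TIMESTAMP instead.
from datetime import date, datetime

def partition_ints_to_date(year: int, month: int, day: int) -> date:
    # Rough equivalent of Spark SQL's make_date(year, month, day).
    return date(year, month, day)

def event_dt_to_timestamp(dt: str) -> datetime:
    # Parse an ISO-8601 event timestamp string such as "2023-06-01T12:34:56Z".
    # fromisoformat() does not accept the trailing "Z" before Python 3.11,
    # so normalize it to an explicit UTC offset first.
    return datetime.fromisoformat(dt.replace("Z", "+00:00"))

print(partition_ints_to_date(2023, 6, 1))
# 2023-06-01
```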

If we migrate Refine to write to Iceberg, we'll need to modify JsonSchemaConverter as noted in the description to do the right thing.

Also, maybe we should do T321854: [Event Platform] Move Spark JsonSchemaConverter out of analytics/refinery/source and into wikimedia-event-utilities at the same time. This will be easier/possible once T337421: Fix wikimedia-event-utilities Guava dependencies issues is done.

Thanks, all. Then I'll remove this from our to-do lane. Please do add Research back if you want us to support you in some direct way. And thank you for your work. :)

Sounds cool!

Change 936293 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery/source@master] Use eventutilities-spark JsonSchemaSparkConverter in Refine and elsewhere

https://gerrit.wikimedia.org/r/936293

Change 936293 merged by jenkins-bot:

[analytics/refinery/source@master] Use eventutilities-spark JsonSchemaSparkConverter in Refine and elsewhere

https://gerrit.wikimedia.org/r/936293