
Add Event Platform timestamp JSONSchema -> Flink type support
Closed, Resolved (Public)

Description

This change implements conversions from Event Platform event JSONSchemas to Flink types in both the Table and DataStream APIs.

That change did not implement any conversion from string fields to date-time timestamps, which JSONSchema represents with format: date-time.

The JSONSchema date-time format supports timestamps with timezone information, and Event Platform specifies that we prefer these date-times in UTC 'Z' timezone format, e.g. "2022-05-01T00:00:00Z". JSONSchema date-time will also validate with timezone offsets, e.g. "2022-05-01T00:00:00-05:00".

As far as I can tell, Flink does not really support string date-times with timezone info. It supports timezone-less and local-timezone timestamps. The semantics differ only in that local-timezone date-times are stored as UTC timestamps and presented in local time depending on the Flink table.local-time-zone setting.
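To make the difference concrete, here is a minimal sketch in plain Python (not Flink code). It assumes the local-timezone model described above: one stored UTC instant, rendered differently depending on a session time zone. The helper `present` and the fixed -05:00 session offset are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Local-timezone model: the value is stored as a single UTC instant.
stored_utc = datetime(2022, 5, 1, 0, 0, 0, tzinfo=timezone.utc)

# Timezone-less model: only wall-clock fields, no zone attached at all.
wall_clock = datetime(2022, 5, 1, 0, 0, 0)


def present(instant, session_zone):
    """Render a stored UTC instant in a session time zone (hypothetical helper)."""
    return instant.astimezone(session_zone).strftime("%Y-%m-%d %H:%M:%S")


# The same stored instant, presented under two different session zones.
utc_view = present(stored_utc, timezone.utc)
offset_view = present(stored_utc, timezone(timedelta(hours=-5)))

print(utc_view)
print(offset_view)
print(wall_clock.tzinfo)
```

The stored instant never changes; only its presentation does. The timezone-less value carries no zone, so it cannot be shifted this way.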

Our event data hopefully will have all date-times in 'Z' UTC format, and using local-timezone Flink timestamps will usually be the right thing to do. However, it is possible, especially in client side submitted instrumentation event data, for date-times to come in with timezone offsets. If we always convert date-time to Flink local-timezone timestamps, these will fail conversion.

I'm not sure of the right thing to do here. We could just keep JSONSchema date-time fields as strings and let users in Flink deal with conversion to timestamp types where needed. It would be nice if this were automated, though.

Related: T278467: Use Hive/Spark timestamps in Refined event data

Event Timeline

However, it is possible, especially in client side submitted instrumentation event data, for date-times to come in with timezone offsets. If we always convert date-time to Flink local-timezone timestamps, these will fail conversion.

Perhaps the right thing to do is to use 'local timezone' (UTC) for all date-times, and in the case where an incoming JSONSchema date-time has a timezone offset, convert it to the corresponding UTC time and store that. We lose the incoming timezone info, but as long as the UTC time is correct, perhaps that is fine?
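A rough sketch of that normalization, in plain Python rather than the actual wikimedia-event-utilities code: parse the date-time string, require that it carries a zone, and shift it to UTC. The function name `date_time_to_utc` is hypothetical, and the sketch covers only the two example shapes from the description, not every variant the date-time format allows.

```python
from datetime import datetime, timezone


def date_time_to_utc(value):
    """Parse a date-time string with 'Z' or an offset and return it in UTC.

    Hypothetical helper for illustration only.
    """
    # Older Python versions do not accept a trailing 'Z' in fromisoformat,
    # so rewrite it as an explicit zero offset first.
    if value.endswith(("Z", "z")):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Without a zone the instant is ambiguous, so refuse to guess.
        raise ValueError("date-time has no timezone information: " + value)
    # The original offset is discarded; only the UTC instant is kept.
    return parsed.astimezone(timezone.utc)


print(date_time_to_utc("2022-05-01T00:00:00Z").isoformat())
print(date_time_to_utc("2022-05-01T00:00:00-05:00").isoformat())
```

Both inputs end up as UTC instants, and the -05:00 input moves forward five hours. The original offset cannot be recovered afterwards, which is the trade-off mentioned above.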

Change 819726 had a related patch set uploaded (by Ottomata; author: Ottomata):

[wikimedia-event-utilities@master] Support JSONSchema Flink timestamp conversions

https://gerrit.wikimedia.org/r/819726

Change 819726 merged by jenkins-bot:

[wikimedia-event-utilities@master] Support JSONSchema Flink timestamp conversions

https://gerrit.wikimedia.org/r/819726