Page MenuHomePhabricator

[Metrics Platform] Standardize Timestamps across all datetime fields
Open, MediumPublic

Description

Per our discussion with Product Analytics, Andrew Otto, Jason Linehan, and Michael Holloway in the Product Analytics team sharing meeting, we have agreed to move forward with ms precision for all datetime fields.

Events should be invalid if the timestamp is not in ms granularity.

Any datetime fields that require a different level of granularity should be considered a different field type and not named dt.

Event Timeline

kzimmerman moved this task from Inbox to To Do on the Better Use Of Data board.

@DAbad wanted to flag this discussion we had about timestamps. Product Analytics feels strongly that we should standardize datetime fields going forward (and codify those standards), and we agreed with Andrew/Jason/Michael that we should standardize on ms granularity.

Events should be invalid if the timestamp is not in ms granularity.

I looked into if this would be easy to do with just JSONSchema, and it isn't. JSONSchema follows https://tools.ietf.org/html/rfc3339#section-5.6, in which the time second-fraction is optional. We'd have to either A. define a custom validation keyword inside of all of EventGate (which uses nodejs ajv), and also in any other validators (in clients that don't use EventGate), OR we'd have to always use a custom format regex for all date time fields. Perhaps we could automate this with jsonschema-tools fragments somehow? Not sure.

Any datetime fields that require a different level of granularity should be considered a different field type and not named dt.

Hm, I'm not so sure about this strict rule, let's discuss more. If anything, I'd do it the other way around: make a new naming convention for fields that must be ISO-8601 with milliseconds. Maybe dt_ms? Although, then our existent fields would violate the convention. I dunno! I think this convention will be hard to enforce everywhere, especially since we already have some dt fields without milliseconds.

Perhaps just making clients emit milliseconds everywhere possible is an easier way to go?

Events should be invalid if the timestamp is not in ms granularity.

I looked into if this would be easy to do with just JSONSchema, and it isn't. JSONSchema follows https://tools.ietf.org/html/rfc3339#section-5.6, in which the time second-fraction is optional. We'd have to either A. define a custom validation keyword inside of all of EventGate (which uses nodejs ajv), and also in any other validators (in clients that don't use EventGate), OR we'd have to always use a custom format regex for all date time fields. Perhaps we could automate this with jsonschema-tools fragments somehow? Not sure.

Any datetime fields that require a different level of granularity should be considered a different field type and not named dt.

Hm, I'm not so sure about this strict rule, let's discuss more. If anything, I'd do it the other way around: make a new naming convention for fields that must be ISO-8601 with milliseconds. Maybe dt_ms? Although, then our existent fields would violate the convention. I dunno! I think this convention will be hard to enforce everywhere, especially since we already have some dt fields without milliseconds.

Perhaps just making clients emit milliseconds everywhere possible is an easier way to go?

I would say that ideally we would have one datetime stamp field regardless of whether milliseconds are populated or not. In the past I've run into similar issues where we decided to spin out multiple datetime stamp fields, which lead to not having a single field that consistently could be used to pull datetime. This led to creating a 3rd derived datetime field that captured the values from both of the datetime stamps.

Honestly if we go with ISO standards and milliseconds is optional, we can do our best to encourage/push population of milliseconds where feasible, but shouldn't make it a hard requirement.

I would say that ideally we would have one datetime stamp field

In general I agree, but we have a small problem with that. When we accept data from external sources, clients can POST any data they want and as long as it conforms to the schema, it will be accepted. This means that they can set any dt value they want too. We can't trust and use externally provided dts for things like Kafka timestamps and Hive directory partitioning. We actually had a bug in the past where a client had set a dt years in the future. Since Kafka only deletes data that is older than 7 days, this caused any messages produced after that one to be kept basically forever (or until the present would have progressed 7 days passed the offending dt), which caused Kafka disks to fill up. We mitigated this by rejecting timestamps too far in the future.

Ideally we'd trust the client's event timestamp, as they know best what the proper time for an event's happening is. We do trust client dt for internally produced events. But for external ones, we can't trust it for partitioning, so we need another dt for that. We use the event ingetsion (EventGate receive) time for that.

See also T267648: Adopt conventions for server receive and client/event timestamps in non analytics event schemas and T240460: Clients need to generate an ISO 8601 formatted timestamp.

So we have two time_stamps: Server_dt and Client_dt. We should validate using the Server_dt to validate if the client_dt is off and then determine how to handle it...

After speaking with Jason and Michael today, we likely don't need milliseconds and if we don't have fractional sections we will interpret as a round number.

After speaking with Jason and Michael today, we likely don't need milliseconds and if we don't have fractional sections we will interpret as a round number.

Hm, you know, EventGate could add add the milliseconds fraction if not provided by the client!

So we have two time_stamps: Server_dt and Client_dt. We should validate using the Server_dt to validate if the client_dt is off and then determine how to handle it...

Agree! That's what T267648: Adopt conventions for server receive and client/event timestamps in non analytics event schemas is about, although with different names. I agree the proposed names are a little confusing...but changing them now is quite a bit of work. Actually, changing dt (client side) name would not be so hard, changing meta.dt is a bit harder.

So we have two time_stamps: Server_dt and Client_dt. We should validate using the Server_dt to validate if the client_dt is off and then determine how to handle it...

After speaking with Jason and Michael today, we likely don't need milliseconds and if we don't have fractional sections we will interpret as a round number.

Does this mean "we don't need to require milliseconds" (I agree) or "there's no benefit to adding milliseconds" (my team came to a different conclusion)?

After speaking with Jason and Michael today, we likely don't need milliseconds and if we don't have fractional sections we will interpret as a round number.

Does this mean "we don't need to require milliseconds" (I agree) or "there's no benefit to adding milliseconds" (my team came to a different conclusion)?

The former.