Page MenuHomePhabricator

Standardize datetimes/timestamps in the Data Lake
Open, NormalPublic

Description

It would be amazing if all Data Lake tables used a consistent format for their time columns. The first step is picking a standard.

Datetime or timestamp?

EventBus, EventLogging, and call it a datetime (dt), but mediawiki_history calls it a timestamp (xxxxxx_timestamp). webrequest, for some reason, has both a ts and a dt field with different formats.

Which format?

There are actually five different formats in use!

formattables usingcomment
YYYY-mm-ddTHH:MM:SSZEventLogging tables ( e.g. event.editattemptstep and event.editorjourney), mediawiki_wikitext_historyStandard ISO 8601 with one form of timezone designation
YYYY-mm-ddTHH:MM:SS+00:00EventBus tables (e.g. event.mediawiki_revision_create and event.mediawiki_page_create)Standard ISO 8601 with another form of timezone designation
YYYY-mm-dd HH:MM:SS.0mediawiki_history and siblingsHive/JDBC standard, which makes it easy to use with Hive date and time functions— alternatives can be configured starting with Hive 1.2.0. Nicely human readable.
unix epoch secondswebrequest's ts fieldThis is a Hive timestamp type, which Hive will display as YYYY-mm-dd HH:MM:SS.0 when you select it.
YYYY-mm-ddTHH:MM:SSwebrequest's dt fieldStandard ISO 8601 without time zone info
Unix timestampCirrusSearchRequestSet

Event Timeline

Milimetric triaged this task as High priority.Jan 3 2019, 6:19 PM
Milimetric moved this task from Incoming to Data Quality on the Analytics board.

I going to boldly broaden this into a plea to fully standardize datetimes/timestamps in the Data Lake 🙏 There are actually two more formats that Morten didn't mention! I'll update the description.

Neil_P._Quinn_WMF renamed this task from Columns named "dt" in the Data Lake have different formats to Standardize datetimes/timestamps in the Data Lake.Jan 18 2019, 1:59 AM
Neil_P._Quinn_WMF updated the task description. (Show Details)
Neil_P._Quinn_WMF updated the task description. (Show Details)

Thanks Neil!

The general convention we've been trying to move forward to over the years is:

  • .*dt fields are human readable ISO-8601 (ish) fields. More on this below
  • .*ts fields are unix epoch timestamps

With a strong preference for .*dt fields everywhere if possible. There may be reasons to use unix timestamps over datetime strings sometimes (performance?) but most of the time dt is better.

We don't have a convention to differentiate between seconds vs. milliseconds unix timestamp field names. Perhaps we should make one. ts vs ts_ms?

Standard ISO 8601 with one form of timezone designation [Z]
Standard ISO 8601 with another form of timezone designation [+00:00]

These both allowed in ISO-8601.

An offset of zero, in addition to having the special representation "Z", can also be stated numerically as "+00:00", "+0000", or "+00".

I prefer the Z, since it is shorter and we use UTC everywhere anyway, but it hasn't been a requirement. The EventBus tables come from mediawiki...I'll make a ticket to use Z here.

Standard ISO 8601 without time zone info

Yeah you are right, this is wrong. I think this may be very easy to fix. We should just add the Z here. Will make a ticket.

Hive/JDBC standard, which makes it easy to use with Hive date and time functions— alternatives can be configured starting with Hive 1.2.0. Nicely human readable.

Yea, it is unfortunate that we don't have a recent Hive. We talked about this when creating MW history, and decided that we'd go ahead and use the Hive format instead of ISO-8601 so we could more easily use Hive datetime functions. This format is close enough to ISO-8601 (as long as milliseconds aren't used) to be parseable by tools that can parse ISO-8601 (e.g. python dateutil.parser). Ideally this would be fixed by upgrading Hive and setting the timestamp format on each table, so we can use the ISO-8601 standard everywhere.

So, if we fix EventBus events and webrequest, the only offender (CirrusSearchRequestSet is moving to EventBus) will be mediawiki history stuff. I suggest we just keep that as is and deal with it, at least until we get a more recent version of Hive.

Neil_P._Quinn_WMF added a comment.EditedMar 6 2019, 8:00 PM

So, if we fix EventBus events and webrequest, the only offender (CirrusSearchRequestSet is moving to EventBus) will be mediawiki history stuff. I suggest we just keep that as is and deal with it, at least until we get a more recent version of Hive.

Thanks for chiming in, @Ottomata! Two thoughts:

First, about the choice of format: My main desire is to see this standardized, so I think picking YYYY-mm-ddTHH:MM:SSZ is reasonable and I would be happy with that outcome. However, can I just say I'd be even more happy with standardizing on YYYY-mm-dd HH:MM:SS? It's not strict ISO 8601, but it's close, unambiguous, and a lot more human readable (which is an important consideration for me and probably other analysts since I do a lot of data exploration).

I was reluctant to say this at first, since it seemed like pointless quibbling. But after thinking about it, I really do thing this would be a increase in ergonomics that would bring a meaningful benefit (if a diffuse one) to analysts, product managers, community members, and others. Are you at all open to this argument? If not, I'm willing to leave it here and enjoy the benefits of just having a standard. But if you are, I'd really like to discuss the costs and benefits further.

Second, I'm inclined to say that, whatever format we choose, we should standardize on it as soon as possible. Moving mediawiki_history to pure ISO-8601 would make it harder to use the datetime functions, but it's already not possible to use those on the other tables we have, so I'd rather have a standard annoyance everywhere 😁

I'd be even more happy with standardizing on YYYY-mm-dd HH:MM:SS

I'd like to hear from @JAllemandou (if he has an opinion!).

But, I think we are a bit too far in for this. Everything except for these two Hive tables is using ISO-8601, and the only reason the Hive tables aren't is because they are Java based and Hive hasn't supported non Java timestamp formats in our versions yet. If we ever upgrade, we'd want to make Hive timestamps use ISO-8601. I also like the fact that we are going with an ISO standard, and not a Java specific one.

unambiguous

While we do use UTC timezone in all of our time stamps, it is not impossible that we might have to parse ones with timezones one day. The ISO standard makes parsing these unambiguous, while it's not totally clear to me how timezones would be represented in the default Java format.

a lot more human readable

While having the space does make the datetime slightly more human readable, it also IMO makes it slightly less easy to parse. Log lines are often whitespace delimited, and having to add special logic to indicate that the dt is a single field with a space in it is a little annoying (I'm thinking of sed | awk type CLI pipelines).

BTW, I just looked at webrequest ts and mediawiki_history e.g. event_timestamp. webrequest's ts is actually a Hive timestamp type, and the comment there (about unix milliseconds) seems misleading. event_timestamp is a string. @JAllemandou remind me why we aren't using a Hive timestamp in MW History?

Change 495399 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/core@master] ApiMain.php api/request logging event changes

https://gerrit.wikimedia.org/r/495399

mforns lowered the priority of this task from High to Normal.Mar 11 2019, 3:32 PM

I have an opinion and some information here.
We use string-encoded timestamps in mediawiki-history because our version of hive doesn't support timestamps in parquet (see https://issues.apache.org/jira/browse/HIVE-6384).
The reason we chose the SQL format instead of the ISO one is because Hive UDF work as-is for the former, not the latter:

select dt from event.navigationtiming where year = 2019 and month = 3 and day = 11 and hour = 0 limit 1;
2019-03-11T00:10:45Z

select unix_timestamp(dt) from navigationtiming where year = 2019 and month = 3 and day = 11 and hour = 0 limit 1;
NULL

select event_timestamp from wmf.mediawiki_history where snapshot = '2019-02' limit 1;
2011-08-08 18:58:35.0

select unix_timestamp(event_timestamp) from wmf.mediawiki_history where snapshot = '2019-02' limit 1;
1312829915
Nuria added a subscriber: Nuria.Mar 11 2019, 11:32 PM

While a common standard makes sense, I really see a lot of value in being able to use hive udfs in mw history so I defer to @JAllemandou for what format is best, in any case let's please not change anything additional in mw history until we are done with our current run of changes for scooping data and next the addition of tags, one change at a time.

@JAllemandou, my understanding is that if we get a newer version of Hive, we will be able to use Hive timestamp types with ISO-8601 string formats. If that's correct, I think we should keep ISO-8601 string as our convention, and then use Hive timestamps with them after we (one day) upgrade.

@Ottomata : I don't think Hive timestamp-functions will parse ISO by default. All functions doc gives examples without the 'T'.
Also, the above test with hive is also true in Spark.
When we upgrade to newer hive, we will be able to use hive timestamps, meaning we won't have strings anymore, and therefore format will not be that important because transformation will be done once at refine stage. For now, using format without 'T' facilitates usage a lot IMO.

Hm, no I mean once we get Hive 1.2.0+, we can do

SET SERDEPROPERTIES ("timestamp.formats"="yyyy-MM-dd'T'HH:mm:ssXXX");

https://issues.apache.org/jira/browse/HIVE-9298

On all tables, which I believe should allow use to use Hive timestamp types with ISO-8601 string fields in Parquet.

Nice! I don't see flaws in this approach @Ottomata, but before testing I don't want to say it'll work for all cases (timestamps work for us in hive, but not in parquet for instance).
Now how we move forward depending on the upgrade time is another thing.

Change 495399 merged by Ottomata:
[mediawiki/core@master] ApiMain.php api/request logging event changes

https://gerrit.wikimedia.org/r/495399

@Neil_P._Quinn_WMF, I discussed this with Joseph a bit more today and realized that even though what I said above is true, it might not be exactly what we want. I didn't realize that the Hive/Java format fields you mentioned in mediawiki_history were actually 'timestamp' fields, not 'datetime' ones. While we prefer string datetime fields in ISO-8601 format for both human and machine readability and consistency, there isn't a hard requirement to string datetimes over integer unix epoch timestamps. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Schema_Guidelines#Schema_set_up says

If you must use an integer unix epoch timestamp, send it in millisecond precision in UTC +00:00 timezone and name the field suffixed with 'ts', e.g. session_start_ts

There are good performances reasons to use integers over strings, and in case of mediawiki_history and Hive, that is what is happening. At the storage level, the Hive timestamp is stored as an integer in Parquet. Hive is just displaying it to you in Hive's preferred Java format. The fields there are named _timestamp, which while not strictly a _ts suffixed field), seems ok to me. If we could go back and change those field names to *_ts, I would, but I would not consider that a priority.

Change 496197 had a related patch set uploaded (by Addshore; owner: Ottomata):
[mediawiki/core@wmf/1.33.0-wmf.21] ApiMain.php api/request logging event changes

https://gerrit.wikimedia.org/r/496197

Change 496197 merged by Addshore:
[mediawiki/core@wmf/1.33.0-wmf.21] ApiMain.php api/request logging event changes

https://gerrit.wikimedia.org/r/496197

Stashbot added a subscriber: Stashbot.

Mentioned in SAL (#wikimedia-operations) [2019-03-13T16:36:32Z] <addshore@deploy1001> Synchronized php-1.33.0-wmf.21/includes/api/ApiMain.php: SWAT: T214080 T212529 ApiMain.php api/request logging event changes [[gerrit:496197]] (duration: 00m 57s)

I actually asked my team about their opinions on these two formats today (without telling them which one I liked), and I've included what I learned in these comments.

I'd be even more happy with standardizing on YYYY-mm-dd HH:MM:SS

But, I think we are a bit too far in for this. Everything except for these two Hive tables is using ISO-8601, and the only reason the Hive tables aren't is because they are Java based and Hive hasn't supported non Java timestamp formats in our versions yet.

Absolutely makes sense: stability should be a major priority.

I also like the fact that we are going with an ISO standard, and not a Java specific one.

Well, as @JAllemandou mentioned, this format is actually the ANSI SQL standard, so it's not just a Java specific thing. It's also the default string representation in Python and in Pandas.

unambiguous

While we do use UTC timezone in all of our time stamps, it is not impossible that we might have to parse ones with timezones one day. The ISO standard makes parsing these unambiguous, while it's not totally clear to me how timezones would be represented in the default Java format.

@mpopov mentioned that being able to handle timezoned datetimes would be useful: for example, we might want to store timestamps in the user's timezone so we can analyze activity relative to the standard daily cycle.

I did check the SQL standard, and it specifies that timezones are stored as the number UTC offset just like in ISO 8601, except that it doesn't mention Z as a synonym for +00:00. But as far as I can tell, most databases don't support timezoned datetimes (or, as the SQL standard calls them TIMESTAMP WITH TIMEZONE), and certainly Hive doesn't even in the most recent version. So I'm not sure how practical of a consideration this is.

a lot more human readable

While having the space does make the datetime slightly more human readable

I was going to insist on "a lot more", but no one else on the team mentioned readability, so "slightly more" is probably more accurate 😛

it also IMO makes it slightly less easy to parse. Log lines are often whitespace delimited, and having to add special logic to indicate that the dt is a single field with a space in it is a little annoying (I'm thinking of sed | awk type CLI pipelines).

Also seems fair.

I have an opinion and some information here.
We use string-encoded timestamps in mediawiki-history because our version of hive doesn't support timestamps in parquet (see https://issues.apache.org/jira/browse/HIVE-6384).
The reason we chose the SQL format instead of the ISO one is because Hive UDF work as-is for the former:

@nettrom_WMF mentioned being able to use UDFs as key consideration, and I think a couple other people seconded that.

There are good performances reasons to use integers over strings, and in case of mediawiki_history and Hive, that is what is happening. At the storage level, the Hive timestamp is stored as an integer in Parquet. Hive is just displaying it to you in Hive's preferred Java format.

Are you sure this is actually the case? Hive has supported the TIMESTAMP type since 0.8 (and can map that onto a number representing a Unix timestamp), but describe wmf.mediawiki_history just identifies the field as a STRING.

But if this is actually more performant, why don't we use integer Unix timestamps everywhere? That would be a fine way to standardize.

Are you sure this is actually the case

Oh I guess not! @JAllemandou I thought we were using Hive timestamps?

Hi Folks, I'll tr to provide more infoon Hive Timestamps and related formats.
There are (at least) two considerations when dealing with Hive datasets: the metastore (schema handler) and the file format (how data is actually stored/retrieved). In classical datastores, there is no such distinction, and therefore Timestamp type being available means usable. In hive, the Timestamp type is available in the metastore and in default-supported file formats, but with the version of Hive we have there is a conversion problem with Parquet (https://issues.apache.org/jira/browse/HIVE-9482). This is the reason why we use String and not Timestamp in Mediawiki-history.

Hm. But, also because the Hive/ANSI SQL (thanks for correction Neil, didn't realize that :) ) format works with the time-based functions, right? If not, then could we just switch the mediawiki_history event_timestamp etc. to event_dt ISO-8601 now? (I'm not saying we should!)

Absolutely right @Ottomata - We wanted to use Timestamps type to facilitate applying functions, but it was not feasible because of Parquet, so we went for the string format that allowed those functions.

In https://gerrit.wikimedia.org/r/#/c/analytics/refinery/source/+/497604/, Joal and I were considering the use of the '_timestamp' prefix on Hive timestamp fields. The fields here are ANSI SQL strings, but we'd prefer them to be Hive timestamp types once we upgrade Hive. For the moment, let's assume that they are Hive timestamps.

We have the convention that in event data, unix epoch timestamps use the ts convention:

If you must use an integer unix epoch timestamp, send it in millisecond precision in UTC +00:00 timezone and name the field suffixed with 'ts', e.g. session_start_ts

(Huh, I just realized that we have have 'millisecond' precision in the convention there! I'm not sure if we want that! Perhaps _ts should just be seconds, and _ts_ms should be milliseconds?).

My questsion is: should we also use the _ts convention for 'timestamp' native types, like Hive timestamps? My understanding is that these types are actually stored as integers anyway, so I'd lean towards yes. What do yall think?

@Ottomata, @JAllemandou it sounds like the long-term vision you have is using ISO strings for EventLogging and similar data streams and Unix timestamps for "Hive-native" datasets like mediawiki_history.

Why is that? If Unix timestamps provide easier access to datetime functions or higher performance, shouldn't we also use them when we ingest the EventLogging data in Hive? From an analyst's perspective, this is exactly the kind of difference that I want to get rid of through standardization!

My point is that, if we do maintain these two separate data types with different syntax, yes, it makes sense to have a standard way to distinguish them, like _dt and _ts. But I think it would be even better just to compromise on one and eventually change all of our datasets to use it.

it sounds like the long-term vision you have is using ISO strings for EventLogging and similar data streams and Unix timestamps for "Hive-native" datasets

Not quite!

We prefer the ISO-8601 strings for serialization everywhere. With a future version of Hive, we shhould be able to use the hive timestamp type with the serialization of the field being an ISO-08601 string datetime. The difference is that Hive interprets theses for you. Hive is just a SQL mapping on top of data, and using timestamp types gets us some extra Hive features that we don't have with just strings. So the ideal for us is: ISO-8601 serialization everywhere we can. Even if we can use Hive timestamp with ISO-8601 strings, I still think Hive will display these to you as ANSI SQL strings in its CLI query results. This ticket is about the preferred serialization format. The framework you use to query the fields can do whatever it wants (e.g. parsing ISO-8601 into python datetime type).

My questsion is: should we also use the _ts convention for 'timestamp' native types, like Hive timestamps? My understanding is that these types are actually stored as integers anyway, so I'd lean towards yes. What do yall think?

BTW, my understanding is limited here, I've not actually checked this. I think this is what our current version of Hive does. A newer version should allow us to use ISO-8601 strings serialization even with the Hive timestamp type. Reading this again now, my question doesn't make any sense. It would only make sense if we were trying to standardize on unix epoch integer serialization, but we aren't.

We prefer the ISO-8601 strings for serialization everywhere.

I want to point the serialization described here. When we use text formats (JSON for instance, as this is the one we chose), we want to serialize in ISO-8601. When using parquet however (assuming hive-parquet timestamp issue is behind), I'd rather use unix epoch integer serialization through timestamp type, as it would prevent the machines to spend cycles converting strings to timestamps at every request.