
Timestamp format in Hive-refined EventLogging tables is incompatible with MySQL version
Closed, Resolved · Public

Description

As documented in https://meta.wikimedia.org/wiki/Schema:EventCapsule , EventLogging tables are currently using timestamps in the YYYYMMDDHHMMSS format.

However, it appears that the new Hadoop-based EL tables in Hive use Epoch timestamps instead, at least the one being generated via this pilot process: T177783: Eventlogging refine popups, temporary cron.[1]
(On the other hand, those EL tables on Hive that were imported from MySQL with Sqoop appear to still use YYYYMMDDHHMMSS; see e.g. the example query at T172322#3551131.)

I can probably live with this in the present example (tbayer.popups). But in general this will break a great many existing queries. Of course using Epoch has advantages too (e.g. calculating relative time differences becomes much easier), and hopefully the new processes will consistently add partitions for year, month, day and hour, so that these can be used for restricting queries to certain timespans instead. But it still increases the switching costs a lot, so there needs to be an informed decision about this in case Epoch is chosen for the general approach.

[1]
SELECT MIN(timestamp) FROM tbayer.popups WHERE year = 2017 AND month = 10 AND day = 31 AND hour = 0;

_c0
1509408000
1 row selected (32.214 seconds)
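
For reference, converting that value under the assumption that it is UTC epoch seconds (a quick illustrative Python check, not part of any production code) shows it maps back cleanly to the partition it was read from, and what the equivalent MediaWiki-format timestamp would be:

from datetime import datetime, timezone

epoch_ts = 1509408000  # MIN(timestamp) from the query above
dt = datetime.fromtimestamp(epoch_ts, tz=timezone.utc)

print(dt.isoformat())               # 2017-10-31T00:00:00+00:00
print(dt.strftime('%Y%m%d%H%M%S'))  # 20171031000000 (YYYYMMDDHHMMSS / MediaWiki format)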

Event Timeline


I just looked into what EventLogging is doing to these timestamps. It's kinda crazy!

The timestamp comes in from clients in the format we prefer, ISO-8601, e.g. 2017-11-02T13:55:37.

EventLogging then parses out this and converts to integer epoch:
https://github.com/wikimedia/eventlogging/blob/master/eventlogging/parse.py#L82-L88

Then, when it writes to MySQL, it converts again, this time to the 'Mediawiki' timestamp format:
https://github.com/wikimedia/eventlogging/blob/master/eventlogging/jrm.py#L34
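
As a rough illustration (this is not the actual EventLogging code, just the equivalent logic sketched in plain Python), the two conversions amount to something like:

import calendar
from datetime import datetime

iso_ts = '2017-11-02T13:55:37'  # as received from varnish

# parse.py-style step: ISO-8601 -> integer epoch seconds (UTC)
epoch_ts = calendar.timegm(datetime.strptime(iso_ts, '%Y-%m-%dT%H:%M:%S').timetuple())
print(epoch_ts)  # 1509630937

# jrm.py-style step: integer epoch -> 14-digit Mediawiki timestamp for MySQL
print(datetime.utcfromtimestamp(epoch_ts).strftime('%Y%m%d%H%M%S'))  # 20171102135537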

I'm not sure what is best to do here. I'd prefer if EventLogging didn't do this conversion at all. Perhaps we should make EventLogging stop converting the incoming ISO-8601 timestamps to ints, but then modify the MySQL conversion code to know how to convert from ISO-8601 timestamp to Mediawiki format, to maintain backwards compatibility with MySQL tables.

This would have the advantage of standardizing the timestamp format to the format we are using everywhere else (except for the MySQL tables). The disadvantage is that the schema of the EventCapsule would have to be changed to allow a string-typed timestamp field. This would also break any existing Hive tables already imported with JsonRefine (T162610), as the type of the field would change. We haven't productionized any EventLogging Hive tables yet, so I think this is ok.

This wouldn't address Tilman's problem of having to change queries to suit Hive. I'm not really sure what we can do there though. We should do as little transformation to the data as possible, so we can keep the importing code generic and not special cased. I don't think that adding a Mediawiki timestamp to the data is a good idea.

Interesting findings! Food for thought... we should probably reach out to other users of this data to get more input on the best choice going forward; how about posting to Analytics-l?

One followup question for now:

I just looked into what EventLogging is doing to these timestamps. It's kinda crazy!

The timestamp comes in from clients in the format we prefer, ISO-8601, e.g. 2017-11-02T13:55:37.

My understanding was that the timestamp is generated server-side, presumably on the Varnishes - is that what "clients" refers to here? Or does the timestamp actually record the local time as seen by the user agent (browser)?

Ah, you are correct, it is from the varnishes.

As documented in https://meta.wikimedia.org/wiki/Schema:EventCapsule , EventLogging tables are currently using timestamps in the YYYYMMDDHHMMSS format.

Tilman, you made the schema change that changed the EventCapsule timestamp field to Mediawiki timestamp format! :o
https://meta.wikimedia.org/w/index.php?title=Schema:EventCapsule&diff=prev&oldid=16479585

This is actually very incorrect. The EventCapsule timestamp is utc-millisec[1]. The code responsible for inserting into MySQL does some special conversion to Mediawiki format. The only place this is used is for the MySQL tables. In fact, if we had updated the EventLogging codebase to use this (or a later) revision of the EventCapsule schema, all insertion of events into MySQL would have broken. Currently, EventLogging hardcodes the EventCapsule schema version, which is pinned at 15423246.

[1]

[@stat1005:/srv/log/eventlogging/archive] $ zcat /srv/log/eventlogging/archive/all-events.log-20171101.gz | head -n 4 | jq .timestamp
1509431228
1509431228
1509431228
1509431228

As documented in https://meta.wikimedia.org/wiki/Schema:EventCapsule , EventLogging tables are currently using timestamps in the YYYYMMDDHHMMSS format.

Tilman, you made the schema change that changed the EventCapsule timestamp field to Mediawiki timestamp format! :o
https://meta.wikimedia.org/w/index.php?title=Schema:EventCapsule&diff=prev&oldid=16479585

This is actually very incorrect. The EventCapsule timestamp is utc-millisec[1].

No, these are not in the utc-millisec format, at least according to the documentation I can find: https://www.npmjs.com/package/json-gate : "A number or an integer containing the number of milliseconds that have elapsed since midnight UTC, 1 January 1970." There are no milliseconds in this data; these are integer Epoch seconds only.
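
A quick illustrative Python check (using one of the values from the all-events.log sample) shows this: interpreting these values as seconds gives a plausible 2017 date, while interpreting them as milliseconds would land in January 1970:

from datetime import datetime, timezone

value = 1509431228  # from all-events.log-20171101

print(datetime.fromtimestamp(value, tz=timezone.utc))         # 2017-10-31 06:27:08+00:00 (seconds)
print(datetime.fromtimestamp(value / 1000, tz=timezone.utc))  # 1970-01-18 11:17:11.228000+00:00 (if milliseconds)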

The code responsible for inserting into MySQL does some special conversion to Mediawiki format. The only place this is used is for the MySQL tables.

Well, as is often the case, we have different perspectives on this question; mine is that of the users of this data, but I do appreciate the engineers' internal technical viewpoint as well. From the data analyst perspective, this edit was simply a correction to accurately reflect the status quo as it had presented itself to the vast majority of users of EL data for several years at that point. (And IIRC I had previously asked @ori and/or other knowledgeable people about this discrepancy, to no avail.) Because the data obviously does not have millisecond resolution, it did not occur to me that there could be an equivalent but different internal representation.

That's what "documented" refers to in the task description. I can see your point about the need to reflect the internal format too, but changing it back would again make this documentation confusing and misleading for the users of this data, and also reintroduce the factually wrong "millisecond" statement.

In fact, if we had updated the EventLogging codebase to use this (or a later) revision of the EventCapsule schema, all insertion of events into MySQL would have broken. Currently, EventLogging hardcodes the EventCapsule schema version, which is pinned at 15423246.

Shouldn't it actually be at 16619630, considering that the user agent changes have already been implemented in T153207?

[1]

[@stat1005:/srv/log/eventlogging/archive] $ zcat /srv/log/eventlogging/archive/all-events.log-20171101.gz | head -n 4 | jq .timestamp
1509431228
1509431228
1509431228
1509431228

Anyway, this discussion about how to document the format(s) on Meta is a bit offtopic for this task - we should decide and implement the timestamp for the new Hive-refined tables first, and then think about how/whether to reflect it on the Schema:EventCapsule page. What do you think about the idea to reach out on Analytics-l?

No, these are not in the utc-millisec format

Ya that's true. The schema and code support (and expect) milliseconds. I guess a seconds value just validates against utc-millisec too.

we have different perspectives on this question; mine is that of the users of this data

Aye! But MySQL users are not the only users of this data! The performance team folks use it to build Grafana dashboards.

Shouldn't it actually be at 16619630, considering that the user agent changes have already been implemented in T153207?

I think all the recent EventCapsule schema changes have been documentation only, so there has been no need to update the hardcoded schema version. (BTW, this EventCapsule causes a lot of engineering problems, and I hope we can get rid of it one day)

I can see your point about the need to reflect the internal format too, but changing it back would again make this documentation confusing and misleading for the users of this data, and also reintroduce the factually wrong "millisecond" statement.

What I really want to do is to remove special cases and make everything consistent. The Mediawiki MySQL timestamp format is a very special case. I don't really like the epoch timestamp either, but at least it is a little more standard. I'd like to use ISO-8601 date-times instead, but keep the integer timestamp for backwards compatibility with downstream consumers, including the EventLogging MySQL database.

There are also special case issues with the userAgent field. See T179625 for a proposal on how to fix all this.

we should decide and implement the timestamp for the Hive-refined tables first,

The EventLogging analytics source (varnish) timestamp is ISO-8601, the EventLogging EventBus timestamp is ISO-8601, and the timestamps in other Hive tables are ISO-8601. To be consistent, and to reduce special casing, the timestamps for EventLogging analytics data in Hive should also be ISO-8601.

What do you think about the idea to reach out on Analytics-l?

In this particular case, I don't think the format is up for much discussion. Perhaps we can make a Hive UDF that makes transforming the date-time into Mediawiki timestamps easier?

changing it back would again make this documentation confusing and misleading for the users of this data, and also reintroduce the factually wrong "millisecond" statement.

Been trying to find the proper format for an integer timestamp in JSONSchema. It looks like utc-millisec was removed in draft04 anyway.
https://tools.ietf.org/rfcdiff?url1=draft-zyp-json-schema-03.txt&url2=draft-zyp-json-schema-04.txt

So we should probably just make this an integer.

fdans edited projects, added Analytics-Kanban; removed Analytics.
fdans moved this task from Next Up to In Progress on the Analytics-Kanban board.

Aye! But MySQL users are not the only users of this data! The performance team folks use it to build Grafana dashboards.

Hm, I wonder if they need timestamp though. They might not. If we add an ISO-8601 dt field, and no one else but MySQL uses timestamp, perhaps we can just keep timestamp and make it full-on Mediawiki format in the JSON data for backwards compatibility. I'm asking Timo. We'll also need to change Camus imports to use dt instead of timestamp.

I talked with Timo and with the A-team, and we decided it'd be best to remove timestamp altogether in favor of dt. This will make future maintenance and transitions (possibly off of the Python-based EventLogging code) easier, as there will be fewer special cases. This does mean that users will have to change queries to use Hive, but they'd have to do that anyway if we used the integer timestamp. We would like to provide a UDF to help with that, something along the lines of SELECT MediawikiTimestamp(dt) ....
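
To sketch what such a UDF would have to do (in Python, purely for illustration - the UDF does not exist yet and MediawikiTimestamp is only a placeholder name):

import re

def mediawiki_timestamp(dt):
    """e.g. '2017-11-02T13:55:37Z' -> '20171102135537'"""
    return re.sub(r'\D', '', dt)[:14]

print(mediawiki_timestamp('2017-11-02T13:55:37Z'))  # 20171102135537

Until a real UDF exists, the same transformation can presumably be approximated in Hive with the built-in regexp_replace(dt, '[^0-9]', '').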

Talked with Nuria and Dan a bit today, and I feel like I should say a little more about this.

The JSON Refinement project, AKA EventLogging in Hive T162610, is an effort to be able to consume arbitrary "schema-less" (I can explain this more if you are interested) JSON datasets into Hadoop, a strongly typed world. EventLogging analytics (wiki-based schemas) is not the only JSON data out there. We will be consuming data from EventBus, Mediawiki Job Queue events, and more as time goes on. We don't want to have to maintain special cases for all these disparate sources of data.

JSON Refinement is a new way of ingesting data into Hive that will be used to create Hive tables of EventLogging analytics data.

The Popups Hive table is using a not-yet-ready-for-production version of this code. I've done the first step of productionizing the refinement process already (you can see a bunch of EventLogging Hive tables in the Hive event database), but we are still working out some kinks. This timestamp discrepancy you found, as well as the userAgent one mentioned in T179625, are some of those kinks.

We are doing our best to maintain backwards compatibility with existing systems, like the EventLogging MySQL databases. (It does seem there was some bad mojo around that parsed userAgent change T153207; apologies, we need to do better than that). But we won't be able to be 100% backwards compatible in new systems. There will be changes in any new system, and we need to do our part to communicate those changes and provide instructions on how to work with them. Documenting how to use EventLogging analytics tables in Hive will be done, and announcements to analytics-l will be made. This project is still in the tweaking stages, so that's why there hasn't been any communication about this yet.

Aye! But MySQL users are not the only users of this data! The performance team folks use it to build Grafana dashboards.

Hm, I wonder if they need timestamp though. They might not. If we add an ISO-8601 dt field, and no one else but MySQL uses timestamp, perhaps we can just keep timestamp and make it full-on Mediawiki format in the JSON data for backwards compatibility. I'm asking Timo. We'll also need to change Camus imports to use dt instead of timestamp.

This may already have been clarified elsewhere, but just in case: They do in fact have timestamp-based dashboards, e.g. https://grafana.wikimedia.org/dashboard/db/navigation-timing?refresh=5m&orgId=1 .

On the other hand, I just learned from @elukey and @JAllemandou that Grafana timestamps (at least those underlying dashboards like https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?var-schema=Popups ) are actually not identical to those in the event capsule:

[10:16:13] <joal> HaeB: Grafana don't use inside-event-timestamps - it uses machine timestamp (therefore reflects events flowing)
[10:18:03] <HaeB> inside-event in what sense?
[10:18:35] <HaeB> they are all server-side timestamps, no?
[10:18:41] <joal> HaeB: inside-event meaning a timestamp stored in the event data, allowing for delay in treatment
[10:19:06] <elukey> the timestamp is the one that varnishkafka registers when sending the event to kafka
[10:19:30] <joal> HaeB: in grafana, the timestamp is server-side generated by kafka - while the timestamp in EL events is generated by VK

[10:19:30] <joal> HaeB: in grafana, the timestamp is server-side generated by kafka - while the timestamp in EL events is generated by VK

This isn't totally true: the timestamp (for all statsd-based metrics, and most graphite metrics) is based on the time at which statsd/graphite receives the message, not on any content timestamp at all. To get graphs based on the event content timestamp, you need to send to graphite specifying the timestamp itself somehow.

I talked with Timo and with the A-team, and we decided it'd be best to remove timestamp altogether in favor of dt. This will make future maintenance and transitions (possibly off of the Python-based EventLogging code) easier, as there will be fewer special cases.

I still think such a large breaking change should have been discussed with a wider group of affected users. But to go forward:

This does mean that users will have to change queries to use Hive, but they'd have to do that anyway if we used the integer timestamp. We would like to provide a UDF to help with that, something along the lines of SELECT MediawikiTimestamp(dt) ....

Not quite sure what "anyway" referred to exactly, but I have broken this proposal out into a new task: T186155

Not quite sure what "anyway" referred to exactly

This referred to the fact that the timestamp field in EventLogging data was never in MediaWiki format. It was always an integer. There is custom code to convert this to the MediaWiki timestamp format in the EventLogging MySQL consumer, which is why you see MediaWiki timestamps in MySQL.