Page MenuHomePhabricator

Clients need to generate an ISO 8601 formatted timestamp
Open, HighPublic

Description

In order to send events having the meta.dt field. Product teams have indicated that these timestamps must be generated client-side, and this is doubly important due to the upcoming buffering of HTTP requests for sending events. Should this be something incorporated into MediaWiki core?

Also something to debate is whether the time recorded in the timestamp should be in UTC (as supported natively in ECMAScript 5), or reflect the local time offset (still being formatted in ISO 8601). The latter is supported in the ISO spec, just not in ECMAScript 5. Both could theoretically be supported, but for the moment I'm interested in what we would want for e.g. client-generated events which need to carry a client-side timestamp as one of their properties.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

If we can, everything in UTC is preferred. I think if we want to record the client's timezone, we should either use a different dt field, or just record the timezone itself as its own field.

If we can, everything in UTC is preferred. I think if we want to record the client's timezone, we should either use a different dt field, or just record the timezone itself as its own field.

Yes I agree. I didn't even think much about this but I wanted to make sure we considered use-cases that might somehow want the timezone offset. I agree that the best approach would be to allow schemas desiring one to provide a separate field for it. Otherwise the 'dt' will not be specified very strictly and it will be a PITA to query.

I can get down with UTC in meta.dt with a separate field for timezone – e.g. meta.tz.

Just want it noted that to facilitate querying in Hive, where there's a from_utc_timestamp({any primitive type} ts, string timezone) UDF (see Date Functions), the timezone should – ideally – be what Hive supports. (I have not been able to find such a list.)

@Ottomata: what are your reservations against local time + UTC offset?

Ottomata added a comment.EditedDec 12 2019, 3:34 PM

It isn't just hive. The dt is used for the Kafka message timestamp (it should get propertly translated into a UTC unix ts before set though independent of what the ISO TZ is), as well as for hive hourly bucket partitioning. I believe that Camus will properly interpret the ISO TZ when bucketing, but this means that we would partitions that might say e.g. hour=0, but records in that partition might have meta.dt = 2019-12-12T05:00:00+0500. Not a huge deal, but could be confusing.

Another issue: we decided to support server side setting of meta.dt if not set by client in EventGate: T240477: eventgate-wikimedia should fill in defaults for some important fields. If we do this, EventGate will always set UTC, independent of what the client's TZ if. This might get confusing if some clients in +0500 set meta.dt (with TZ info), but others don't.

Anyway, if there were really good reasons to not always use UTC, we I'm sure we could figure out how to support it, but I'm afraid it would complicate things unnecessarily.

(removed comment about sortability, that was wrong)

jlinehan added a comment.EditedDec 12 2019, 5:05 PM

Another issue: we decided to support server side setting of meta.dt if not set by client in EventGate: T240477: eventgate-wikimedia should fill in defaults for some important fields.

I think even if we have support for this, that it should not be used on analytics events. In general I am extremely wary of these defaults becoming too widely used. I think for the GeoIP and (arguably) UA use-cases that they were originally conceived for, there is unfortunately no alternative. But I think to go beyond that is something that should be done with great caution.

The case of 'dt' would illustrate why I am cautious, because in order to tell whether the value was recorded on the client or on the server, one would have to check what the defaults were etc., and that is something that may or may not be apparent. Plenty of people right now are not aware that timestamps are generated server-side in the current system.

With analytics events, we have a very clear requirement that timestamps be generated client-side. I think that the concern about 'dt' we discussed on the error logging client was that @Nuria pointed out that some browsers did not support Date.prototype.getISOString(). I think that is best remedied with a polyfill in MediaWiki. If no polyfill can be arranged (for some reason), then I would rather see the browser provide what it can as a high-resolution UTC timestamp in some format, and have EG convert that to ISO 8601 format, rather than compute its own UTC time.

jlinehan triaged this task as Low priority.Dec 13 2019, 5:25 PM
jlinehan moved this task from Inbox to Done! on the Better Use Of Data board.

Okay, so looking at Date.prototype.toISOString()'s support across browsers on https://kangax.github.io/compat-table/es5/#test-Date.prototype.toISOString (from https://caniuse.com/#search=Date.prototype.toISOString):

So, really, the big offender is IE 8 – which we still support at a basic level https://www.mediawiki.org/wiki/Compatibility#Browser_support_matrix

I wanted to see how limited we would be if we, say, just said "no MEP-CL events from IE 8" so I looked at IE versions of events we received from/to certain schemas in 2020-02:

use event;
select distinct 'EditAttemptStep' as `schema`, useragent.browser_major
from editattemptstep
where year = 2020 and month = 2
  and useragent.browser_family = 'IE'
union all
select distinct 'SearchSatisfaction' as `schema`, useragent.browser_major
from searchsatisfaction
where year = 2020 and month = 2
  and useragent.browser_family = 'IE'
union all
select distinct 'WikipediaPortal' as `schema`, useragent.browser_major
from wikipediaportal
where year = 2020 and month = 2
  and useragent.browser_family = 'IE'
union all
select distinct 'VirtualPageView' as `schema`, useragent.browser_major
from virtualpageview
where year = 2020 and month = 2
  and useragent.browser_family = 'IE'
union all
select distinct 'HomepageVisit' as `schema`, useragent.browser_major
from homepagevisit
where year = 2020 and month = 2
  and useragent.browser_family = 'IE'
schemaIE 2???IE 3???IE 4???IE 5???IE 6IE 7IE 8IE 9IE 10IE 11IE 42582???
EditAttemptStep
HomepageVisit
SearchSatisfaction
VirtualPageView
WikipediaPortal

Okay, so user agent parsing isn't perfect but it does tell us that we shouldn't just cut out IE 8 users – as that's still valuable data for teams (e.g. Growth team's Homepage analytics).

So it looks like it's worth investing in adding a polyfill to MediaWiki, as mentioned earlier.

@mpopov I thought we removed js support for IE8 a while back. See: https://phabricator.wikimedia.org/T118303 and https://github.com/wikimedia/mediawiki/blob/master/resources/src/startup/startup.js

Do not believe those UAs , most IE<insert-low-number-here> are bots

@Nuria: Oh cool! Thank you for the links! That makes this a lot easier.

Alright, folks. Let's just use new Date().toISOString() to populate meta.dt on client-side, then??

@Nuria: Oh cool! Thank you for the links! That makes this a lot easier.

Alright, folks. Let's just use new Date().toISOString() to populate meta.dt on client-side, then??

We might still need to gate the code to check for Date.prototype.toISOString to avoid throwing exceptions on unsupported clients, but we can discuss in CR. I'm fine with marking this resolved since it looks like it won't require a MW patch.

mpopov closed this task as Resolved.Feb 28 2020, 7:14 PM

+1 to doing a support check

jlinehan reopened this task as Open.Apr 28 2020, 7:06 PM

Re-opened this task after @nshahquinn-wmf mentioned that it would be nice to distinguish between client-generated and server-generated values for meta.dt. This seems reasonable, we should continue to track this and adjust as necessary.

+1. Maybe EventGate can set a dt_received (or similarly named field)

Maybe EventGate can set a dt_received (or similarly named field)

I was ok with making eventgate set meta.dt since it is a field we require for all schemas. Instead of adding more custom logic to eventgate, perhaps we should let let eventgate set meta.dt, but if events need a more specific event time (and don't want to set meta.dt) they can set it in different field?

We do use meta.dt for the timestamp in Kafka though. This means that it will be used for both the Hive hourly partitioning, and for any timestamp based consumption in Kafka. If this is different than the event time, the semantics might be off.

mpopov added a comment.EditedMay 11 2020, 3:59 PM

Okay, after chatting with @nshahquinn-wmf about this, here's a proposal:

  • In the secondary (analytics) repository, we change the meaning of meta.dt to be received timestamp and require that analytics clients DO NOT set it, so that EventGate can set it. This does not require any new custom logic for EventGate – it just does what it's been configured to do.
  • We then add a meta.client_dt (or similarly named field) as a common field in the secondary repo and change the event platform client / Ext:EventLogging to fill THAT field with the client-side timestamp to track when the event was generated.
  • It's okay to use meta.dt for partitioning since we can't trust the client's date/time settings (unless that client is us, but that stuff lives in the primary (mediawiki) repository anyway)

Again, in the case of primary "tier 1" mediawiki events where (1) there's not much of a lag between when an event is generated and when it is received, and (2) we trust the client's date/time settings because that client is us, everything's good. In the case of secondary "tier 2" analytics events, (1) there may be significant lag between when an event is generated, queued up, sent, transmitted, and received, and (2) we absolutely cannot trust the client's date/time settings to be accurate for 100% of the clients, everything is not good. Therefore, we definitely have a need for a distinction between timestamps marking event generation vs event receipt. It is much easier to change the instrumentation than it is to change EventGate, so the above approach seems like the way to go.

@jlinehan @Ottomata: what do you think?

I think I like. It might make fancy stream processing stuff in the future confusing if we don't use the event time as the Kafka timestamp, but I think for now this is a good and non intrusive solution.

As for naming, this concept of client time is usually referred to as 'event time'. I'd also avoid adding more to meta if we can. Could the analytics common schema just have an event_dt field top level for this?

Yay!

As for naming, this concept of client time is usually referred to as 'event time'. I'd also avoid adding more to meta if we can. Could the analytics common schema just have an event_dt field top level for this?

Sure, event_dt top level field sounds good to me if meta is off-limits.

For our future reference, is there a specific reason why meta needs to be kept as-is as much as possible? Feel free to link if you've written about this elsewhere. I hope this doesn't come across as adversarial, it just seems that metadata about an event such as event time should go into meta! I'm genuinely curious and just want to understand the restriction so we can avoid making suggestions like that in the future.

If I could go back in time I would get rid of the meta subobject altogether. meta is currently defined in the primary fragment/common schema, and jsonschema-tools isn't so great at merging subobjects. So, to define your own meta, you'd have to make an entirely new meta in your fragment/analytics/common and copy/paste the needed fields from fragment/common and not reference the primary fragment/common one at all. I think it will be simpler if we make all schemas reference the same primary fragment/common schema, and any other 'common' fields be defined by a separate fragment schema, e.g. fragment/analytics/common.

Just had a discussion (prompted by other things) with Joseph and Nuria. If possible, we'd like to reject events that have meta.dt set to values too far from the current time. Since other events already use and set meta.dt to represent event time, we can't and shouldn't make eventgate always set it to receive time. We use meta.dt for the Kafka timestamp as well as the Hive hourly partitioning key. If meta.dt is way off, the event might be lost in raw data and never make it to refined Hive tables. I think we should be able to restrict the value off meta.dt using JSONSchema validation. Not sure, but we should try!

I suppose that's fair, although the follow-up question is how far is too far? We still would like events which are queued up and persisted, perhaps for up to a day or more, to make it into the database. If a mobile app user reads articles offline (via the reading list feature) for a week and then comes online at the end of the week, the app would start sending any events generated while they were offline. What happens to them? Which partition do they end up in?

Would you be okay with adding logic to EventGate to set a received_dt, then?

Ah, the rejection would only happen if the client sets meta.dt to something bad. If you leave it blank, eventgate will set it and it will effectively be receive time. You can still set event_dt to whatever you want! :)

@jlinehan @nshahquinn-wmf @mforns @Ottomata: Okay, so how do y'all feel about the following:

  • Add event_dt as a top-level field in fragment/analytics/common schema
  • Change EPC for iOS/Android and the MEP component of EventLogging to record event generation time in event_dt instead of meta.dt (ext.eventLogging/core.js#L129-L134)
  • Document in code and in schema that meta.dt is left empty at log time in order to have EventGate set it, making it receive time

@EBernhardson: Hi!!! Just wanted to ping you and see what you thought about this proposal to have generated & received timestamps in analytics events via EventLogging core?

Makes sense to me!
Maybe I'd use another name for event_dt that explains how it is different from meta.dt?
Maybe client_dt?

@mforns I suggested event_dt in https://phabricator.wikimedia.org/T240460#6125463 mainly because that is what this is referred to in stream processing worlds. Also, client_dt might not make sense in other situations, like a syslog event or something. This field will be used to specify the time the event happened.

I agree though, having meta.dt sometimes be event time and sometimes the 'eventgate receive time' might get a bit confusing, but I don't think we can change meta.dt now. It is already being used as the event time for all of the EventBus sent mediawiki events. Also, if we can, we do want to use the event time as the Kafka timestamp and Hive partition timestamp, its just that in some cases where we don't trust the producers we can't be sure they will be setting event times that make sense.

Change 597617 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[schemas/event/secondary@master] Add fragment/analytics/common schema

https://gerrit.wikimedia.org/r/597617

Although, on second thought, maybe client_dt is better here after all? Since this will be used only by analytics / instrumentation schemas, perhaps client makes sense? meta.dt is the default event time and will always be used for kafka timestamps and hive partitioning. Maybe client_dt is actually more descriptive for what it is being suggested for here.

@mpopov @jlinehan while working on some docs, I submitted https://gerrit.wikimedia.org/r/597617 so that my docs wouldn't be totally incorrect :p. Let me know what you think.

mpopov added a comment.EditedMay 27 2020, 3:21 PM

Although, on second thought, maybe client_dt is better here after all? Since this will be used only by analytics / instrumentation schemas, perhaps client makes sense? Maybe client_dt is actually more descriptive for what it is being suggested for here.

With @mforns preferring client_dt, myself preferring client_dt, and @jlinehan being more comfortable with client_dt than event_dt, it kinda looks like we should go with client_dt :P Updating patch to use that instead...

Ok let's do it.

Change 597617 merged by Bearloga:
[schemas/event/secondary@master] Add fragment/analytics/common schema with client_dt

https://gerrit.wikimedia.org/r/597617

Change 599899 had a related patch set uploaded (by Bearloga; owner: Bearloga):
[mediawiki/extensions/EventLogging@master] Record event generation time in client_dt

https://gerrit.wikimedia.org/r/599899

Change 599899 merged by jenkins-bot:
[mediawiki/extensions/EventLogging@master] Record event generation time in client_dt

https://gerrit.wikimedia.org/r/599899

Change 604147 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[schemas/event/secondary@master] Make analytics/legacy/eventcapsule use analytics/common

https://gerrit.wikimedia.org/r/604147

@Ottomata: unless I'm mistaken your port of search satisfaction to MEP is the only way we have of seeing how this is working in production, right?

There is nothing using mw.eventLog.submit yet. Either the SearchSatisifaction port, or the session_tick event, whichever comes first :)

Change 604147 merged by Bearloga:
[schemas/event/secondary@master] Make analytics/legacy/eventcapsule use analytics/common

https://gerrit.wikimedia.org/r/604147

sdkim moved this task from Done! to Inbox on the Better Use Of Data board.Thu, Oct 29, 4:38 PM
Ottomata added a subscriber: nettrom_WMF.EditedThu, Nov 5, 8:00 PM

Hey all we need to revisit one decision here.

From: https://wikitech.wikimedia.org/wiki/Event_Platform/EventLogging_legacy#timestamp_field_semantics

The deprecated eventlogging server side would set the dt field to the time it received the event. This is no longer true once a legacy EventLogging stream is migrated to Event Platform. The new semantics are

  • meta.dt - server side receive time (unless the client explicitly sets this field). This field is used for hive hourly partitioning.
  • client_dt - client side date-time timestamp.
  • dt - deprecated, but set to the same value of client_dt

dt is a legacy EventLogging field, but it is very likely that many people are using this field for analysis. By setting it to client_dt, this changes the semantics. In order to keep compatibility, I propose we default dt to the eventgate receive date-time, just like meta.dt This will keep it a server side receive timestamp (if it is not set by the client), just as it is in the deprecated eventlogging backend.

@mpopov @jlinehan @Mholloway @mforns @nettrom_WMF thoughts?

mforns added a comment.Thu, Nov 5, 8:09 PM

In order to keep compatibility, I propose we set dt to the same value of meta.dt in eventgate-wikimedia. This will keep it a server side receive timestamp, just as it is in the deprecated eventlogging backend.

Makes sense to me.

I updated my comment slightly, the proposal is to default it to the eventgate receive time (not the value of meta.dt). Most of the time meta.dt and dt will be the same, unless the client explicitly sets one or the other.

Change 639621 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[eventgate-wikimedia@master] Default dt to now

https://gerrit.wikimedia.org/r/639621

In order to keep compatibility, I propose we set dt to the same value of meta.dt in eventgate-wikimedia. This will keep it a server side receive timestamp, just as it is in the deprecated eventlogging backend.

Makes sense to me.

+1

Change 639622 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/extensions/EventLogging@master] Allow EventGate to default dt to server side receive time

https://gerrit.wikimedia.org/r/639622

jlinehan added a comment.EditedThu, Nov 5, 10:27 PM

@Ottomata One thing to be aware of (which slipped my mind) is that we plan to have the dt field be used for the *client-side* timestamp (today this is client_dt). The reason for this change is

  • The client-side timestamp is the only intuitive data that would be in a top-level timestamp field, since it corresponds to when the event was generated.
  • It is also the most diagnostically relevant for analysts, and hence should be afforded the prominence of a succinct, top-level field without qualifiers.

Therefore if we do set dt to be the server-side timestamp for these schema, we should be aware that those semantics are either a.) going to be different for new schema or b.) change at some point in the future for these legacy schema. So... I wonder if we shouldn't just have dt be the client-side timestamp?

The downside here is that before the migration, the timestamps will be server-side (legacy semantics), while afterwards, they will be client-side. We can explore whether this is a problem, though (@mpopov @nettrom_WMF @nshahquinn-wmf -- any thoughts?).

A couple words about the server-side timestamp since we're on this ticket -- while initially we dismissed any server-side timestamp as being of no value, what we've seen (especially on the apps) is that a server-side timestamp is really useful for helping us measure the delta between event generation time and event ingestion time, which tells us a lot of information about network conditions, how our buffers are operating, and how stale the data is.

Therefore I think we will continue to want both, it's just a matter of how they are labeled. In particular though, the server-side timestamp doesn't need to be hiding somewhere out of sight (as it arguably is in meta.dt). But we can tackle some kind of 'promotion' from meta.dt to a better-named field sometime down the road -- I understand that meta.dt is intertwined with the implementation of the pipeline in certain places.

hence should be afforded the prominence of a succinct, top-level field without qualifiers.

By qualifiers here you mean the 'client_' part?

while initially we dismissed any server-side timestamp as being of no value,

Server side timestamps are also necessary for backend processing, since client side timestamps can be arbitrarily set to something in the future or in the past.

In particular though, the server-side timestamp doesn't need to be hiding somewhere out of sight (as it arguably is in meta.dt)
is that we plan to have the dt field be used for the *client-side* timestamp

I'm not sure I understand the problem here. In all cases, the server side only defines these as defaults. If the client sets any of these timestamps, the server side won't touch it.

Why not just keep meta.dt being the server side timestamp (unless the client sets it), and client_dt being the client side event timestamp, and then keep dt around only for backwards compatibilty as a server side timestamp?

jlinehan added a comment.EditedFri, Nov 6, 8:03 PM

hence should be afforded the prominence of a succinct, top-level field without qualifiers.

By qualifiers here you mean the 'client_' part?

Yup!

while initially we dismissed any server-side timestamp as being of no value,

Server side timestamps are also necessary for backend processing, since client side timestamps can be arbitrarily set to something in the future or in the past.

Sorry, I should have clarified -- I mean 'no value' for the purposes of data analysis, i.e. we thought a client-side timestamp replacing a server-side timestamp was all we cared about. So, in other words, yes we have both client- and server, but when we didn't care about the server-side timestamp, the thinking was "if the system wants to stick a server-side timestamp on it, the semantics can be whatever it wants, and it can call it whatever it wants, because we won't be using it for analysis."

Now, though, we know that we are actually interested in server timestamps, so it would be nice if the server and client timestamps were both named in a nice, regular way and followed regular semantics. However this is kinda off topic for the point I'm trying to make, sorry.

I'm not sure I understand the problem here. In all cases, the server side only defines these as defaults. If the client sets any of these timestamps, the server side won't touch it.

Why not just keep meta.dt being the server side timestamp (unless the client sets it), and client_dt being the client side event timestamp, and then keep dt around only for backwards compatibilty as a server side timestamp?

We have three timestamp fields (dt, meta.dt, client_dt) with variable semantics, and we only need two timestamp fields with fixed semantics. Why not use this fix to kill one of the fields and go to fixed semantics? The field to go ought to be client_dt -- it's the ugliest, but it's also the only field that can go, since the legacy schemas will have a dt field. We probably should have just used dt to begin with, but that's hindsight.

Using dt for the client-side timestamp will clash with the legacy semantics of the dt field (where it's a server-side timestamp), but I don't think it matters. In a few months, nobody will even remember that it used to be server-side. From a design perspective, it makes more sense for the dt field to be, well, the timestamp of when the event happened, rather than rope it off as a legacy-only thing.

TL;DR - we should make the dt field the client-side timestamp, rip out client_dt, and keep using meta.dt as the server-side timestamp.

TL;DR - we should make the dt field the client-side timestamp, rip out client_dt, and keep using meta.dt as the server-side timestamp.

Hm. I'd be ok with this. And, I think we can get away with not changing legacy semantics. In the patch I made above, eventgate-wikimedia will only set dt if it isn't already set. So, in EventLogging JS, we can avoid setting it for legacy events (ones that come through mw.eventLog.logEvent), and it will remain as a server side receive timestamp. For new events (via mw.eventLog.submit), just set dt to the client's timestamp.

Although, for non-analytics events, meta.dt is not necessarily the server side receive timestamp: EventBus sets this as the 'event time', e.g. when a revision was created. Ideally we'd use the external client side event time for meta.dt as well, but we can't because we can't trust the external clients to always set a meta.dt that makes sense or will be accepted by the system.

client_dt is already baked into a few schemas though, and is the only field declared by the fragment/analytics/common schema. I don't mind changing this, but is it really worth it? client_dt is pretty descriptive, no? If we were starting from scratch I'd say ok, fine, dt is client side time, but we already have client_dt which seems fine to me too. Is it worth changing things?

client_dt is already baked into a few schemas though, and is the only field declared by the fragment/analytics/common schema. I don't mind changing this, but is it really worth it? client_dt is pretty descriptive, no? If we were starting from scratch I'd say ok, fine, dt is client side time, but we already have client_dt which seems fine to me too. Is it worth changing things?

Yeah... these more basal schema / fields are going to have a long life (hopefully) and so decisions we make now will have a large impact, so it's worth it to me to adjust as much as possible while we have as few schema as possible and can manually intervene. There are some other things we're doing as well that are going to require adjustments to the few analytics schema that are out there, so it's in the same line with that. At the moment all of the analytics schema are either running side-by-side with a legacy schema (apps) and can have their table dropped, or are legacy migrations. The legacy migrations might have extra fields (e.g. client_dt) hanging around for a while, but after the migration, we'll gradually be upgrading these to use new patterns and stuff, so the old tables will probably be tossed or archived, anyway.

Yeah... these more basal schema / fields are going to have a long life (hopefully) and so decisions we make now will have a large impact,
The legacy migrations might have extra fields (e.g. client_dt) hanging around for a while, but after the migration...

I find it very unlikely that we will disable all legacy EventLogging streams anytime soon. I'd predict that some of these will be online and necessary for many years to come. If that is true, and we choose to make dt a client side timestamp for only new schemas, but leave it as a server side timestamp for legacy EventLogging schemas, won't we cause some confusion for a long time to come? Your points I quote here make me think that keeping client_dt as the official client side timestamp is the right thing to do.

But! I don't care that much! @mpopov you were originally proposing client_dt here, any thoughts?

Yeah... these more basal schema / fields are going to have a long life (hopefully) and so decisions we make now will have a large impact,
The legacy migrations might have extra fields (e.g. client_dt) hanging around for a while, but after the migration...

I find it very unlikely that we will disable all legacy EventLogging streams anytime soon.

I'm not talking about disabling legacy streams, I'm pointing out that we have plans, after migrations, to update many of these in a more comprehensive way that will likely result in alterations to their tables. Alterations that are extensive enough that for many of them, it is likely that they will receive an entirely new table, and the old one will be archived, or discarded. My point here is that if we discontinue use of client_dt, it will live on in these tables as a zombie field but it's fine.

If that is true, and we choose to make dt a client side timestamp for only new schemas, but leave it as a server side timestamp for legacy EventLogging schemas, won't we cause some confusion for a long time to come?

No, that is not what I'm saying. As of the merging of whatever patch we make, dt is the client-side timestamp for all analytics events, all the time. meta.dt is the server-side timestamp for all analytics events, all the time. Never the twain shall meet.

We will update the new-style client code path to set dt rather than meta.dt, allowing EventGate to set the server-side timestamp. We need to make changes to the analytics base schema to add a dt field, which is happening right now as part of a related effort to align schema practices. The patch you're working on can update the legacy code path to set the dt field to the client-side timestamp, and not set meta.dt. Maybe the problem your having is that you can't stop the submit method from setting meta.dt. We can jam on a workaround or just wait until the new schema with dt come online.

Ottomata added a subscriber: Pchelolo.EditedMon, Nov 9, 9:08 PM

Ok! Just discussed with Jason and I'm convinced. If we are going to do this right once and for all, we need to do it right for ALL events, not just externally emitted instrumentation events. This means for EventBus events too. Here's the plan:

  • dt is always a client AKA event timestamp.
  • meta.dt is always a server side receive timestamp.
  • Which timestamp field is used for Kafka index and Hive partitioning is configurable.

Right now, meta.dt is always used for partitioning. In an ideal world we'd always use the client's event timestamp. However, an external client might send any timestamp value, even one years in the future, so we can't trust them! So, we need to vary which timestamp we use for this, and since we have different EventGate instances and also Camus-HDFS ingestion jobs for different EventGate instances, we can configure those things appropriately:

  • eventgate-*-external instances will use the server side meta.dt timestamp for partitioning (since we can't trust the client's dt in this case for that), but fall back to dt if not meta.dt is not set.
  • All other instances will use client event dt timestamp for partitioning, but fall back to meta.dt if not set.

This is easy to do now for the instrumentation events. To do it for production MediaWiki events, we need to update EventBus and related schemas:

  • Add a dt field to all schemas.
  • EventBus should set dt, but not set meta.dt, which will allow EventGate to set it.

IIRC, some events do not use EventGate, they produce directly to Kafka. Those producers should set dt instead of meta.dt.

These changes should be relatively unnoticeable for users of MediaWiki production events, as the difference between the client and server receive timestamps will usually be miniscule.

@Pchelolo let me know what you think, and if we should discuss further. We're going to proceed with this plan for analytics events soon, but we can make the EventBus related changes at any time later.

Oh, @jlinehan for legacy EventLogging events, we then will make EventLogging JS not set dt, making it actually be a server side timestamp for the legacy events, right? The semantics of dt will change, but only for new instrumentations, not for existing event data, correct?

eventgate-*-external instances will use the server side meta.dt timestamp for partitioning (since we can't trust the client's dt in this case for that), but fall back to dt if not meta.dt is not set.

Is it wise to fall back to untrusted dt? wouldn't it be better to use current time as fallback?

@Pchelolo let me know what you think, and if we should discuss further. We're going to proceed with this plan for analytics events soon, but we can make the EventBus related changes at any time later.

In general - yeah, makes sense. I'm in doubt about naming... meta.dt vs dt is not a very obvious distinction, but without renaming existing meta.dt I can't come up with a better naming scheme. Would've been awesome if it was like meta.event_dt and meta.receive_dt or something... Not proposing to do it, just rambling. With good documentation it would be ok to have just dt everywhere.

Is it wise to fall back to untrusted dt? wouldn't it be better to use current time as fallback?

Hmmmm actually I started writing that originally but then thought it would cause problems. On second thought...yes. Both EventGate and Camus can fall back to current time if the field is missing.

Ok @jlinehan @mforns I updated my patches for legacy dt, please review:

@jlinehan, I'll let you or @Mholloway make the changes for non legacy schemas to use dt instead of client_dt. This will require schema changes for any of the existing schemas, and also a change to mw.eventLog.submit to set dt instead of client_dt, but shouldn't require any changes to eventgate-wikimedia.

I'll make a low priority subtask for modifying EventBus events and ingestion code to use either meta.dt or dt as stated above.

Restricted Application added a project: Analytics. · View Herald TranscriptTue, Nov 10, 3:37 PM

@jlinehan, I'll let you or @Mholloway make the changes for non legacy schemas to use dt instead of client_dt. This will require schema changes for any of the existing schemas, and also a change to mw.eventLog.submit to set dt instead of client_dt, but shouldn't require any changes to eventgate-wikimedia.

Yep, willll do!

fdans raised the priority of this task from Low to High.Mon, Nov 16, 4:28 PM
fdans moved this task from Incoming to Event Platform on the Analytics board.
fdans added a project: Analytics-Kanban.

Change 639621 merged by Ottomata:
[eventgate-wikimedia@master] Default dt to server side receive time for legacy EventLogging events

https://gerrit.wikimedia.org/r/639621

Change 641266 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-* - Bump eventgate-wikimedia version to 2020-11-16-212345-production

https://gerrit.wikimedia.org/r/641266

Change 641266 merged by Ottomata:
[operations/deployment-charts@master] eventgate-* - Bump eventgate-wikimedia version to 2020-11-16-212345-production

https://gerrit.wikimedia.org/r/641266

I've deployed the eventgate-wikimedia change in all staging clusters. I'll do the prod ones tomorrow. If that all looks good I'll merge the EventLogging change and schedule it for a backport window...or just wait for the train.

Change 639622 merged by Ottomata:
[mediawiki/extensions/EventLogging@master] Allow EventGate to default dt to server side receive time

https://gerrit.wikimedia.org/r/639622

Change 641799 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/extensions/EventLogging@wmf/1.36.0-wmf.16] Allow EventGate to default dt to server side receive time

https://gerrit.wikimedia.org/r/641799

Change 641799 abandoned by Ottomata:
[mediawiki/extensions/EventLogging@wmf/1.36.0-wmf.16] Allow EventGate to default dt to server side receive time

Reason:
bad change-id

https://gerrit.wikimedia.org/r/641799

Change 641801 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/extensions/EventLogging@wmf/1.36.0-wmf.16] Allow EventGate to default dt to server side receive time

https://gerrit.wikimedia.org/r/641801

Change 641768 had a related patch set uploaded (by Urbanecm; owner: Ottomata):
[mediawiki/extensions/EventLogging@wmf/1.36.0-wmf.18] Allow EventGate to default dt to server side receive time

https://gerrit.wikimedia.org/r/641768

Change 641801 merged by jenkins-bot:
[mediawiki/extensions/EventLogging@wmf/1.36.0-wmf.16] Allow EventGate to default dt to server side receive time

https://gerrit.wikimedia.org/r/641801

Change 641768 merged by jenkins-bot:
[mediawiki/extensions/EventLogging@wmf/1.36.0-wmf.18] Allow EventGate to default dt to server side receive time

https://gerrit.wikimedia.org/r/641768

Mentioned in SAL (#wikimedia-operations) [2020-11-18T19:45:21Z] <otto@deploy1001> Synchronized php-1.36.0-wmf.18/extensions/EventLogging/modules/ext.eventLogging/core.js: EventLogging legacy events should use dt as server side receive time - T240460 (duration: 01m 07s)

Mentioned in SAL (#wikimedia-operations) [2020-11-18T19:48:46Z] <otto@deploy1001> Synchronized php-1.36.0-wmf.16/extensions/EventLogging/modules/ext.eventLogging/core.js: EventLogging legacy events should use dt as server side receive time - T240460 (duration: 01m 06s)

FYI @nettrom_WMF, I just deployed the change today to make legacy EventLogging events have dt be a server side receive time (this should match exactly meta.dt). They will still have client_dt as the client side timestamp as well.

Also FYI, in the near future in non-legacy schemas in event platform dt will be a client side timestamp, and meta.dt will be server side receive time.