
Develop strategy for mitigating degenerate client timestamps in event data
Open, Needs Triage, Public

Description

The client_dt field contains a timestamp that is derived from a software client. Most clients provide no authoritative source of time, and therefore it is not uncommon for them to report times that are a significant distance in the past or future. There are standard strategies for mitigating these problems; we need to adopt one and decide where to locate the intervention.

Event Timeline

To mitigate the significant variance in client-generated timestamps, could we establish a beacon that returns server-generated timestamps in coordination with the device-initiated burst requests? It might be much safer to use the server as the source of truth.

Events coming through EventGate that do not set meta.dt (the EventLogging extension does not) will have meta.dt set to the time at which EventGate receives the event. client_dt is set by clients. So the data has both the server receive time and the client's event time.
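Since both timestamps are present on each event, the gap between them can be computed per event. Below is a minimal sketch, assuming an event object with a top-level client_dt and a nested meta.dt, both ISO 8601 strings. The helper names and the 24-hour threshold are hypothetical and chosen only for illustration; they are not part of EventGate or EventLogging.

```javascript
// Hypothetical helper: client-reported event time minus server receipt time,
// in milliseconds. Negative values mean the client time precedes receipt,
// which is expected for events that are queued and sent later in a burst.
function clientSkewMs(event) {
  const clientMs = Date.parse(event.client_dt);
  const serverMs = Date.parse(event.meta.dt);
  if (Number.isNaN(clientMs) || Number.isNaN(serverMs)) {
    return null; // one of the timestamps is missing or unparseable
  }
  return clientMs - serverMs;
}

// Assumed threshold (not from the task): treat anything more than 24 hours
// away from the server receipt time as degenerate.
const MAX_PLAUSIBLE_SKEW_MS = 24 * 60 * 60 * 1000;

// Hypothetical helper: flags events whose client_dt is unparseable or
// implausibly far from meta.dt.
function isDegenerate(event, maxSkewMs = MAX_PLAUSIBLE_SKEW_MS) {
  const skew = clientSkewMs(event);
  return skew === null || Math.abs(skew) > maxSkewMs;
}
```

Counting the events for which isDegenerate returns true would give a rough measure of how often bad client clocks occur, though the right threshold depends on how long clients are allowed to queue events before sending.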

Hourly partitioning in Hive uses meta.dt.

@Ottomata: for clarification, what @dcipoletti is talking about is changing where client_dt gets that client-side timestamp from. It would still be the client-side time of when the event was generated, but instead of trusting the client to have an approximately accurate and up-to-date time we could potentially:

  1. query something like https://wikimedia.org/api/rest_v1/time, possibly POSTing the client-side time with the request so that the response can be used for latency adjustment, and store the {client-side, server-side} pair for converting a client-side time (which may be in the past or future) to server-side time (which we trust)
  2. instead of setting client_dt to new Date().toISOString(), set it to mapClientSideToServerSide(new Date()).toISOString()

So client_dt is still the client-side timestamp but more trustworthy, meta.dt is still set by EventGate on receipt.
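A rough sketch of what the two steps could look like. This assumes the time endpoint returns the server's current time as epoch milliseconds; no such endpoint is confirmed to exist, so the fetch is passed in as a hypothetical fetchServerTimeMs callback. The offset estimate is the usual midpoint approximation (assume the server stamped its time halfway through the round trip), which is a swapped-in technique for the "latency adjustments" mentioned above, not a settled design.

```javascript
// Estimate (server clock - client clock) in ms from one request/response
// pair, assuming the server read its clock halfway through the round trip.
function estimateOffsetMs(clientSentMs, clientReceivedMs, serverMs) {
  const midpointMs = (clientSentMs + clientReceivedMs) / 2;
  return serverMs - midpointMs;
}

// Build the mapping function from step 2 for a given stored offset.
function makeClientToServerMapper(offsetMs) {
  return function mapClientSideToServerSide(clientDate) {
    return new Date(clientDate.getTime() + Math.round(offsetMs));
  };
}

// Hypothetical calibration step: fetchServerTimeMs is an assumed async
// callback that resolves to the server's time in epoch milliseconds.
// The clock function is injectable so the logic can be exercised offline.
async function calibrate(fetchServerTimeMs, now = Date.now) {
  const sentMs = now();
  const serverMs = await fetchServerTimeMs();
  const receivedMs = now();
  return makeClientToServerMapper(
    estimateOffsetMs(sentMs, receivedMs, serverMs)
  );
}
```

With a mapper obtained from calibrate, the client would set client_dt to mapClientSideToServerSide(new Date()).toISOString(). One caveat: if the device clock is changed after calibration, the stored offset goes stale, so the calibration would need to be repeated periodically or per session.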

Adding #Product-Infrastructure-Team-Backlog because the Better Use Of Data/Product-Data-Infrastructure project tags were archived; this way the open task has an active project tag and can be found.

A friendly bump.

…and therefore it is not uncommon for them to report times that are a significant distance in the past or future.

Did we/do we have a measure for how much this is happening and to what degree?

I should have said that the reason for the bump is that I'm sieving the backlog for tasks to guide improving and/or breaking apart the mediawiki/metrics_event schema, and this caught my attention.