
Session Length Metric. Web implementation
Closed, Resolved · Public

Description

Implementing session length on top of refactored eventlogging code that can post events to eventgate.

  • session length schema creation: T254897
  • oozie job to compute session length metrics based on heartbeat per wiki

We need to decide whether to sample the data or, at the beginning, just enable the computation on one wiki.

Related Objects

Status – Assigned
  Open – None
  Open – Ottomata
  Open – jlinehan
  Open – jlinehan
  Open – jlinehan
  Resolved – kzimmerman
  Resolved – jlinehan
  Open – None
  Resolved – jlinehan
  Resolved – kzimmerman
  Resolved – sdkim
  Resolved – jlinehan
  Resolved – cchen
  Resolved – kzimmerman
  Resolved – jlinehan
  Resolved – jlinehan
  Resolved – mpopov
  Resolved – sdkim
  Resolved – Gilles
  Resolved – Mholloway

Event Timeline

jlinehan triaged this task as Medium priority. (Apr 8 2020, 3:49 PM)

Change 604051 had a related patch set uploaded (by Jason Linehan; owner: Jason Linehan):
[mediawiki/extensions/WikimediaEvents@master] WIP: Adds sessionLength instrumentation

https://gerrit.wikimedia.org/r/604051

@jlinehan If you haven't already, you may want to look into ReadingDepth and its instrumentation. It seems closely related to this. I thought it was a thing of the past, but it's actually still live in WikimediaEvents.

Adding some long-overdue updates here for further discussion. I built 3 different versions of this instrument, along with a test rig to examine the error in each. The instrument is, abstractly, a clock which runs as long as at least one same-origin page is visible, and at regular intervals along this clock's progress, it emits a tick.

There are minor difficulties with this related to reliably determining when a page is visible, how to measure time, how to persist data values, etc., but these are relatively run-of-the-mill decisions we're all used to making. No, what lets the darkness in is the fact that you can have more than one page visible at a time (think of a user who has more than one window open at a time to look at the same or different articles). If you do a case analysis, you'll find that one or two cases can introduce an error into the clock (it will appear to run slow). Analyzing, understanding, and determining how to mitigate this error has comprised the bulk of the development lately.

Three approaches were tried, namely:

  1. Minimize error subject to using only page-local relative time (e.g. performance.now()) and only one bookkeeping cookie. Relies on the fact that the error should be introduced only infrequently.
  2. Sacrifice the error-free operation of the clock in the best case in order to make the expected value of the error zero, by making the clock run both fast and slow in a precise way.
  3. Use client-local absolute time (e.g. Date.now()) and three bookkeeping cookies in order to eliminate the error, at the price of complexity, of causing all pages to fire at nearly the same time, and of introducing a new source of error (timezone changes), which is admittedly easier to handle.

All three "work," subject to their stated limitations, but after analysis and some tests I think (1) is working well enough that I can't justify the downsides of approach (2) or (3). There is a more complete analysis of the error for (1) in the top comment of the patch code.

I also built a test rig that will simulate having 0-n pages visible or hidden and transition between states with some probability. It can be loaded with different clock types and roughly simulate user behavior, in order to get an idea of the error in how each clock ticks versus the actual visible time as measured by the rig. This was helpful for setting a baseline, and might be useful for running tests on this instrument. I'll likely try to patch it in.
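The core idea of such a rig can be sketched roughly as follows (a toy Python model, not the actual rig; all names are made up). It shows why lead election matters: a clock with no lead election accumulates visible time once per visible page, so with N pages visible it runs N times too fast.

```python
import random

def simulate(steps, n_pages, p_toggle, seed=0):
    """Toy model: n same-origin pages randomly toggle visibility.

    Returns (true_time, naive_time):
      true_time  - steps during which at least one page was visible
                   (what the session-length clock should measure);
      naive_time - visible time summed over all pages (what a clock
                   with no lead election would measure).
    """
    rng = random.Random(seed)
    visible = [False] * n_pages
    true_time = naive_time = 0
    for _ in range(steps):
        for i in range(n_pages):
            if rng.random() < p_toggle:
                visible[i] = not visible[i]
        true_time += 1 if any(visible) else 0
        naive_time += sum(visible)
    return true_time, naive_time
```

With a single page the two counts agree exactly; with several pages the naive count overshoots, which is the error the lead-election logic is there to remove.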

Here is some plotted output from the test rig (clock A corresponds to design (1) above, and clock B corresponds to design (3)).

trial7.png (480×640 px, 39 KB)

trial8.png (480×640 px, 44 KB)

A new patch implementing design (1) has been uploaded. We can discuss whether this is going to be sufficient, whether to use another design, or perhaps something not yet considered, and go from there.

I am a bit lost here as to why we need clock synchronization at all. If we can assess which tab is visible with the Page Visibility API, and we use local storage to save the "last tick", local storage is our synchronization point: when the new tab picks up, it reads the last "tick". Probably I am missing something here. Ticks are not timestamps but rather just integers.

If two browser windows are visible at the same time, only one needs to gain control to send "ticks". And for that, again, we can use local storage as a synchronization point, as it is shared across tabs and windows; whatever window grabs the lock first is the one sending ticks. There is also the BroadcastChannel object, but it does not have wide support.

Can we re-state the problem we are trying to solve a bit more in depth?

I am a bit lost here as to why we need clock synchronization at all.

No, we aren't doing synchronization.

If we can assess which tab is visible with the Page Visibility API, and we use local storage to save the "last tick", local storage is our synchronization point: when the new tab picks up, it reads the last "tick".

Yes, this is a base case for the clock and has no error. There are two other scenarios (see the diagrams in the comment at bottom) in which error does occur, however. I don't think these cases are immediately obvious (they weren't to me), so please ask more questions.

If two browser windows are visible at the same time, only one needs to gain control to send "ticks". And for that, again, we can use local storage as a synchronization point, as it is shared across tabs and windows; whatever window grabs the lock first is the one sending ticks.

You don't actually need a lock variable because the tick count itself serves as a lock. If a page records the tick each time it cycles, then at its next cycle it can compare the tick value now to the tick value it saw last time. If they are the same, then it "has" the lock. If they are different, then it does not have the lock. So the lock variable can be eliminated WLOG.
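To make the tick-count-as-lock idea concrete, here is a toy sketch (Python standing in for the client code, with a dict standing in for the shared cookie; all names are illustrative, not the patch code):

```python
# Shared store standing in for the cookie jar (illustrative only).
store = {"tick": 0}

class Page:
    """A same-origin page that remembers the tick value it last saw."""

    def __init__(self):
        self.last_seen = store["tick"]

    def cycle(self):
        """Return True if this page acted as the lead and ticked."""
        if store["tick"] == self.last_seen:
            # No other page has ticked since we last looked: we "have"
            # the lock, so advance the shared tick count ourselves.
            store["tick"] += 1
            self.last_seen = store["tick"]
            return True
        # Another page ticked first; just catch up and yield the lead.
        self.last_seen = store["tick"]
        return False
```

Here the tick count itself is the lock: a page that finds the shared value unchanged since its last cycle knows no other page has ticked, so it may tick.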

we can use local storage as a synchro point as it is shared across tabs and windows ... there is also the BroadcastChannel object but it does not have wide support.

I avoided using BroadcastChannel or any of the other stuff that is not widely supported.

It is possible to solve this problem using localStorage rather than cookies, with the key benefit being to use the storage event in order to do some primitive IPC, but we decided on cookies over localStorage due to compatibility, consistency with our past decisions on localStorage, and the fact that cookies allow us to easily implement the desired session TTL. I also don't know that we are using the storage event elsewhere, how solid support is, or whether it actually reduces the complexity of the instrument.

In the absence of BroadcastChannel or the storage event to facilitate asynchronous IPC, we land in the present situation, which is not really so bad in my opinion. It has certain advantages. It's just that, to reach the conclusion that it wasn't so bad, it took some time to analyze exactly what was happening so I could bound it.

I'm pasting the comment in the patch which has some diagrams of the cases that introduce an error.

/*!
 * Track session length
 *
 * Description
 * -----------
 * This module implements a clock that does its best to tick once every
 * certain number of milliseconds that at least one same-origin page is
 * visible.
 *
 * If only one page was visible at a time, the algorithm would be simple and
 * without error. However, there may be multiple same-origin pages which are
 * visible at the same time. We designate the first page to tick the "lead",
 * and maintain two cookies: one holding the current tick count, and another
 * holding time accumulated towards a tick.
 *
 * Ignoring fluctuations in duty time and latency due to other factors in the
 * browser environment, the clock will behave well. The clock can run slow but
 * will not run fast. In the majority of cases, there will be no error at all,
 * but in certain situations, error can be introduced. In exchange for this
 * error, we gain the following advantages:
 *
 *  - Only two bookkeeping values need to be persisted
 *  - Only local monotonic time is necessary (no time zones etc.)
 *
 * Error analysis
 * --------------
 * The worst-case error for a session is:
 *
 *      interval * ( ( # times [1] occurs ) + ( # times [2] occurs ) ),
 *
 * where [1] and [2] are specific events defined as:
 *
 *  [1] Page X becomes visible while lead page L is visible.
 *      Lead page L subsequently becomes hidden, and X becomes
 *      the new lead. The error is the difference between the
 *      time X became visible and the time L last ticked.
 *
 *                becomes     would have
 *         ticks   hidden     ticked
 *           |         \      /
 *      L    o==========x----o
 *      X          o===============o
 *                  \               \
 *                  becomes       ticks as
 *                  visible       new leader
 *
 *           |---------------|-----|     ERROR as it appears
 *                 delay      error
 *                                       REASON:
 *           |-----|---------------|     X was started with a phase offset
 *            phase     delay            to L.
 *
 *           -------- TIME ------- >
 *
 *  [2] Page X becomes visible after lead page L has become hidden,
 *      but while another page Y, started before L was hidden, is
 *      still visible. The error will be the difference between the
 *      time X became visible and the time L became hidden.
 *
 *                becomes    would have  would have
 *         ticks  hidden     ticked      ticked (but X ticked first)
 *           |      \         /     _____/
 *      L    o=======x-------o     /
 *      Y         o===============o
 *      X          \    o=======o
 *                  \  /         \
 *                  becomes      ticks as
 *                  visible      new leader
 *
 *           |---------------|--|        ERROR as it appears
 *                 delay     error
 *                                       REASON:
 *           |-------|--|-------|        X loads the time that L spent
 *           progress gap remainder      visible, but doesn't account for
 *                                       the gap having been filled by Y.
 *           -------- TIME ------- >
 *
 * Note that:
 *  - On mobile browsers, it is generally not possible to have more than
 *    one page visible at a time, therefore [1] and [2] cannot occur, and
 *    the worst-case error is 0.
 *
 *  - For any user that does not have more than one page visible at a time,
 *    worst-case error is similarly 0.
 *
 *  - If [1] or [2] are expected to occur, then the interval size will
 *    magnify the potential error.
 *
 *  - The interval size is an upper limit to the amount of error that can
 *    be introduced by either [1] or [2]. In reality, the bound is more
 *    like interval - c for some small c, with the expected value depending
 *    on the distribution of page visible and page hidden events.
 */

Hi @jlinehan!

I reviewed your code, and after a while managed to understand the complexity of the situation. Thanks for the detailed docs, they were super helpful!

Now, I have a question on the sessionLength metric itself:
I understand that the proposed code calculates the user's screen time, and sends heartbeat events once the screen time has reached certain thresholds, correct?

If so, consider the following case: A user is writing an essay, and looks at the wiki tab for 20 seconds every 2 minutes during 1 hour. The rest of the time they are looking at their text editor, and the browser is hidden.

The proposed code will accumulate those 20 seconds every couple minutes. After the whole hour has passed, it will have sent about 30 heartbeats (considering an interval of 20 seconds). Then, when we calculate the session length in the back end, we'll multiply 30 heartbeats times 20 seconds, and will get a session length of 10 minutes approx.

Shouldn't the session length for that use case be 60 minutes instead? Given that the user has been browsing the wiki (on and off) for 1 hour to complete their task? Or do we prefer the screen time as our metric here?

I still have a few questions about the code being quite complicated for the edge cases, and I really wonder if we need to account for those at all in the first instance, or whether we should rather implement the most common case and later look at data to tell us how prevalent the other edge cases are.

Shouldn't the session length for that use case be 60 minutes instead?

On a traditional definition of session, no. Inactivity for, say, 30 minutes (probably less) means the session has ended. A "session" is a continuous set of actions that happen in close sequence.

On a traditional definition of session, no. Inactivity for, say, 30 minutes (probably less) means the session has ended. A "session" is a continuous set of actions that happen in close sequence.

But in the example, the user is reading the wiki page for 20 seconds every 2 minutes, so it has no such inactivity period, right?

Shouldn't the session length for that use case be 60 minutes instead? Given that the user has been browsing the wiki (on and off) for 1 hour to complete their task? Or do we prefer the screen time as our metric here?

Great observation. This instrument (as it's written right now) would more accurately be called SessionVisibleLength.

My take is that what leadership wants is eyeballs-on-screen-time, which is what this instrument does. From a traditional advertising perspective, this is all that matters. BUT, the overall length of a session could also be important to know.

@Nuria, were we trying to capture eyeballs-on-screen-time? If so, I think we should rename this SessionVisibleLength and press ahead with CR, and add a task to record SessionLength like @mforns is talking about later.

I still have a few questions about the code being quite complicated for the edge cases, and I really wonder if we need to account for those at all in the first instance, or whether we should rather implement the most common case and later look at data to tell us how prevalent the other edge cases are.

I agree with this approach, but currently the code actually isn't compensating for any edge case but one. The diagrams in the documentation just explain what happens as a result of those edge cases not being compensated for.

The only edge case it prevents is that it makes sure that when two or more pages are on screen, only one is able to drive the clock. If we remove that, all that will happen is that while the user has two pages open, the clock will run twice as fast, three pages open, three times as fast, etc. You could actually probably make an argument that hey, this makes some kind of sense. I'm open to exploring it.

[marcel] But in the example, the user is reading the wiki page for 20 seconds every 2 minutes, so it has no such inactivity period, right?

ah, yes, you are totally right. My mistake.

[jason] were we trying to capture eyeballs-on-screen-time?

mmm, no, I do not think so. SessionLength is what we are after, please see: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/SessionLength#Session_Length_Metric

SessionLength can be loosely defined as "time between first and last interaction". This is the common definition used by providers of that metric, for example Google Analytics. I do not think it would be wise to move away from this common understanding of what the metric means. For example, a common boundary for a session is 30 minutes of inactivity.

Please see: https://support.google.com/analytics/answer/2731565?hl=e

Tab browsing brings some challenges that could be addressed by tapping into the Visibility API, but that might not be needed in all implementations. In all cases, I think, you need to deal with the challenge of how "inactivity" is defined.

were we trying to capture eyeballs-on-screen-time?

mmm, no, I do not think so. SessionLength is what we are after, please see: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/SessionLength#Session_Length_Metric

Yep, sounds like some drift happened here on my part. I'll make the necessary modifications and submit a new patchset. We'll see how much this changes the implementation. Thanks @mforns, good eye!

Thanks @jlinehan for the clarification.

I believe (but might be wrong) that the session length metric is a bit simpler to instrument than the screen time one.
Here's a rough implementation idea:

There are 2 cookies, which store the last heartbeat that was sent and its timestamp.

When a page becomes visible:
 - read cookies
 - if cookies do not exist, or timestampCookie is older than 15/30 minutes ago:
     - send a heartbeat 0.
     - update the cookies with: heartbeatCookie=0 and timestampCookie=<now>
 - start a chain of setTimeouts that acts as a clock that ticks every N seconds (N < heartbeatInterval).

When a page becomes hidden:
 - destroy clock

At every clock tick:
 - read cookies
 - determine pending heartbeat events to be sent[1], for each:
     - send pending heartbeat event
 - update cookies[2].

---
[1] Use heartbeatCookie, timestampCookie, heartbeatInterval and <now> to calculate those, i.e.:
Suppose heartbeatCookie=0, timestampCookie=0, heartbeatInterval=10 and <now>=35. We should send the following heartbeat events:
(heartbeat=1, timestamp=10)
(heartbeat=2, timestamp=20)
(heartbeat=3, timestamp=30)
[2] And then update the cookies to: heartbeatCookie=3 and timestampCookie=30
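The footnote's bookkeeping can be sketched as follows (a toy Python illustration with made-up names, not the patch code):

```python
def pending_heartbeats(heartbeat_cookie, timestamp_cookie, interval, now):
    """Yield the (heartbeat, timestamp) events owed since the last update."""
    elapsed = now - timestamp_cookie
    for n in range(1, elapsed // interval + 1):
        yield (heartbeat_cookie + n, timestamp_cookie + n * interval)

# With the footnote's numbers: heartbeat=0, timestamp=0, interval=10, now=35:
events = list(pending_heartbeats(0, 0, 10, 35))
# events == [(1, 10), (2, 20), (3, 30)]
# The cookies would then be updated to heartbeatCookie=3, timestampCookie=30.
```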

There are still some potential race conditions if the user is viewing more than one page simultaneously *and* their clocks happen to tick at exactly the same time. This case would generate some duplicated heartbeat events. But I believe it would happen very rarely, and wouldn't affect the metric significantly.
Thoughts?

As an improvement to the algorithm above, I'd remember we spoke about having exponential heartbeat intervals, so that we send fewer events for long sessions. Maybe not suited for a first version of the code, but leaving it here as a note. For example, we could use powers of 2 or powers of 3: the first heartbeat is sent at T0+3^0 = T0+1, the next at T0+3^1 = T0+3, the next at T0+3^2 = T0+9, the next at T0+3^3 = T0+27, and so on. We don't need to keep extra state, because the heartbeat itself gives us the value for the next interval. Plus, I believe we don't lose precision, because for long sessions we're not interested in whether the session is 40min20sec or 40min40sec long; it's the same. In the back-end calculation we only need to know the exponential factor (i.e. 3) to be able to reconstruct the values.
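A quick sketch of the reconstruction idea, assuming (illustratively) that heartbeats are numbered from 0 and heartbeat k fires at T0 + 3^k seconds:

```python
FACTOR = 3  # the exponential factor the back end needs to know

def heartbeat_times(n):
    """Seconds after T0 at which the first n heartbeats fire."""
    return [FACTOR ** k for k in range(n)]

def session_length_lower_bound(last_heartbeat):
    """Knowing only the highest heartbeat number received, the back end
    can reconstruct a lower bound on the session length in seconds."""
    return FACTOR ** last_heartbeat

# heartbeat_times(4) == [1, 3, 9, 27]
# A session whose last heartbeat was number 3 lasted at least 27 seconds.
```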

I believe (but might be wrong) that the session length metric is a bit simpler to instrument than the screen time one.

Yes I believe it is, and the algorithm you described should get us there. If any complications come up, I'll follow up here after posting a patchset.

There's still some potential race conditions if the user is visualizing more than one page simultaneously *and* their clocks happen to tick at the exact same time.

Yeah, this is an epsilon IMO, no big deal.

The challenge is how to define inactivity such that cookies are reset. A window open for the whole night is not a "whole night" session, for example. We need a notion of an "action taken". If the last action taken by the user was, say, less than 30 minutes ago, the session continues to tick. Windows in which no action has been taken for, say, 30 minutes need to stop ticking.

There are many gotchas on mobile, as windows might never "close": they might be suspended and moved to the background, and that, also, indicates a session has ended. As @Krinkle mentioned earlier, there are lessons to be learned here from the past.

The challenge is how to define inactivity such that cookies are reset. A window open for the whole night is not a "whole night" session, for example. We need a notion of an "action taken".

Yes, alas, this is the conundrum here. We presently have no programmatic notion of "activity" in place yet. The closest approximation we have is visibility, which, as you're pointing out, is not always going to cut it. Certainly, if the page is not visible, then no actions are being taken. However, it's not true that if the page is visible, an action is being taken. But the Visibility API is more-or-less reliable, and this is probably part of why the spec drifted to something like SessionVisibleLength when I was talking with various people.

There are many gotchas on mobile, as windows might never "close": they might be suspended and moved to the background, and that, also, indicates a session has ended. As @Krinkle mentioned earlier, there are lessons to be learned here from the past.

Yeah, the BUOD group is planning to build some notion of session that captures inactivity and the other transitions that cause end-of-session, which we were going to back-port for this instrument to use, but it's not currently there.

I actually drew up a sketch and a patch to do this, I'll find it and post it on this ticket and the dedicated session ticket. But is that something we want to do now? It will block this patch until we can implement it.

If we don't do that, we are faced with a choice between accepting that some sessions will be really long if people leave their windows open, or implementing SessionVisibleLength, which, as pointed out, is not exactly what we intended to capture (though it's worth noting that the backend processing will work the same for any monotonic process).

I have no feelings either way, we can do any of these at the highest possible priority to get the feature shipped.

@Nuria, the algorithm as proposed does stop ticking when the page is not visible.
However, as you pointed out, if someone opens Wikipedia and goes to have lunch, we'd be counting a very long session erroneously (if the screensaver does not trigger the visibility=hidden event).

To solve that, we can drop the clock, and instead check for pending heartbeats whenever we see other events happening, like: mouseHover, mouseScroll, etc. (and of course, page load).

The BUOD group is planning to build some notion of session that captures inactivity, which we were going to back-port for this instrument to use, but it's not currently there.

I do not think a big definition is needed, there are, again, common paradigms for what an "interaction" is. Please see: https://support.google.com/analytics/answer/1033068#NonInteractionEvents

I think we have a couple of options:

  1. initially implement Marcel's suggested algorithm, knowing there is this caveat (which on the back end can be suppressed by removing outliers/the long tail)
  2. listen to every single event the browser is bubbling; if nothing happened in the last 30 minutes, we clear state. This method is, I believe, what is used in GA.

Let's please start a new patch going forward as the history of current patch to date would make things unnecessarily confusing.

The BUOD group is planning to build some notion of session that captures inactivity, which we were going to back-port for this instrument to use, but it's not currently there.

I do not think a big definition is needed, there are, again, common paradigms for what an "interaction" is. Please see: https://support.google.com/analytics/answer/1033068#NonInteractionEvents

Yeah sorry for my loose language, I didn't mean definition, I meant a piece of code that implements the definition in a way that other code doesn't need to use it. For example, something that either resets the sessionId cookie that MediaWiki memoizes, or sends an event to mw.trackQueue signalling that the session has ended, etc., once the conditions for a session refresh were met.

I think we have a couple of options:

  1. initially implement Marcel's suggested algorithm, knowing there is this caveat (which on the back end can be suppressed by removing outliers/the long tail)

It makes sense to me because it is easily modified to use a more accurate notion of session that may come down the road. We will still be trimming sessions when the pages become hidden for too long, but will have to accept that people who leave a page open for a long time will show a very long session. It's possible that this will make the length distribution look bimodal, and we can just threshold it out if it looks clean enough. I don't know.

  2. listen to every single event the browser is bubbling; if nothing happened in the last 30 minutes, we clear state. This method is, I believe, what is used in GA.

I believe that GA actually uses the events that the GA analytics stack is sending as its 'activity' indicators, rather than browser events. I don't think our instrumentation coverage is sufficient to approximate activity very well with this approach though, and it wouldn't be consistent across pages.

If using browser events, the way I usually see it done is that certain events, e.g. scroll, mousemove, keydown, etc., get listened for. But besides keydown, these are pretty high-rate events, and this used to be a big no-no; maybe I'm behind the times, though. Pinging @Krinkle for his opinion.

If performance with continuous listening is an issue, my idea was to tweak the definition of session slightly from "ends after n minutes of no activity" to "ends the first time a user fails to interact at least once every n minutes." These are almost, but not quite, the same. In the second one, a window becomes idle at time kn if, on the time interval [(k-1)n, kn], none of the following occur: 'visibilitychange', 'scroll', 'keydown', 'mousemove', etc., whatever events you want.

Changing the definition in this way would mean that we only need to attach a handler once every n minutes, and only capture one event before unbinding the handler. I'm sure there are other approaches, though. Again, this seems touchy enough that it could be built in a separate patch, which would then be utilized by this one (and other instruments as well).
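To illustrate the windowed definition (a toy sketch with made-up names and arbitrary time units, not proposed patch code):

```python
def first_idle_window_end(events, n, horizon):
    """Return the time at which the session goes idle under the windowed
    definition: the end of the first interval [(k-1)n, kn) containing no
    interaction event, or None if no such window occurs before horizon."""
    k = 1
    while k * n <= horizon:
        window_has_event = any((k - 1) * n <= t < k * n for t in events)
        if not window_has_event:
            return k * n
        k += 1
    return None

# Events at t=1, 4, 11 with n=5: windows [0,5) and [10,15) contain
# events, but [5,10) does not, so the session goes idle at t=10.
```

The two definitions really are almost but not quite the same: with events at t=3 and t=7 and n=5, the plain gap definition ends the session at t=12 (5 units after the last event), while the windowed one ends it at t=15 (the first empty window is [10,15)).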

I uploaded a new patchset (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/604051/15) implementing the algorithm outlined by @mforns. We have open questions about controlling session duration and, to a lesser extent, client clock shifts which we still need to work through.

@jlinehan; some suggestions

  1. let's start a new patch to avoid unnecessary confusion
  2. let's apply lessons learned in the past. This code needs to work on mobile and desktop. On mobile devices the Page Visibility API might not work how you would expect. I think the code to "catch" visibility changes needs to be more robust. Please see: https://github.com/wikimedia/mediawiki-extensions-EventLogging/commit/5b750c24babd5409f24a7cc4c3d7c8ce5ad0097f#diff-a83abaf29fef6f655aebb83ec875cf76R53

Please consider that for mobile you might need to use the page lifecycle: https://developers.google.com/web/updates/2018/07/nic68#page-lifecycle

Before submitting code let's please make sure it does work on mobile devices that might be sending the page to the background, let's test for mobile explicitly.

@jlinehan; some suggestions

  1. let's start a new patch to avoid unnecessary confusion

I don't mind, but maybe I don't understand what is confusing?

  2. let's apply lessons learned in the past. This code needs to work on mobile and desktop. On mobile devices the Page Visibility API might not work how you would expect. I think the code to "catch" visibility changes needs to be more robust. Please see: https://github.com/wikimedia/mediawiki-extensions-EventLogging/commit/5b750c24babd5409f24a7cc4c3d7c8ce5ad0097f#diff-a83abaf29fef6f655aebb83ec875cf76R53

This is helpful, but is probably better to post as a comment on the patch CR, no?

Before submitting code let's please make sure it does work on mobile devices that might be sending the page to the background, let's test for mobile explicitly.

I agree that we need to do this before *merging* code, and that was always my intention. But particularly as we are working through the rough details of how this code functions, in the interest of getting code into CR faster as per your own feedback, and with the code being marked 'WIP', I think this is not an entirely reasonable expectation. We will of course be testing for mobile explicitly and, before anything goes live, will have ensured as many cases as possible are operating as intended.

This is helpful, but is probably better to post as a comment on the patch CR, no?

Indeed! I added a comment to the Gerrit patch.

I don't mind, but maybe I don't understand what is confusing?

Since many of the comments will no longer apply when we rework the algorithm to be the one @mforns suggested earlier (plus the mobile-specific constraints), it is probably best to start fresh.

Regarding comments on when to terminate a session:

There are many papers on this for Wikipedia specifically, limits are normally set to 30 mins to an hour to render a session "inactive".

Leila's research uses one hour windows for example: https://arxiv.org/pdf/1702.05379.

I suggest we go first with the more conservative 30-minute approach that GA follows (https://support.google.com/analytics/answer/2731565?hl=en), which is also present in other Wikipedia papers: https://labtomarket.files.wordpress.com/2018/01/wiki.pdf

Change 604051 abandoned by Jason Linehan:
[mediawiki/extensions/WikimediaEvents@master] WIP: Adds sessionTick instrument for measuring session length

Reason:
Trying different approach

https://gerrit.wikimedia.org/r/604051

Change 619479 had a related patch set uploaded (by Jason Linehan; owner: Jason Linehan):
[mediawiki/extensions/WikimediaEvents@master] [WIP] Adds sessionTick instrument to measure session length.

https://gerrit.wikimedia.org/r/619479

I don't mind, but maybe I don't understand what is confusing?

Since many of the comments will no longer apply when we rework the algorithm to be the one @mforns suggested earlier (plus the mobile-specific constraints), it is probably best to start fresh.

Made a new patch. I also added back the checks for visibilitychange and document.hidden vendor prefixes that had been removed in a prior CR.

I suggest we go first with the more conservative 30 minute approach that GA follows (https://support.google.com/analytics/answer/2731565?hl=en)

I set it to 30 minutes, and made it a configuration variable so we can decide differently later.

I left notes on the patch, the algorithm works fine but we need to focus in on performance for detecting page activity in order to idle the page. Once we get that sorted out, we should be close to done.

Hi @jlinehan!
Here's the continuation of the Gerrit discussion about using seconds elapsed instead of ticks, a.k.a. msInterval=1000.

My main point is in response to your comment:

Sending elapsed seconds rather than ticks will mean discarding the histogram-like properties of the data structure on the backend.

I think that is not true. In the back-end we can translate seconds-elapsed measurements to ticks or histogram-like structures any time.
For instance: Imagine a client has sent 3 session interval events:

(startSecond=0, endSecond=45)
(startSecond=45, endSecond=326)
(startSecond=326, endSecond=1729)

If we assume tickInterval=60sec, then we can easily transform the intervals above into ticks, by outputting the ticks each interval comprises:

(startSecond=0, endSecond=45)      => []
(startSecond=45, endSecond=326)    => [t1, t2, t3, t4, t5]
(startSecond=326, endSecond=1729)  => [t6, t7, ..., t28]

A way of calculating that could be:

tick_interval = 60

def interval_to_ticks(start_second, end_second):
    # Integer division: tick t fires at second t * tick_interval.
    first_tick = start_second // tick_interval + 1
    last_tick = end_second // tick_interval
    for i in range(first_tick, last_tick + 1):
        yield i

I believe moving the bucketing of time-to-ticks from the client to the back-end is an advantage.
It allows us to change the value of tick_interval and apply it retroactively to the whole data set,
without having to change instrumentation, or maintain different versions of the data.
This way, we could even have different studies on the same data set with different tick_interval values each.
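A self-contained sanity check of the mapping above (this sketch uses integer division, since Python 3's `/` is float division, and returns a list rather than a generator for easy inspection):

```python
tick_interval = 60  # seconds per tick

def interval_to_ticks(start_second, end_second):
    # A tick only counts once the full tick_interval has elapsed.
    first_tick = start_second // tick_interval + 1
    last_tick = end_second // tick_interval
    return list(range(first_tick, last_tick + 1))

print(interval_to_ticks(0, 45))      # []
print(interval_to_ticks(45, 326))    # [1, 2, 3, 4, 5]
print(interval_to_ticks(326, 1729))  # [6, 7, ..., 28]
```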

Now, please disagree if this sounds like over-engineering or unnecessarily complicated.

Quick thoughts, let's discuss more today though

  1. I think perhaps the presence of tick_interval is confusing here. In practice, we are not going to be changing that value once we determine the "right value". The reason a field exists in the schema to hold this value is purely defensive. I think it could be removed, and if it ever needs to change, it could get added as a field with a new version of the schema.
  2. Given that, the ability to change the 'bucket size' seems like an unnecessary computation. By that I mean, I'm not sure what value that creates or what problem it solves.
  3. By the histogram-like properties that are nice, I mean that each client reports a simple integer counting process starting from 0. It is true that you can transform data from the structure you listed (and many other structures) into histogram data on the backend. As I said on the patch, they are isomorphic. But again, I don't see what advantage there is to doing this on the backend.
  4. Perhaps the thinking here is that if you have multiple same-origin pages, they don't have to interact and can simply transmit the current timespan since the cookie's last send timestamp? But that will result in more events being sent, not less. We can accomplish the same time resolution by simply subtracting the remainder of the floor operation from the timestamp applied to the cookie?
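A sketch of the remainder-carrying idea in point 4 (the names, cookie layout, and tick interval here are hypothetical, not the actual instrument's): instead of storing the current time in the cookie after a flush, rewind it by the sub-tick remainder so no active time is lost between flushes.

```python
TICK_MS = 60_000  # hypothetical tick interval in milliseconds

def flush_ticks(last_send_ms, now_ms, next_tick):
    """Emit the whole ticks elapsed since the cookie's last-send
    timestamp, carrying the sub-tick remainder forward."""
    elapsed = now_ms - last_send_ms
    n = elapsed // TICK_MS
    ticks = list(range(next_tick, next_tick + n))
    # Store now_ms minus the remainder of the floor, not now_ms itself,
    # so the partial tick is counted on the next flush.
    new_last_send_ms = now_ms - (elapsed % TICK_MS)
    return ticks, new_last_send_ms
```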

@jlinehan

  1. I think perhaps the presence of tick_interval is confusing here. In practice, we are not going to be changing that value once we determine the "right value". The reason a field exists in the schema to hold this value is purely defensive. I think it could be removed, and if it ever needs to change, it could get added as a field with a new version of the schema.

I just thought determining the right value could take a while, and could happen after we already started collecting events. Nevertheless, you're probably right, and we'll never change that value. No strong opinions about this.

  2. Given that, the ability to change the 'bucket size' seems like an unnecessary computation. By that I mean, I'm not sure what value that creates or what problem it solves.

If we never wish for another tick_interval, for either the session length metric or other research/analysis, then that flexibility will have no value, I agree. And again, we probably never will :]
Re. "unnecessary computation": I believe it's not unnecessary. It needs to be done either in the client or the back-end. My suggestion was simply moving it to the back-end, no extra computation.

  3. By the histogram-like properties that are nice, I mean that each client reports a simple integer counting process starting from 0. It is true that you can transform data from the structure you listed (and many other structures) into histogram data on the backend. As I said on the patch, they are isomorphic. But again, I don't see what advantage there is to doing this on the backend.

I mentioned that in Gerrit CR: The advantage I see is: interval events will consume less bandwidth and less storage space for the case where the session has long periods where the page is hidden. Because instead of sending one event for each pending tick, we send just one interval event. I think this is very tangible, but it might end up being negligible in practice, depending on the ratio of sessions with long "hidden periods". So, once more, no strong position here!

  4. Perhaps the thinking here is that if you have multiple same-origin pages, they don't have to interact and can simply transmit the current timespan since the cookie's last send timestamp? But that will result in more events being sent, not less. We can accomplish the same time resolution by simply subtracting the remainder of the floor operation from the timestamp applied to the cookie?

I didn't think of this when proposing the idea. In my head, both approaches (ticks and intervals) have the same functioning client-wise, the only difference being the way the data is encoded in events.
Instead of iterating over pending ticks, just send one event, that's it. It's not possible that this approach sends more events than before.
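For a rough sense of scale, using the example intervals above and a hypothetical 60-second tick, the two encodings produce very different event counts for the same session:

```python
TICK_INTERVAL = 60  # seconds, hypothetical

intervals = [(0, 45), (45, 326), (326, 1729)]

# Per-tick encoding: one event per elapsed tick.
per_tick_events = intervals[-1][1] // TICK_INTERVAL   # 28

# Interval encoding: one event per flushed interval.
interval_events = len(intervals)                      # 3
```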

Overall, I feel like the advantages I saw when proposing this solution are not as big as I thought.
So, unless someone else has comments, I think we can move on with the current approach! :]

I mentioned that in Gerrit CR: The advantage I see is: interval events will consume less bandwidth and less storage space for the case where the session has long periods where the page is hidden. Because instead of sending one event for each pending tick, we send just one interval event. I think this is very tangible

This I agree with 100%, but I'd sooner see that realized as a uniform policy related to how the HTTP request maker is drawing down the output queue -- namely, having it batch multiple events together in the same request (which EventGate supports). That way we don't have to optimize things like this per instrument. (see https://phabricator.wikimedia.org/T239996)
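A rough sketch of that batching policy (the queue shape and function name are assumptions for illustration; EventGate accepts an array of events in a single POST body):

```python
import json

def drain_as_batch(queue, max_batch=10):
    """Take up to max_batch events off the output queue and encode
    them as one request body, instead of one request per event."""
    batch, remaining = queue[:max_batch], queue[max_batch:]
    return json.dumps(batch), remaining
```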

Change 627839 had a related patch set uploaded (by Jason Linehan; owner: Jason Linehan):
[schemas/event/secondary@master] session_tick: modifies properties and updates descriptions

https://gerrit.wikimedia.org/r/627839

Change 627839 merged by Bearloga:
[schemas/event/secondary@master] session_tick: modifies properties and updates descriptions

https://gerrit.wikimedia.org/r/627839

Change 619479 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Adds sessionTick instrument to measure session length.

https://gerrit.wikimedia.org/r/619479

Change 631198 had a related patch set uploaded (by Jason Linehan; owner: Jason Linehan):
[mediawiki/extensions/WikimediaEvents@master] sessionTick: Correct invocation of mw.config.get

https://gerrit.wikimedia.org/r/631198

Change 631198 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] sessionTick: Correct invocation of mw.config.get

https://gerrit.wikimedia.org/r/631198

The instrument is now merged, thanks to @Krinkle @Gilles @Nuria @mforns @Mholloway for providing such detailed and thoughtful code review. Next up will be a patch to enable on an internal wiki, perhaps OfficeWiki, and we can study the results, tune the various constants, and go from there.

Be sure to enable it on Beta Cluster first as well, so that it remains in sync with prod and allows for early feedback each week on what will go to prod the week after. E.g. due to changes to the instrumentation itself, or something else that might interact with it, accidentally or otherwise.

I believe this issue is due to the event being invalid due to the fact that the values of members of the tags object must be strings per the schema, but we were providing a boolean value for is_logged_in. Being mindful of https://phabricator.wikimedia.org/T217142#5295424, plus the fact that there is already a 'tags' field being used in logstash somehow, I've updated the schema to have a top-level is_logged_in field, which takes a boolean value.
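A minimal sketch of the mismatch (a hypothetical validator, not the real schema tooling): a boolean value in the tags map fails the strings-only constraint, which is what moving is_logged_in to a typed top-level field avoids.

```python
def tags_valid(tags):
    # Per the schema, values in the `tags` map must be strings.
    return all(isinstance(v, str) for v in tags.values())

tags_valid({"is_logged_in": True})    # False: boolean value rejected
tags_valid({"is_logged_in": "true"})  # True: stringified value passes
```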

Yeah, log messages are expected to use a flat structure for their context key-value pairs. This is fairly straightforward when writing logger->debug(string, hash) in an application. Not so obvious when piping EventGate into Logstash. The main issue there is type conflicts in the Elasticsearch index.

In the same vein, we'll probably want to rename test: { supports_passive: Boolean } } as well.

Hm, however these aren't just 'log messages', they are structured events. This is a session tick event, which aren't going to logstash.

However, more generally, in mediawiki/client/error we have a tags field, which clearly conflicts with logstash. We'd like to continue using a nested map field like this so that clients can add some extra information, without us having to add context specific top level fields to the schema (like is_centralnotice_banner_shown).

As far as I can tell, nested fields work fine in logstash. I understand that tags will cause a problem because of the type difference in the elasticsearch index, but can we just put this data in a newly named field that doesn't conflict? Perhaps context_tags or something?

Be sure to enable it on Beta Cluster first as well

Is there anything blocking this at this point?

Change 636659 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[schemas/event/secondary@master] Remove http.client_ip from session_tick without bumping version

https://gerrit.wikimedia.org/r/636659

Change 637539 had a related patch set uploaded (by Jason Linehan; owner: Jason Linehan):
[operations/mediawiki-config@master] sessionTick: Add event stream and enable on officewiki

https://gerrit.wikimedia.org/r/637539

Change 636659 merged by Ottomata:
[schemas/event/secondary@master] Remove http.client_ip from session_tick without bumping version

https://gerrit.wikimedia.org/r/636659

As part of trying to:

  • Create a clearer understanding of what work has been done towards Session Length
  • Ensure that there is a history of past work, comments and commits
  • Understand what more is needed to achieve our end goal of measuring and viewing how long users engage with our products

I am resolving this and pointing to https://phabricator.wikimedia.org/T267494 as the newly defined parent task.

Change 637539 merged by jenkins-bot:
[operations/mediawiki-config@master] sessionTick: Add event stream and enable on officewiki

https://gerrit.wikimedia.org/r/637539

Change 646810 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/extensions/WikimediaEvents@master] Export $wgWMESessionTick to JS (the old way)

https://gerrit.wikimedia.org/r/646810

Change 646810 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Export $wgWMESessionTick to JS (the old way)

https://gerrit.wikimedia.org/r/646810

Change 647108 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[operations/mediawiki-config@master] WikimediaEvents: Enable SessionTick on group0

https://gerrit.wikimedia.org/r/647108

Change 647108 merged by jenkins-bot:
[operations/mediawiki-config@master] WikimediaEvents: Enable SessionTick on group0

https://gerrit.wikimedia.org/r/647108

Mentioned in SAL (#wikimedia-operations) [2020-12-08T23:24:34Z] <mholloway-shell@deploy1001> Synchronized wmf-config/InitialiseSettings.php: WikimediaEvents: Enable SessionTick on group0 T248987 (duration: 02m 00s)

Change 647257 had a related patch set uploaded (by Bearloga; owner: Bearloga):
[mediawiki/extensions/WikimediaEvents@master] sessionTick: record config variables

https://gerrit.wikimedia.org/r/647257

Change 649740 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[operations/mediawiki-config@master] WikimediaEvents: Promote SessionTick to group1

https://gerrit.wikimedia.org/r/649740

Change 649740 merged by jenkins-bot:
[operations/mediawiki-config@master] WikimediaEvents: Promote SessionTick to group1

https://gerrit.wikimedia.org/r/649740

Mentioned in SAL (#wikimedia-operations) [2020-12-15T22:10:46Z] <mholloway-shell@deploy1001> Synchronized wmf-config/InitialiseSettings.php: WikimediaEvents: Promote SessionTick to group1 T248987 (duration: 01m 04s)

Change 647257 abandoned by Bearloga:
[mediawiki/extensions/WikimediaEvents@master] sessionTick: record config variables

Reason:

https://gerrit.wikimedia.org/r/647257