Page MenuHomePhabricator

Activity session ID seems to persist too long in some cases
Open, Needs TriagePublic

Description

The wikistories_consumption_event stream contains an activity_session_id field populated using mw.eventLog.id.getSessionId. This session ID is supposed to expire after 30 minutes of inactivity on the project.

While doing a quality check on the stream (T312262), I noticed that some sessions seem to persist much too long. Here are the ten longest in my dataset (where duration is measured by time between the first and last event with the same session ID).

approx_durationevent_count
2 days 07:20:38.8200005
1 days 05:31:53.3210002
1 days 00:59:31.68800013
0 days 21:30:48.00300034
0 days 09:03:21.2140005
0 days 08:37:44.95800018
0 days 06:59:10.0500002
0 days 05:47:38.34400011
0 days 05:20:11.3440002
0 days 05:13:03.42000027

Looking at the individual events, these all seem to follow the same pattern: all but one event in very plausible span of time (about 30 minutes or less), and then a final event many hours later. The page of the final event is usually but not always a page that has appeared previously in the session. This stream only includes events relating to Wikistories, a new beta feature on the Indonesian Wikipedia, so I expect to see break of longer than 30 minutes between events without a new session ID. However, this pattern involves breaks so long they're completely implausible.

After some digging, I found that 2 hr, 30 min is a rough dividing line between long sessions that seem correct (with lots of events spread evenly across the duration) and long sessions that seem incorrect (following the pattern of a plausible session followed by a single extremely delayed event). In my dataset of 942 sessions, 3.1% were longer than this cutoff, suggesting that the incidence of this issue is small but not negligible.

The pattern makes it seem like the session ID is getting reset after a long interval of inactivity, but not immediately, so there's one event with the old ID before a new one is generated.

For context, the wikistories_consumption_event stream is the first to use this session ID (the ID is generated by not actually logged by the session_tick instrumentation). This previously led to the discovery of one major bug with this session ID: T314622.

Event Timeline

@phuedx interested in your thoughts on this 😊

nshahquinn-wmf renamed this task from Session tick session ID seems to persist too long in some cases to Activity session ID seems to persist too long in some cases.Nov 3 2022, 6:35 PM

I'm sorry for not having responded earlier.

I was reminded of this task when I was looking at https://wikitech.wikimedia.org/wiki/Analytics/Sessions#Web:

mw.eventLog.id.getSessionId() is a newer, slightly "smarter" session identifier, which is barely used. This session identifier resets after 30 minutes of inactivity. It does this by subscribing to a session reset event emitted by the SessionTick instrument, which collects data for the SessionLength dataset and is the instrumentation that monitors activity (scrolling, clicking, or typing). However, because session management is done by the SessionTick instrument, this can lead to race conditions during instrument initialization – resulting in instruments using a stale session ID before a refresh is triggered.

I believe that last sentence explains what you saw in your dataset, @nshahquinn-wmf.

As for how to fix it, I think there are a couple of options:

  1. Promisify mw.eventLog.id.getSessionId(): Make mw.eventLog.id.getSessionId() return a promise that resolves with the session ID after the SessionTick instrument has fully initialised
  2. Merge SessionTick instrument into EventLogging: The SessionTick instrument has a few parts – the state manager, the regulator, and the sampling event submitter. We could move the state manager and the regulator into EventLogging, leaving the sampling event submitter in WikimediaEvents

Essentially: Make the instrument more complicated or make EventLogging more complicated. I haven't yet thought through the implications of the latter.


Related to 1 above: I wonder if EventLogging could be updated to accept a promise that resolves to an event, i.e.

type Event = Record<string, any>;

namespace mw {
  interface eventLog {
    submit( streamName: string, event: Event ): void;
    submit( streamName: string, Promise<Event> ): void;
  }
}

🤔

We defined how to measure an active session. We also defined and implemented the length of a session. And all of that is in the same codebase.

For reasons we put that definition in one codebase (catch-all bucket codebase for things that are wikimedia specific that will be in production but not be in third-party wikis). Then we said "we want a notion of a session, we want it to be available in the place that we get all other data from". Talking between codebases introduces a coordination issue. Making EventLogging rely on WMEvents is not acceptable (EL is deployed elsewhere/generic, WME is specific to the foundation).

In some circumstances, the thing that governs whether or not a session is "active" loads after the thing that tries to measure the session.

This has become an issue now because we recommend to record activity_session_id as part of the base stream.

  • Why did we choose to include this in our base stream?
  • Who uses it/when?
  • How frequently is this occurring and what is the impact on the data and decision making capacity?

This task came to the fore again after I noted the uptick in validation errors for the performer.active_browsing_session_token property of experiment-related analytics events. To contextualise that uptick, I've computed rate of validation errors for that property:

+-------------------------+----------+----------+------------------+
|     instrument_name     | n_errors | n_events | n_errors_percent |
+-------------------------+----------+----------+------------------+
| LoggedOutRetentionVisit |     1944 |  5995674 |             0.03 |
| PageVisit               |      119 |  1135175 |             0.01 |
+-------------------------+----------+----------+------------------+

WITH
  errors AS (
    SELECT
      CAST( json_extract( raw_event,'$.instrument_name' ) AS VARCHAR ) AS instrument_name,
      COUNT( * ) AS n
    FROM
      event.eventgate_analytics_external_error_validation
    WHERE
      year = 2025
      AND month = 8
      AND day >= 8 AND day <= 10
      AND errored_stream_name = 'product_metrics.web_base'
      AND strpos( message, '.performer.active_browsing_session_token' ) > 0
    GROUP BY
      1
  ),
  pmwb_events AS ( -- "events" causes the Presto query parser to fail
    SELECT
      instrument_name,
      COUNT( * ) AS n
    FROM
      event.product_metrics_web_base
    WHERE
      year = 2025
      AND month = 8
      AND day >= 8 AND day <= 10
    GROUP BY
      1
  )

SELECT
  errors.instrument_name,
  errors.n AS n_errors,
  pmwb_events.n AS n_events,
  ROUND( 100 * CAST( errors.n AS DOUBLE ) / ( errors.n + pmwb_events.n ), 2 ) AS n_errors_percent
FROM
  errors
    JOIN pmwb_events ON errors.instrument_name = pmwb_events.instrument_name
;

This is blocked by decommissioning of session length v1 and ladders up into simplifying event logging. (We should bring the regulator into xLab to add session length as a contextual attribute)