Page MenuHomePhabricator

Consider simplifying EventLogging dependency on mw.user's random ID generation
Closed, ResolvedPublic

Description

Currently, EventLogging does not perform any actions related to the generation, storage, or attachment of tokens such as session and pageview identifiers directly. That responsibility is largely delegated to mw.user, in the form of mw.user.sessionId and mw.user.getPageviewToken, and, sometimes, mw.user.generateRandomSessionId. EventLogging achieves its sampling with particular defaults, but also allows instrumentation or other calling code to override with an explicit token, often generated from the same methods in mw.user, but whose persistence is managed differently, depending on the needs of the instrument.

What we've found is that this has led to many different strategies for managing tokens and their lifetimes, along with different formats for these tokens, different field names in event schema that hold these tokens, wide inconsistency in how sampling occurs, and a lack of understanding for the subtleties both of the browser's persistence model, and how the sampling strategy changes depending on how the token is refreshed.

For these reasons, in Product's re-think of our instrumentation client library, we opted to make the handling of pageview and session identifiers opaque to the caller, something that took place and was managed internally within the library. That allows for:

  • Uniform field names, since the library will always set the value on the same field
  • Uniform ID formats, since the library will always be in control of the format
  • Uniform sampling strategy, since the library will always be in control of the sole determinant of the sampling outcome -- the pseudorandom identifier
  • Caller code does not need to manage associative identifiers on a per-instrument basis, which proliferates cookies, storage keys, etc.

Obviously there will be cases where it is advantageous for tokens to be managed using a different strategy that the library may not support. But in that case I would argue for a severe examination of the needs in order to determine whether this is truly necessary. If it is necessary, we should explore ways to acheive their needs from within the library, and update the API accordingly. The current level of delegation is causing a maintenance problem.

Proposal
  1. Factor out and parameterize token-generating logic from mw.user.generateRandomSessionId(void) to mw.util.generateRandomToken(int num_bytes)
  2. Make mw.user.generateRandomSessionId(void) call mw.util.generateRandomToken(80), this is retained for compatibility with all existing code that relies on it, but its deprecation should be noted in the documentation

This enables MEP Client Library (and others†) to call mw.util.generateRandomToken for its identifier needs without depending on mw.user

†: For example, this function in WikimediaEvents::searchSatisfaction.js would be unnecessary if mw.util.generateRandomToken(int num_bytes) existed:

/**
 * Generate a unique token. Appends timestamp in base 36 to increase
 * uniqueness of the token.
 *
 * @return {string}
 */
function randomToken() {
  return mw.user.generateRandomSessionId() + Date.now().toString( 36 );
}

Event Timeline

mpopov added a subscriber: Krinkle.

@jlinehan @Krinkle: updated the proposal in the description; what do you think?

mpopov renamed this task from Consider simplifying EventLogging dependency on mw.user to Consider simplifying EventLogging dependency on mw.user's random ID generation.Feb 27 2020, 8:57 PM

For all intends and purposes, mw.user.generateRandomSessionId is indeed a utility method for generating random tokens and always has been. It was specifically created for EventLogging-related purposes and not used for anything else as far as I'm aware. When these were added, we decided to host them under "mw.user", I guess because there is a cookie/session related aspect to them, and because hash size (and uniqueness) we provide there is closely linked with our knowledge of how many uses we have etc.

I admit it's not a very strong argument, but I also don't yet see a strong argument for why "util" would be semantically a better fit. If anything, "util" is the definition of not having a good place, as it's a last resort for unassorted stuff. Note that both of these classes are provided by MediaWiki core and generally loaded on all page views. Given mwext-EventLogging is a MediaWiki extension and a dependency on core is effectively the lightest form of dependency one can have, both seem equally good to me. Why is a dependency on "mw.util" considered better than a dependency on "mw.user"?

In my experience, renames like this tend to cause short-term maintenance work and medium-term migration/confusion. I generally discourage that unless there's a stronger rationale (e.g. I reason that will likely help dismiss a proposal to rename the method back in the other direction two years from now), and a worthwhile benefit that justifies the maintenance work and mid-term confusion.

The point about WikimediaEvents's searchSatisfaction randomToken function no longer being needed sounds interesting. Can you elaborate on why it would no longer be needed?

Thank you for the very thoughtful and detailed reply, @Krinkle! I will have to discuss this with @jlinehan. Would it be okay, then, to parameterize mw.user.generateRandomSessionId and provide a default value for the length parameter? This would make it compatibility with everything that calls it as is, and enable us to use it in the Modern Event Platform Client Library without having to call it twice and then trim the second output. (Which we can do, it just wouldn't look very elegant or nice.)

The point about WikimediaEvents's searchSatisfaction randomToken function no longer being needed sounds interesting. Can you elaborate on why it would no longer be needed?

I was just thinking that they would be able to replace calls to randomToken with mw.user.generateRandomSessionId(128) or mw.user.generateRandomSessionId(256) to get the length that satisfies their non-collision requirements.

†: For example, this function in WikimediaEvents::searchSatisfaction.js would be unnecessary if mw.util.generateRandomToken(int num_bytes) existed:

I think we are going to have the discussions about defaults. On my opinion the random generation should always provide a suitable default, documented as such. As it does now the entropy default is 2^80 which is suitable for all uniqness use cases to date. Code like the following:

return mw.user.generateRandomSessionId() + Date.now().toString( 36 );

speaks more as to developers not calculating correctly the entropy needed for the use case at hand cause, as it is, it does not require a uniqness beyond 2^80. Makes sense?

I think we are going to have the discussions about defaults. On my opinion the random generation should always provide a suitable default, documented as such. As it does now the entropy default is 2^80 which is suitable for all uniqness use cases to date.

Oh absolutely! And I suggested using that as the default value – both for ease of use and for compatibility with all the existing calls to it.

Oh absolutely! And I suggested using that as the default value – both for ease of use and for compatibility with all the existing calls to it.

What other cases there are hat are not satisfied by the 2^80 entropy case?

Looked at this closer, this looks like it's fine for now. We can revisit during patch phase 4 of EventLogging (T238544) and decide as needed then.

Seeing as we've considered it and have come to the conclusion in March that nothing more needs to be done, we can close this out.