Page MenuHomePhabricator

Provide standard/reproducible way to access a PageToken
Closed, ResolvedPublic5 Story Points

Description

[update 2018-08-21]
We are in agreement to increase the length of mw.user.generateRandomSessionId to 80 bits
@Nuria will be changing all session identifiers for mediawiki and that same code will be used in the existing mw.user.getPageviewToken
Nothing left from web-side other than to review that patch when it arrives. Blocked until then.


During T191532 it was flagged by Tilman that we want the pageToken for the PageIssues schema to match the pageToken for the ReadingDepth schema. Apparently this made analysis of Page previews difficult, and we're keen to make it easier for analysis of PageIssues (allows joining both tables in the queries later)

Right now, unlike the user session token, the page token is generated like so:

In WikimediaEvents:

user.generateRandomSessionId() +
                                Math.floor( mw.now() ).toString() +
                                user.generateRandomSessionId(),

Given the token is tied to the current time and 2 random user session tokens, this means the pageToken will change every time the page is refreshed or revisited.
Note: Per T191248 the 2nd random session ID is likely redundant.

In Minerva page issues test:

user.generateRandomSessionId() +
                                        Math.floor( mw.now() ).toString()

Current proposal

Add a getter for the private pageViewToken property defined in the ext.eventLogging.subscriber module (see T201124#4478198 for more detail).

Old proposals

1

In core, we'll add the mw.Title.token method
This will be cached after first run, meaning that pageToken will be consistent across all our schemas.

var token;
mw.Title.token = function () {
   if ( token) return token;
else {
token = user.generateRandomSessionId() +
                                        Math.floor( mw.now() ).toString();
return token; }
};

Acceptance criteria

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 451885 had a related patch set uploaded (by Niedzielski; owner: Stephen Niedzielski):
[mediawiki/extensions/EventLogging@master] New: add mw.eventLog.pageviewToken and newPageInteractionToken()

https://gerrit.wikimedia.org/r/451885

Change 451886 had a related patch set uploaded (by Niedzielski; owner: Stephen Niedzielski):
[mediawiki/extensions/WikimediaEvents@master] Update: reading depth to use eventLog.pageviewToken()

https://gerrit.wikimedia.org/r/451886

Add a getter for the pageViewToken property defined in the ext.eventLogging.subscriber module.

Done in https://gerrit.wikimedia.org/r/451885 as mw.eventLog.pageviewToken() and mw.eventlog.newPageInteractionToken(). The former is now used by reading depth (see below remark) and can be used by Popups' pageToken and that latter can be used by Popups' linkInteractionToken and Minerva's page issues.

Update WikimediaEvents to use the new method

@Jdlrobson, @ovasileva, this is now done for reading depth: https://gerrit.wikimedia.org/r/451886. There are a number of other mw.user.generateRandomSessionId()-like calls from different stakeholders but it's not clear if changes are wanted as part of this task. For example, ext.wikimediaEvents.citationUsage.js uses a "page_token" field that directly calls mw.user.generateRandomSessionId() and ext.wikimediaEvents.geoFeatures.js saves a similar token with timestamp to storage for later retrieval but I don't want to damage other user's logging so I recommend new tasks given the diversity.

Update Minerva to use the new method
When refactoring, please keep in mind, that in Page previews repo, when we create a pageInteractionToken -> this token has to be unique for each preview. We cannot use the same token if the user dwells over the same link twice. Please DOCUMENT this in page previews as it's not correctly documented. The goal is to distinguish between two hovers on the same link.

Not done yet for Minerva or Popups. There's new discussion on this task and it's unclear whether the current proposal is wanted or not.

Active schema documentation (https://meta.wikimedia.org/wiki/Schema:ReadingDepth + https://meta.wikimedia.org/wiki/Schema:PageIssues) and mw:Reading/Web/Quantitative_Testing are updated with more guidance around pageViewToken,

Not done yet. I'd like to update the field from pageToken to pageviewToken as well if that won't cause problems. Will update pending discussion conclusion.

@Ottomata, thanks for the note! Should we move forward with the proposal in EventLogging or wait for the up and coming? I don't want to increase tech debt by trying to eliminate it the wrong way. I'm equally happy to move forward with or abandon it.

Add a getter for the pageViewToken property defined in the ext.eventLogging.subscriber module.

Done in https://gerrit.wikimedia.org/r/451885 as mw.eventLog.pageviewToken() and mw.eventlog.newPageInteractionToken().

Can someone now document the difference between mw.eventLog.pageviewToken() and mw.eventlog.newPageInteractionToken() (how is each calculated, and when, and how long does it persist)?

The former is now used by reading depth (see below remark) and can be used by Popups' pageToken and that latter can be used by Popups' linkInteractionToken and Minerva's page issues.

I'm not sure I understand. The main direct goal of this task is that on the same pageview, the pageToken field in PageIssues events should have the same value as the the pageToken field in ReadingDepth (so that the resulting tables can be joined during analysis). How is that achieved by basing each on different functions?

And to expand on T201124#4494829 to hopefully make the point more clearly: I also do not understand this sentence from the task description:

in Page previews repo, when we create a pageInteractionToken -> this token has to be unique for each preview.

Why does it have to be unique for each preview? We (hopefully) didn't calculate the linkInteractionToken value in Popups from a per-pageview token alone, did we?

We cannot use the same token if the user dwells over the same link twice.

Well yes, that has been stipulated in the Popups schema description since April 2016:
"One-time token representing interaction with a link. Generated fresh for each link dwell, even successive dwells on the same link."
@Jdlrobson , does "Please DOCUMENT this in page previews as it's not correctly documented" in the task description refer to comments in the page previews code instead?

Jdlrobson updated the task description. (Show Details)Aug 11 2018, 1:27 AM
Nuria added a subscriber: Nuria.Aug 11 2018, 9:00 PM

I'm not sure if the timestamp is really needed, but if so, I suppose we could add the last 9 digits of the timestamp (would add ~11 days of distribution at millisecond precision).

It shouldn't be , the random token generation is unique to 500+ million.

Niedzielski added a comment.EditedAug 13 2018, 1:20 PM

Can someone now document the difference between mw.eventLog.pageviewToken() and mw.eventlog.newPageInteractionToken() (how is each calculated, and when, and how long does it persist)?

I want to clarify that these new functions are in code review and not yet merged.

newPageInteractionToken() is a proposed function to generate a new token. The token is calculated as a new random session ID prepended to the last 9 digits of the mw.now() timer, excluding the fractional part. It's uncached and generates a new token each time it's called but a caller may save the result if wanted. I was thinking that callers could use it when they wanted a unique token to track a metric like reading depth or page preview link hovers and hoped callers could then avoid having to redefine their own token concept.

pageviewToken() is the proposed accessor function for the cachaed pageview token. The token is the stored result of a prior call to newPageInteractionToken() and is regenerated on each pageview. An example usage of this function is the page_token used in the reading depth schema which was previously defined as new random session ID prepended to all non-fractional digits of mw.now() prepended to another new random session ID.

I'm not sure I understand. The main direct goal of this task is that on the same pageview, the pageToken field in PageIssues events should have the same value as the the pageToken field in ReadingDepth (so that the resulting tables can be joined during analysis). How is that achieved by basing each on different functions?

This task has multiple parts as I understand it. The first is in EventLogging and requires building a general purpose function for accessing a pageview token that is refreshed on each new view but otherwise cached, that's pageviewToken(), and another function to generate new tokens for other page interactions, that's newPageInteractionToken(). Another part is to actually make use of these functions in other repos such as Popups and MinervaNeue to eliminate redundant code and similar but custom token concepts as well as actually unify pageview tokens to reference a single value, pageviewToken().

We (hopefully) didn't calculate the linkInteractionToken value in Popups from a per-pageview token alone, did we?

No, we generate a new random session ID for each interaction. It's only tied to the pageview token in the sense that they're generated using the same random session ID function.

@Tbayer, the following came up in @Krinkle's code review and I was hoping to solicit your guidance:

Would be good to clarify whether or not the Popups use case is meant to use this, and if so, whether (per @Tbayer on the task) whether it indeed expects a new token for each hover event, because if so, it probably wouldn't need a method as it would call generateRandomSessionId() directly in that case. Or is this about the timestamp part?

Popups currently uses generateRandomSessionId() directly. The patch for this task defines a new "page interaction token" concept via newPageInteractionToken() but, per @Nuria and @Krinkle's comments, I don't know if it's wanted. It's identical to the current approach in Popups which is to call generateRandomSessionId() except that it also appends a timestamp too. Do you share @Nuria's perspective that generateRandomSessionId() alone should be adequate and we can omit the timestamp?

This also touches on a question you asked earlier:

in Page previews repo, when we create a pageInteractionToken -> this token has to be unique for each preview.

Why does it have to be unique for each preview? We (hopefully) didn't calculate the linkInteractionToken value in Popups from a per-pageview token alone, did we?

Are you saying that the token generated is random but not necessarily unique? It _is_ intended to be regenerated on each preview, right?

Milimetric moved this task from Incoming to Radar on the Analytics board.Aug 13 2018, 3:28 PM

@Tbayer and I discussed and reviewed. Here are our conclusions:

  • For page issues reading depth specifically, generateRandomSessionId() would be adequately random. However, the solution proposed in the outstanding patch is general purpose and would be applied to other metrics so the timestamp _is_ wanted.
  • pageviewToken() can be used by Popups (and other schemas) needing to share the same token associated with a pageview.
  • newPageInteractionToken() can be used by Popups (and other schemas) needing to generate a new unique token associated with a given pageview; naming suggestions are welcome.

The behavior appears to be correct too:

  • pageviewToken() is refreshed on a new pageview and otherwise cached.
  • newPageInteractionToken() always generates a new token. This function is also used to initialized _the_ pageviewToken.
  • pageviewToken() is initialized prior to usage.

The intended outcome of these changes are:

  • Page interaction token generation code is consolidated to a single place.
  • Existing and future page interaction token generation code uses the new definition and therefore doesn't need to be reevaluated for analytical correctness again and can also become part of the analytical nomenclature,
  • The upcoming page issues A/B test has a way to track session depth using the new pageviewToken() API (and if we ever use the Popups schema again, it can use it too).

The remaining open questions and action items are:

  • How many places do we want to refactor to use the new code as part of this task? There are a number of mw.user.generateRandomSessionId()-like calls from different stakeholders so my suggestion is to either spin off new subtasks for each consultation with each or do nothing. ReadingDepth is done but I intend to do Popups and Minerva as part of this task.
  • Update the schema docs (and maybe the fields to use pageviewToken?).
  • @Ottomata, it's not clear if we should make these changes now to EventLogging or wait for the new implementation you mentioned. Any recommendation? /cc @ovasileva

@Tbayer and I discussed and reviewed. Here are our conclusions:

  • For page issues reading depth specifically, generateRandomSessionId() would be adequately random. However, the solution proposed in the outstanding patch is general purpose and would be applied to other metrics so the timestamp _is_ wanted.

To add more detail regarding the rationale here: As alluded to above and stated in the code, the 64 bits of entropy from generateRandomSessionId() are sufficient to prevent token collisions with ~99% probability in a sample of 500 million, which is enough for the upcoming page issues A/B test which is envisaged to feature up to >100 million pagetokens per project. That said, we also want to keep this future-proof so that users don't even have to bother with such calculations, and it's not totally inconceivable that it could on occasion be used for larger sets too - e.g. the Virtualpageview table currently consists of 7.5 billion link interactions (it doesn't use a token for them and is only an auxiliary schema feeding into the actual aggregate table used for analysis, but this illustrates the possible dimensions). That's a motivation for continuing to use the timestamp as a second source of randomness. However, also adding a second call to generateRandomSessionId() seems overkill indeed.

(Also, the same code documentation warns that generateRandomSessionId()'s "uniqueness should not be depended on unless the browser supports the crypto API", but I haven't researched how much of a restriction that is in practice.)

@Ottomata, it's not clear if we should make these changes now to EventLogging or wait for the new implementation you mentioned. Any recommendation? /cc @ovasileva

Discussed with @phuedx and we shouldn't wait.

Yeah that makes sense. Likely the logic you guys come up with here will be usable for whatever the new systems are too, and we can adapt/port it over. It will be a while before we get to work on this part of the system.

@Tbayer Is it important for the consumption of the Popups schema and other event data that the timestamp is included in the token? It might be easier, as well as simpler and more secure, if this is done during the consuming of the data.

The timestamp recorded in the EventLogging capsule comes from our own infrastructure, rather than from user's devices. And I would assume that when querying the data to make a correlation, that we already put some boundary on how far back in time events are correlated with each other. For example, within a week.

If that is the case, then you effectively already have a random-token that is partitioned by time; That might make adding (part of) a timestamp client-side redundant.

The reason I am asking this is because:

  1. It seems Analytics and Research did not request the addition of the timestamp.It was merely added as a precaution to some of the campaigns, possibly due to a misunderstanding. If you and Nuria both believe it isn't needed, we should consider removing it from the campaigns that have it, instead of standardising it.
  2. The addition of a utility to create a "token plus partial timestamp" complicates the API of the EventLogging client library – by having two similar concepts called "token", instead of one. These kinds of things add up and make our software harder to learn and understand. We should not do this unless we are absolutely sure that it adds value.

Change 453077 had a related patch set uploaded (by Niedzielski; owner: Stephen Niedzielski):
[mediawiki/extensions/EventLogging@master] New: add mw.eventLog.newPageInteractionToken()

https://gerrit.wikimedia.org/r/453077

Hi @Krinkle, I appreciate the goal of keeping the code as simple and streamlined as possible. That said, I'm not sure I fully understand the proposed alternative:

@Tbayer Is it important for the consumption of the Popups schema and other event data that the timestamp is included in the token?

From this question, it is not clear to me whether you had already read T201124#4503440? (Happy to dig deeper into the rationale as outlined there if needed - just want to make sure we are at the same point of the conversation.)

It might be easier, as well as simpler and more secure, if this is done during the consuming of the data.

By "consuming of the data", do you mean the analysis step (running queries etc.), or the ingestion of events by the servers? And does "easier, as well as simpler and more secure" refer to the work of analysts, or to the coding effort developers need spend on this task?

The timestamp recorded in the EventLogging capsule comes from our own infrastructure, rather than from user's devices.

Yes, I am aware of that. How is it relevant for the question at hand?

And I would assume that when querying the data to make a correlation, that we already put some boundary on how far back in time events are correlated with each other. For example, within a week.
If that is the case, then you effectively already have a random-token that is partitioned by time; That might make adding (part of) a timestamp client-side redundant.

I am not sure I understand. What does it mean to have a token partitioned by time? And are you assuming that events with the same pageToken will always have the same EL timestamp?

Change 451885 merged by jenkins-bot:
[mediawiki/extensions/EventLogging@master] New: add mw.eventLog.getPageviewToken()

https://gerrit.wikimedia.org/r/451885

Krinkle added a comment.EditedAug 16 2018, 7:02 PM

@Tbayer If I understand correctly, you're stating that the current token is unique enough for our purposes. That's good to hear and obsoletes most of my other questions.

If at some point it becomes a concern, either due a higher insertion rate, or due to the data set we have steadily increasing over time; I was thinking how we could mitigate that. It seems we both think we can mitigate that by doing what I call "fragmenting the token by time", that is, reducing the scope over which the token is expected to be unique from "all time ever" to "a range of time" (e.g. week, day, or even second).

We can do that in two ways: 1) Append part of a timestamp from the client's device to the token – or – 2) Reduce the query scope at time of analysis. When I said "easier", I was not referring to the human effort required on either developer or researcher part. It was a vague comment, and I apologise for the confusion.

Appending the timestamp portion:

  • requires a new token concept that is incompatible with our current one,
  • adds maintenance overhead for both definitions indefinitely,
  • adds confusion both for developers and researchers alike,
  • puts trust on client device rather than our infrastructure (admittedly not a big deal, given the data comes from the client anyway),
  • only helps for new data, not existing data.

I think a reduced query scope is likely something you're already doing for other reasons, and I suspect would give you more confidence by not having to rely on the implementation; it also has the side effect of naturally reducing the required uniqueness of the token.

Anyway, as mentioned in code review as well, I don't oppose the proposal if that's what you believe is best. I just wanted to make sure the decision is made with the above knowledge.

Change 453290 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/skins/MinervaNeue@master] PageIssues should use new standard pageToken getter

https://gerrit.wikimedia.org/r/453290

Nuria added a comment.EditedAug 17 2018, 2:31 PM

That said, we also want to keep this future-proof so that users don't even have to bother with such calculations, and it's not totally inconceivable that it could on occasion be used for larger sets too - e.g. the Virtualpageview table currently consists of 7.5 billion link interactions (it >doesn't use a token for them and is only an auxiliary schema feeding into the actual aggregate table used for analysis, but this illustrates the possible dimensions).

@Tbayer: Sorry, I do not understand this example, could you clarify? I think do not understand what the pagetoken represents in this and future experiments. Are you trying to identify the interaction of a user with a pageview going forward so it can be cross referenced in all schemas? So the "unit of diversion" of your AB test is the pageview not the session? (asking cause current implementation is not doing that as far as I can see)

Change 453290 merged by jenkins-bot:
[mediawiki/skins/MinervaNeue@master] PageIssues should use new standard pageToken getter

https://gerrit.wikimedia.org/r/453290

Status update: +2ed the WikimediaEvent patch but that's blocked on the EventLogging patch which I've only +1ed, pending the ongoing conversation here between @Krinkle, @Nuria @Tbayer.

I'm thinking we might need to sync in person early next week if we cannot work out the misunderstanding over Phabricator. I'd hope a 15 minute cool would get us on the same page.

Nuria added a comment.EditedAug 20 2018, 5:22 PM

@Jdlrobson Let's make sure to clarify how this token is to be used in future experiments to +1 patch. For example: (1) page previews is mentioned and as far as I know it does not load EL module, thus it does not have access to this page token at all (for now, this might change in the near future). Also (2) for events to be able to access the same token they both must be sampled within the same pageview (as code stands now) .

Let's confirm we are all in the same page with issues before calling the code done.

That said, we also want to keep this future-proof so that users don't even have to bother with such calculations, and it's not totally inconceivable that it could on occasion be used for larger sets too - e.g. the Virtualpageview table currently consists of 7.5 billion link interactions (it >doesn't use a token for them and is only an auxiliary schema feeding into the actual aggregate table used for analysis, but this illustrates the possible dimensions).

@Tbayer: Sorry, I do not understand this example, could you clarify?

As I indicated, this particular EL schema/table (VirtualPageView) is not directly used for analysis currently and we are not proposing to enrich it with a page token at this point. However, I think there is agreement that we want to use this task to provide a standard, reliable way to generate a page token that can also be used in future experiments. And this example shows that we should not rely on these always involving less than 500 million tokens, considering that VirtualPageview table already contains over an order of magnitude more page interactions. So to prevent token collisions (and the data confusion they can create) with reasonable certainty in future applications, and avoid developers and data analysts having to worry about them in the first place when designing a new experiment, we should include parts of the timestamp or other sources of entropy.

I think do not understand what the pagetoken represents in this and future experiments. Are you trying to identify the interaction of a user with a pageview going forward so it can be cross referenced in all schemas? So the "unit of diversion" of your AB test is the pageview not the session? (asking cause current implementation is not doing that as far as I can see)

No, the "unit of diversion" in the upcoming page issues experiment (like previously for Popups/page previews) is indeed sessions, not pageviews. Rather, the purpose of the pagetoken field has always - since back in 2016 when it was introduced for Popups - been to uniquely identify a particular pageview within the dataset, e.g. because there can be several events during that pageview and the analysis requires being able to connect them. This is not changing. For the upcoming page issues experiment, we want to track a metric for which data was already being recorded in a separate, existing EL schema (ReadingDepth), and the web team decided to keep using that separate schema alongside the new PageIssues schema. So we need the page token to have the same value for events sent during the same pageview in either schema.

Nuria added a comment.Aug 20 2018, 9:02 PM

However, I think there is agreement that we want to use this task to provide a standard, reliable way to generate a page token that can also be used in future experiments. And this example shows that we should not rely on these always involving less than 500 million tokens, >considering that VirtualPageview table already contains over an order of magnitude more page interactions

Ok, I think I understand better. Now, the code as is now does not assign a token per interaction (an interaction being different previews of several pages in a main page, for example). Rather it assigns a token per main page. Thus a user reading, say, barack obama page and doing a preview of every single page linked on that page (100's of interactions) has access to one pageview-token, the one that was issues when the main barack obama page was loaded. The token is not unique per interaction, is unique per main page viewed.

@Tbayer If I understand correctly, you're stating that the current token is unique enough for our purposes. That's good to hear and obsoletes most of my other questions.

Depending on the precise meaning of "current" here, we may still have a misunderstanding.
To back up for extra clarity: Per the task description, the pageToken as it is live in production currently (at least for ReadingDepth) is calculated as follows:

user.generateRandomSessionId() + Math.floor( mw.now() ).toString() + user.generateRandomSessionId()

I totally agree with your earlier observation ( T191248) that using two of these random IDs plus the full timestamp has been overkill, and that one of them plus the timestamp should suffice. I also agree that further reducing the timestamp's contribution as in @Niedzielski's patch is reasonable. But it looks like we may differ on how safe it would be to not use the timestamp at all and solely rely on one random ID.

If at some point it becomes a concern, either due a higher insertion rate, or due to the data set we have steadily increasing over time;

Yes, I guess you mean a similar kind of the scenario here as I have been cautioning against above in T201124#4503440 (alongside another aspect we haven't fully discussed yet here: a potential lack of randomness from generateRandomSessionId()for some browsers).

I was thinking how we could mitigate that. It seems we both think we can mitigate that by doing what I call "fragmenting the token by time", that is, reducing the scope over which the token is expected to be unique from "all time ever" to "a range of time" (e.g. week, day, or even second).
We can do that in two ways: 1) Append part of a timestamp from the client's device to the token – or – 2) Reduce the query scope at time of analysis. When I said "easier", I was not referring to the human effort required on either developer or researcher part. It was a vague comment, and I apologise for the confusion.
Appending the timestamp portion:

  • requires a new token concept that is incompatible with our current one,

From the the data analysis standpoint it seems we will break compatibility anyway, by removing at least the second random ID.

  • adds maintenance overhead for both definitions indefinitely,
  • adds confusion both for developers and researchers alike,

No to dismiss these two concerns (see also my alternative suggestion below), but on the other hand it seems we have already been using differing definitions for different tokens during the last two years at least.
And at least from the data analyst/data scientist standpoint, not having to worry about the uniqueness of (at least) the pagetokens probably outweighs any confusion that may arise from pagetokens and other random IDs being calculated in a different way.

  • puts trust on client device rather than our infrastructure (admittedly not a big deal, given the data comes from the client anyway),
  • only helps for new data, not existing data.

Well of course the changes proposed in this task will affect only new data, regardless of which option we take...

I think a reduced query scope is likely something you're already doing for other reasons, and I suspect would give you more confidence by not having to rely on the implementation; it also has the side effect of naturally reducing the required uniqueness of the token.

First, in practice, queries for such experiments routinely cover scopes that are longer than envisaged in this proposal (e.g. a timespan of eight weeks in one of the page previews A/B tests). Having to break them up into smaller chunks of time would cause significant overhead for analysts, not to speak of the burden to always remain aware of the necessity of this workaround and document the tokens' unreliability (non-uniqueness) in the absence of it.
But what's more, another serious concern about option 2) is that events with the same pagetoken will frequently fall into different time compartments and thus would not be joinable in this approach. E.g. from our preliminary examination of the ReadingDepth data, it looks like a third or more pageviews last at least a minute (in the sense of the time passing between the first and the last event during that pageview).

Anyway, as mentioned in code review as well, I don't oppose the proposal if that's what you believe is best. I just wanted to make sure the decision is made with the above knowledge.

Yes, I think @Niedzielski's version with the last 9 digits of the timestamp added is the best option we have so far. But just so as not to ignore the code consistency concerns you brought up:

Another alternative might be to directly increase the entropy provided by user.generateRandomSessionId(), e.g. by including the 9 timestamp digits there alongside the part form Math.random. I think this may make for slightly cleaner code and could also have the added benefit of increasing protection against collisions in other future applications as well. (I guess the current 64 bits were chosen during a time where EL was confined by the constraints of the old MySQL/MariaDB setup, before the new Hadoop EL infrastructure that was introduced last year made it possible to have much larger tables/experiments.) That said, it might have wider repercussions which would not be worth sorting out as part of this task.

@Tbayer Nuria, Jon and I had a chat on IRC and I'd like your feedback on our proposal:

Nuria brought up something we forgot: The schemas also have a sessionId (per browsing session, multiple page loads), and tests are already shared by site. Assuming confidence in sessionId (which was based on 500M uniques per month globally, whereas we only need it per site), that means pageviewToken should be fine to leave as only being based on user.generateRandomSessionId(). Because it only has to be unique within each session.

That removes the need for adding timestamps.

The proposal is to use the new getPageviewToken() method from https://gerrit.wikimedia.org/r/453290 in Popups and ReadingDepth as-is.

This means:

  • For Popups, the format of pageToken stays the same as it is now (coming from EL getPageviewToken instead of directly from generateRandomSessionId).
  • For ReadingDepth, the format of pageToken will change from generateRandomSessionId + mw.now + generateRandomSessionId to just generateRandomSessionId (from EL getPageviewToken).

Does that sound good?

@Tbayer Nuria, Jon and I had a chat on IRC and I'd like your feedback on our proposal:
Nuria brought up something we forgot: The schemas also have a sessionId (per browsing session, multiple page loads),

That aspect hadn't been forgotten. Perusing the log (sorry about missing the ping earlier!), it looks like the proposal might be based on a misunderstanding:

<nuria_> Krinkle: if our AB test always includes (per tilman) a session Id the pageToken only has to be unique

It's true that the schema(s) for the upcoming page issues A/B test include a session ID, and as I already said last week above, it will likely stay below 500 million anyway (rather, the current discussion is about making the token reusable and future-proof).
But I never said that our A/B tests will "always" include a session ID. On the contrary, there will very likely be A/B tests in the future where we sample by pageview, not by session, and wouldn't want to record a session ID, or want to remove the session ID from existing data for other reasons (e.g. as a privacy precaution when releasing data). In fact, as @Jdlrobson may recall from T191532, we already seriously considered sampling by pageview instead of session for this A/B test.

And to zoom out a bit: We are in the midst of a larger effort to make A/B tests on EL less "labor-intensive and error-prone" (to quote from one current document), including by standardizing parts and making it easier to reuse them. These goals are not well served by a page token that is only reliable under certain circumstances and in a certain combination with another field and with certain caveats (see also my notes in T201124#4516687, including the still open question about browsers), or whose entropy only barely scrapes by the reliability limit with just a bit or two left to spare.

and tests are already shared by site. Assuming confidence in sessionId (which was based on 500M uniques per month globally,

BTW, that appears to have referred to the old comScore visitors number - we currently have about 1.5 billion for the new global unique devices metric for Wikipedia alone.

whereas we only need it per site), that means pageviewToken should be fine to leave as only being based on user.generateRandomSessionId(). Because it only has to be unique within each session.
That removes the need for adding timestamps.
The proposal is to use the new getPageviewToken() method from https://gerrit.wikimedia.org/r/453290 in Popups and ReadingDepth as-is.
This means:

  • For Popups, the format of pageToken stays the same as it is now (coming from EL getPageviewToken instead of directly from generateRandomSessionId).
  • For ReadingDepth, the format of pageToken will change from generateRandomSessionId + mw.now + generateRandomSessionId to just generateRandomSessionId (from EL getPageviewToken).

(seems this list is missing PageIssues - as mentioned earlier, the Popups schema is inactive for the time being and updating it is not a priority here)

Does that sound good?

Again, this doesn't work for the intended purpose of having a pageview token that can be relied on to be unique with reasonable certainty in future circumstances. I think we should go with @Niedzielski's version instead. Alternatively, any thoughts on my proposal at the end of T201124#4516687?

Following another IRC chat, I proposed to @Tbayer that if sessionId+randId is sufficient but inconvenient for queries and inconvenient to require in each schema, that we could also combine them client-side as part of the beaconed event.

However, Tilman added that it isn't just about whether we include schemaId as field for querying, we actually actively want to disallow ourselves the ability to correlate page views from the same session in some future A/B tests. If all page views from the same session started with the same prefix, that would be bad.

This convinced me that randId+timeslice is indeed the way to go.

Change 454190 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/extensions/EventLogging@master] Use mw.user.stickyRandomId() internally

https://gerrit.wikimedia.org/r/454190

Nuria added a comment.Aug 21 2018, 3:14 AM

Again, this doesn't work for the intended purpose of having a pageview token that can be relied on to be unique with reasonable certainty in future circumstances.

Ideally we would keep all tokens in hex (that way they can be consumed by InSample method, for example) so a possible solution is to do away with timestamp and just make the random generation be so in a larger space. This will keep api clean as there will be one consitent API for random token generation. How big do you think is the space in which the token needs to be random? While we might be sampling per pageview for AB tests in the future note that we are always testing per site, not globally. Thus is the per site numbers (not the global ones) that matter.

Following another IRC chat, @Nuria said that two randomId does indeed suffice, and that because we cannot re-use sessionId, it would make most sense to generate two new ones. This is what Popups is currently doing already and is worth standardising (except without the timestamp).

Further more, as optimisation, rather than calling into crypto twice, we can call it once with request for more bytes (logically identical). That appears to solve all the issues here.

  • mw.user.generateRandomSessionId (aka "generate random id") - would change from 64 bits (500M uniques) to a larger size. Two calls would equal 128 bits (32 characters instead of 16 characters). Exact number to be determined by @Tbayer and @Nuria.
  • Audit existing uses of generateRandomSessionId() and change as needed to call only once, without other additives (e.g. timestamps).
  • For rands that should persist in the browsing session, use mw.user.sessionId (same as before)
  • For rands that should persist in the page load only, use mw.user.stickyRandomId instead. This is what we'd use for pageToken in ReadingDepth and Popups. QuickSurveys uses this already.

Separate from this, @Jdlrobson, @Nuria, and myself have all been confused at one point or another with these method names. So we should definitely work on that next.

Change 454355 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/core@master] Rename stickyRandomId to getPageviewToken

https://gerrit.wikimedia.org/r/454355

Change 454359 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/skins/MinervaNeue@master] getPageToken now defined in core

https://gerrit.wikimedia.org/r/454359

Again, this doesn't work for the intended purpose of having a pageview token that can be relied on to be unique with reasonable certainty in future circumstances.

Ideally we would keep all tokens in hex (that way they can be consumed by InSample method, for example) so a possible solution is to do away with timestamp and just make the random generation be so in a larger space. This will keep api clean as there will be one consitent API for random token generation.

Yes, that seems to be similar to what I suggested above at the end of T201124#4516687. It appears folks think that changing generateRandomSessionId doesn't create too much overhead, so that might be the best solution.

How big do you think is the space in which the token needs to be random? While we might be sampling per pageview for AB tests in the future note that we are always testing per site, not globally. Thus is the per site numbers (not the global ones) that matter.

That's right (the >100 million I quoted above were per project too).

Believing the formula for n(p;H) from https://en.wikipedia.org/wiki/Birthday_attack#Mathematics (without having checked it myself):

I suggesting going with 80 bits, not so much because we are likely to encounter experiments with > 100 billion different values in the next couple of years, but because it leaves a bit more of a buffer in case a browser's implementation suffers from a lack of randomness. BTW, Krinkle said the following about that caveat in the code ("Its uniqueness should not be depended on ..."):

the number of browsers to which that applies is rapidly shrinking, both due to their global use, as well as due to our increasing requirements for Grade A.

Nuria added a comment.Aug 21 2018, 9:39 PM

but because it leaves a bit more of a buffer in case a browser's implementation suffers from a lack of randomness.

The crypto implementation does not suffer of lack of randomness, it is used for crypto after all (https://developer.mozilla.org/en-US/docs/Web/API/Web_Crypto_API). The lack of randomness comes from browsers that do not support the crypto API and thus need to use other js methods like math.random (which is not as random as you might expect in some older versions of safari, for example) that is really not a worry as old browsers with issues on that regard are not even on our A-support list which is what @Krinkle meant above.

Change 454355 merged by jenkins-bot:
[mediawiki/core@master] mediawiki.user: Rename stickyRandomId to getPageviewToken

https://gerrit.wikimedia.org/r/454355

Change 454359 merged by jenkins-bot:
[mediawiki/skins/MinervaNeue@master] getPageToken now defined in core

https://gerrit.wikimedia.org/r/454359

Change 454190 merged by jenkins-bot:
[mediawiki/extensions/EventLogging@master] Use core's mw.user.getPageviewToken() internally

https://gerrit.wikimedia.org/r/454190

Change 453077 abandoned by Jdlrobson:
New: add mw.eventLog.newPageInteractionToken()

Reason:
Looks like we don't need this right now as the linkInteractionToken needs more thought. This doesn't block the PageIssues schema so let's come back to this at a later time if needed.

https://gerrit.wikimedia.org/r/453077

Jdlrobson reassigned this task from Jdlrobson to Nuria.Aug 22 2018, 12:09 AM

We are in agreement to increase the length of mw.user.generateRandomSessionId to 80 bits
@Nuria will be changing all session identifiers for mediawiki and that same code will be used in the existing mw.user.getPageviewToken
Nothing left from web-side other than to review that patch when it arrives. Blocked until then.

Jdlrobson updated the task description. (Show Details)Aug 22 2018, 12:10 AM
Jdlrobson removed a project: Patch-For-Review.

Change 454460 had a related patch set uploaded (by Nuria; owner: Nuria):
[mediawiki/core@master] [WIP] Increase entropy on random token generator

https://gerrit.wikimedia.org/r/454460

Change 454435 had a related patch set uploaded (by Niedzielski; owner: Jdlrobson):
[mediawiki/extensions/Popups@master] Use getPageviewToken api

https://gerrit.wikimedia.org/r/454435

Change 451886 abandoned by Jdlrobson:
Update: reading depth to use mw.user.getPageviewToken()

Reason:
No longer needed ... done in https://gerrit.wikimedia.org/r/444600

https://gerrit.wikimedia.org/r/451886

Change 454460 merged by jenkins-bot:
[mediawiki/core@master] mediawiki.user: Increase entropy of random token generator

https://gerrit.wikimedia.org/r/454460

Jdlrobson reassigned this task from Nuria to Tbayer.Aug 23 2018, 10:18 PM

Skipping sign off. QA will be handled within T191532.
@Tbayer will sign this off during/after Friday's meeting.

Change 455040 had a related patch set uploaded (by Nuria; owner: Nuria):
[mediawiki/core@master] Small perf tweak for generateRandomSessionId

https://gerrit.wikimedia.org/r/455040

Change 455040 merged by jenkins-bot:
[mediawiki/core@master] mediawiki.user: Small perf tweak for generateRandomSessionId

https://gerrit.wikimedia.org/r/455040

Skipping sign off. QA will be handled within T191532.
@Tbayer will sign this off during/after Friday's meeting.

For the record: The token looked OK so far in Friday's testing session, but we found a host of other issues that needed fixing, so there will be another session later this week to complete the rubber ducking.

Change 457080 had a related patch set uploaded (by HaeB; owner: HaeB):
[mediawiki/core@master] Update documentation of getPageviewToken

https://gerrit.wikimedia.org/r/457080

Change 457080 merged by jenkins-bot:
[mediawiki/core@master] Update documentation of getPageviewToken

https://gerrit.wikimedia.org/r/457080

Tbayer updated the task description. (Show Details)Sep 4 2018, 1:46 PM
Tbayer updated the task description. (Show Details)Sep 4 2018, 2:08 PM
Tbayer removed Tbayer as the assignee of this task.Sep 4 2018, 2:17 PM

We did further checks of the token in PageIssues and ReadingDepth as part of the subsequent tests last week (in Firefox and Chrome), so I've signed off on the first three AC. I also just updated the documentation on the three pages listed in the fourth AC.

As far as I can see, the last AC, about link interaction tokens, is the only thing left to do: "Please DOCUMENT this in page previews codebase (Popups extension) as it's not correctly documented" - perhaps this can be done / signed off by someone more familiar with that part of the codebase and its documentation.

Jdlrobson updated the task description. (Show Details)Sep 4 2018, 5:13 PM

Copied remaining A/C to T203013

Niedzielski closed this task as Resolved.Sep 4 2018, 5:14 PM

@Niedzielski to go ahead and update docs as part of T203013. Closing this task per discussion! \o/

By the way, can we find out / keep track of which other schemas are now using this newly standardized pageview token?

It looks like it has been adapted in QuickSurveys (https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/QuickSurveys/+/454431/ / T204921).

By the way, can we find out / keep track of which other schemas are now using this newly standardized pageview token?
It looks like it has been adapted in QuickSurveys (https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/QuickSurveys/+/454431/ / T204921).

..and to NavigationTiming and ResourceTiming (correct, @Gilles ?)

That's correct! I'm in the process of unifying the schema column names to make joins between these tables less confusing: T204921: Rename EventLogging column surveyInstanceToken to pageviewToken in QuickSurveysResponses for consistency

It would be nice to be able to rename the column on older data, which is why I filed T204922: Rename column on old hive data for a few tables