The current algorithm converts the (randomly generated) event logging session ID to an integer and then checks whether it is divisible by N, where 1-in-N is the sampling rate we want. For example, a 0.5% rate is 1 in 200. This effectively means that we can't use any factor of N (e.g. 50, 10, 100, 40 for 200) for subsequent sampling, because every ID divisible by N is also divisible by each of its factors, so the second sample is not independent of the first. This was the case in the recent survey banner situation.
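To make the overlap concrete, here is a minimal sketch of the divisibility check. The helper name `inSampleByDivisibility` and the use of consecutive integers as stand-ins for converted session IDs are assumptions for illustration only, not the production code.

```javascript
// Hypothetical helper mirroring the divisibility check described above.
function inSampleByDivisibility(sessionInt, n) {
  return sessionInt % n === 0;
}

// Stand-in for converted session IDs: the integers 1..200000.
let selected200 = 0;     // IDs sampled at 1 in 200
let alsoSelected50 = 0;  // of those, IDs that also pass a 1-in-50 check

for (let id = 1; id <= 200000; id++) {
  if (inSampleByDivisibility(id, 200)) {
    selected200++;
    if (inSampleByDivisibility(id, 50)) {
      alsoSelected50++;
    }
  }
}

// Every ID in the 1-in-200 sample is also in the 1-in-50 sample.
console.log(selected200, alsoSelected50); // 1000 1000
```

Because 50 divides 200, a 1-in-50 check applied after the 1-in-200 check selects everyone who was already selected, rather than an independent 2% of sessions.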
We should try moving to seeded random number generation (e.g. https://commons.wikimedia.org/wiki/MediaWiki:Gadget-math.seedrandom.js), which allows us to set the seed to that same (randomly generated) session ID and then use the traditional method for getting random numbers between 1 and N. That gives us very easily understood sampling code:
  // assume the seed has been set to session ID
  function oneIn(N) {
    return Math.floor((Math.seededrandom() * N) + 1);
  }

  if (oneIn(200) == 1) {
    // selected for event logging
    if (oneIn(10) == 1) {
      // selected for A/B testing
      if (oneIn(2) == 1) {
        // selected for the control bucket
      } else {
        // selected for the test bucket
      }
    } else {
      // rejected from A/B testing, but still enrolled in EL
    }
  } else if (oneIn(50) == 1) {
    // rejected from EL but selected for survey banner
  } else {
    // rejected from EL and survey banner
  }
The logic would play out the same every time the page is refreshed as long as the user has the same session ID.
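A self-contained sketch of this idea follows. It does not use the linked gadget: as stand-ins it assumes an FNV-1a string hash to turn the session ID into a 32-bit seed and the small mulberry32 generator for the random stream. The names `hashSeed`, `makeSampler`, and `assign`, and the returned labels, are hypothetical and only there to make the example runnable.

```javascript
// Assumption: hash the session ID string to a 32-bit unsigned seed (FNV-1a).
function hashSeed(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// Assumption: mulberry32 as the seeded generator; returns floats in [0, 1).
function mulberry32(seed) {
  let a = seed | 0;
  return function () {
    a = (a + 0x6D2B79F5) | 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Build a oneIn(n) function whose sequence of draws is fixed by the session ID.
function makeSampler(sessionId) {
  const random = mulberry32(hashSeed(sessionId));
  return function oneIn(n) {
    return Math.floor(random() * n) + 1; // integer in 1..n
  };
}

// Same decision tree as above, returning a label instead of acting on it.
function assign(sessionId) {
  const oneIn = makeSampler(sessionId);
  if (oneIn(200) === 1) {
    if (oneIn(10) === 1) {
      return oneIn(2) === 1 ? 'el+ab:control' : 'el+ab:test';
    }
    return 'el';
  }
  if (oneIn(50) === 1) {
    return 'survey';
  }
  return 'none';
}

// The same session ID always lands in the same bucket.
console.log(assign('example-session-id') === assign('example-session-id')); // true
```

Each call to `oneIn` consumes the next value of the seeded stream, so the 1-in-50 survey draw is a separate draw from the 1-in-200 event logging draw, while the whole sequence stays reproducible for a given session ID.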