T135762 A/B Testing solid framework

Status	Subtype	Assigned	Task
Open	Release	None	T84936 Release VisualEditor-MediaWiki as "1.0"
Open		None	T50429 [Epic] Support editing parts of a page in VisualEditor-MediaWiki
Open		None	T54365 Explore performance gains from progressive (JIT?) de-alienation in VisualEditor
Open		None	T174303 Copy-pasting linked ISBN numbers from view mode HTML into VisualEditor inserts wikitext links to Special:BookSources (it should turn them into magic links?)
Open	Feature	None	T54091 The read HTML should have hinting to allow full DOM copying (as opposed to just rich copying) from read mode into VE surfaces
Open		None	T55784 [EPIC] Use Parsoid HTML for all page views
Resolved		dr0ptp4kt	T114542 Next Generation Content Loading and Routing, in Practice
Duplicate		• Jhernandez	T104432 [EPIC]: Improve mobile site performance
Duplicate		dr0ptp4kt	T120341 [GOAL] Make Wikipedia more accessible to all connections with new fast API-driven web experience in mobile web beta
Declined		None	T125920 [EPIC] Future exciting reading web performance endeavours
Resolved		Jdlrobson	T113066 [GOAL] Make Wikipedia more accessible to 2G connections
Resolved		Jdlrobson	T124390 [GOAL] Load images with care
Declined		BBlack	T127883 Enable lazy loaded images for 50% of users in production
Declined		BBlack	T135762 A/B Testing solid framework
Resolved		• Nuria	T143694 Preliminary Design document for A/B testing

Restricted Application added subscribers: Zppix, Aklapper. · May 19 2016, 6:57 PM

• Nuria edited projects, added Analytics; removed Analytics-Kanban. May 19 2016, 6:57 PM

From irc conversation in wikimedia-operations w/ @Nuria, @Krinkle, and myself. This is the varnish-level pseudo-code proposed (ignore arbitrary names and constant integers, to bikeshed later):

receive_request {
  if (hasCookie("SampleBucket")) {
    Bucket = getCookie("SampleBucket")
  }
  else {
    SB = rand(0,1000);
    setResponseCookie("SampleBucket", SB);
    Bucket = SB;
  }
  if (Bucket >= 100 && Bucket < 150) { // 5% (bins 100-149)
    setApplicationRequestHeader("X-Feature3: on");
  }
}

hash_request { // this splits varnish cache
  if (Bucket >= 100 && Bucket < 150) { // 5% (bins 100-149)
    split_cache_for("X-Feature3");
  }
}

And then we need to template this out so that we can have declarative puppet data assigning test bin ranges to experiments, without modifying actual VCL code directly:

experiments => {
    'X-Feature3' => [100, 150],
}

And then the application code turns the feature on for requests that have the X-Feature3: on header, and doesn't if they don't.

Restricted Application added a project: SRE. · May 19 2016, 8:18 PM

Looks great, one minor nit: rather than having distinct cookies per feature we can have one weblab cookie that contains all features and bucketing for those.

"Set cookie Weblab:"feature-X=A""

Note that for an experiment like lazy loading, where we might want to test lazy loading on 10% of the user base, we want to split that 10% into two groups: 5% of users belong to group A (lazy loading on) and 5% belong to group B (lazy loading off). Thus the cookie needs to contain whether the user belongs to group A or B.
This is important because we do not want to compare the users with lazy loading enabled to the user base at large, as we might run into those users being part of other experiments.

This grouping is important to analyze the data, the application layer just cares whether to turn a feature on.

The analytics will care that for every experiment there is a control group.

Other nits and notes:

Don't send the cookie to the applayer, just the feature header
Validate the cookie's value, clear+reset if invalid
Block the feature headers coming directly from the client (client can't self-select a feature accidentally, only Bin# can)
The way the cache-split and mechanism works, it's legal to create overlapping bins. We can test A-only 5%, B-only 5%, and A+B 5% features by assigning bins 100-200 to featureA and 150-250 to featureB.
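
The overlapping-bin note above can be sketched in Python. This is only an illustration, assuming 1000 bins and half-open ranges [start, end); `FEATURE_BINS`, `parse_bin`, and `features_for_bin` are hypothetical names, not part of the proposed VCL.

```python
# Hypothetical bin ranges [start, end) out of 1000 bins; names are illustrative.
FEATURE_BINS = {
    "featureA": (100, 200),
    "featureB": (150, 250),
}


def parse_bin(cookie_value, total_bins=1000):
    """Return the bin as an int, or None if the cookie value is invalid.

    A None result means the caller should clear the cookie and assign a new bin.
    """
    if cookie_value is None or len(cookie_value) > 4:
        return None
    if not (cookie_value.isascii() and cookie_value.isdigit()):
        return None
    bin_number = int(cookie_value)
    return bin_number if bin_number < total_bins else None


def features_for_bin(bin_number):
    """Return the sorted names of all features whose range contains this bin."""
    return sorted(
        name
        for name, (start, end) in FEATURE_BINS.items()
        if start <= bin_number < end
    )
```

With these ranges, bins 100-149 get featureA only, bins 150-199 get both, and bins 200-249 get featureB only, which is 5% of 1000 bins each.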

nshahquinn-wmf subscribed. May 19 2016, 8:56 PM

Better pseudo-code, after more conversation:

Data structure (which we update as we add/remove experiments):

experiments => { # 100 total bins to use: 0-99
    'FeatureX/A' => [0, 4], # 5% of all users
    'FeatureX/B' => [5, 9], # 5% of all users
    'FeatureY/A' => [10, 14], # 5% of all users
    'FeatureY/B' => [15, 19], # 5% of all users
}

Code which is generated based on that data:

receive_frontend_request {
  // Static code:

  unsetHeader("X-Weblab");

  if (hasCookie("SampleBucket")) {
    // we'll also validate that the value looks sane, and
    // treat it as non-existent if it's not valid
    Bucket = getCookie("SampleBucket")
  }
  else {
    Bucket = rand(0,100);
    setResponseCookie("SampleBucket", Bucket);
  }

  // templated out automatically from 'experiments' data above:
  if (Bucket >= 0 && Bucket <= 4) {
    setRequestHeader("X-Weblab", "FeatureX/A");
  } else if (Bucket >= 5 && Bucket <= 9) {
    setRequestHeader("X-Weblab", "FeatureX/B");
  } else if (Bucket >= 10 && Bucket <= 14) {
    setRequestHeader("X-Weblab", "FeatureY/A");
  } else if (Bucket >= 15 && Bucket <= 19) {
    setRequestHeader("X-Weblab", "FeatureY/B");
  }
}

hash_request { // splits varnish cache, at all layers (not just frontend)
  hash_data(getHeaderValue("X-Weblab"));
}

analytics_deliver {
   appendToHeader("X-Analytics", ";weblab=" + getHeader("X-Weblab"));
}

If the setRequestHeader bits didn't use else-if, you could have overlapping buckets with multiple features in play too, but this seems simpler for the moment and doesn't require setting or parsing multi-value X-Weblab: headers for the application, nor multi-value ;weblab= for X-Analytics.
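
As a rough illustration of the templated else-if chain, here is a Python sketch that maps a bucket to a single X-Weblab value from the experiments data. It assumes inclusive [first, last] ranges as in the data structure above; `weblab_for_bucket` and `check_no_overlap` are hypothetical helpers, not existing code.

```python
# Experiment data mirroring the hash above: inclusive [first, last] bins.
EXPERIMENTS = {
    "FeatureX/A": (0, 4),
    "FeatureX/B": (5, 9),
    "FeatureY/A": (10, 14),
    "FeatureY/B": (15, 19),
}


def check_no_overlap(experiments):
    """Raise ValueError if any two inclusive ranges share a bin."""
    ranges = sorted(experiments.values())
    for (_, prev_last), (next_first, _) in zip(ranges, ranges[1:]):
        if next_first <= prev_last:
            raise ValueError("overlapping experiment bins")


def weblab_for_bucket(bucket, experiments=EXPERIMENTS):
    """Return the X-Weblab value for a bucket, or None if it is in no experiment."""
    for label, (first, last) in experiments.items():
        if first <= bucket <= last:
            return label
    return None
```

Rejecting overlapping ranges up front keeps the single-value header assumption honest: each request carries at most one experiment label.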

Another complication that didn't come up in conversation earlier: what about domain names for these cookies? They'll be getting binned independently for every language by default. If we set wikipedia.org-level cookies, it's per-project at least.

Another thought: with the above code, I've intentionally set it so that both sides of the experiment will initially share an empty cache split (they'll have to populate their own cache entries from scratch at the start of the experiment). Turning this on for 10% of the userbase for a feature will be notable in cache hitrates overall (spike appserver load slightly). It's probably fine for a single case, but we'll have to be wary of turning too many features off and on too rapidly and effectively invalidating too much cache all at once.

Jdlrobson subscribed. May 19 2016, 9:38 PM

dr0ptp4kt subscribed. May 19 2016, 10:22 PM

• Tbayer subscribed. May 20 2016, 2:45 AM

phuedx subscribed. May 20 2016, 8:13 AM

leila awarded a token. May 20 2016, 10:30 AM

leila subscribed.

@BBlack: If Varnish is the part of the stack where this is to be done, have you taken a look at libvmod-abtest? If so, what did you think?

In T135762#2312406, @phuedx wrote:

@BBlack: If Varnish is the part of the stack where this is to be done, have you taken a look at libvmod-abtest? If so, what did you think?

1. It hasn't been updated in over 3 years, and probably isn't varnish4-ready.
2. The configuration update mechanism is via restricted PUT/DELETE to the servers (which we'd have to replicate to a lot of servers).
3. It doesn't document how it really works.
4. Looking at the code/examples to infer how it works, it doesn't seem to do what we want, at least not completely. It would make up some randomized per-client cookie values, but it's still up to us to solve the integration, first-hit, and analytics problems, etc.
5. On the upside, one thing it has that our pseudo-code lacks so far is keying a test on less than all possible request keys (e.g. by hostname or URL regex). If we know a test can only possibly affect certain portions of traffic and not others (e.g. it's limited to /w/api.php, or to /api/rest_v1, etc...), having that capability would let us further reduce the cache impact of turning testing on and off (because they would continue to share the global cache on unaffected paths/hostnames).

We could add (5) to our data model fairly easily, though.
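
A minimal Python sketch of what limiting an experiment by URL regex might look like in the data model. The `uri_regex` field and the helper names are assumptions for illustration; the real implementation would live in templated VCL.

```python
import re

# Hypothetical data model: inclusive bin range plus an optional URI pattern.
EXPERIMENTS = {
    "FeatureX/A": {"bins": (0, 4), "uri_regex": r"^/w/api\.php"},
    "FeatureX/B": {"bins": (5, 9), "uri_regex": r"^/w/api\.php"},
    "FeatureY/A": {"bins": (10, 14)},  # no pattern: applies to all URIs
}


def weblab_for_request(bucket, uri, experiments=EXPERIMENTS):
    """Return the experiment label only if both the bin and the URI match."""
    for label, spec in experiments.items():
        first, last = spec["bins"]
        pattern = spec.get("uri_regex")
        if first <= bucket <= last and (pattern is None or re.search(pattern, uri)):
            return label
    return None


def cache_key(uri, bucket):
    """Build a cache key; requests outside any experiment share the default entry."""
    return (uri, weblab_for_request(bucket, uri) or "")
```

Because the label is empty for unaffected paths, a client in an experiment bin still hits the shared cache entry for every URI the experiment does not cover.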

This all sounds great and I love that it's generic and can be reused again!

A few clarifications - if I'm understanding correctly experiments would be configured in puppet? Would it be possible to configure an experiment across all wikis?

What's a realistic timeline to get such a thing in place?

Quick question: how does this guarantee bucketing across browser restart?

Non-session cookies are kept across browser restarts; with an expiration of 30 days (like the last-access cookie), the cookie remains available.

I should note in the concrete cases: persistence across browser restart is probably not as important for lazy loaded images, whereas persistence across browser restart is important for Hovercards.

@Nuria, should we add that to the Description as acceptance criteria?

In T135762#2312919, @Jdlrobson wrote:

A few clarifications - if I'm understanding correctly experiments would be configured in puppet?

That would be the easiest way, yes. I don't think it's a problem for this kind of use-case, as every experiment needs discussion and planning anyway, so reviewing the change through gerrit and waiting on the 30 minute deploy shouldn't be a big hindrance.

Would it be possible to configure an experiment across all wikis?

That part is tricky because of the domain-limitations of cookies. The bins are assigned persistently by-cookie, and a singular cookie wouldn't be valid across project domains (e.g. wikimedia.org + wikipedia.org, or wikiversity.org + wiktionary.org). So users would get binned differently on different domains, which doesn't map 1:1 with wikis. For the language-hostname projects, we can set a bin-cookie for all languages within one project domain. For all the wikis under wikimedia.org, we can set a shared bin-cookie for that domain.

The experiment could/would be configured for all projects/languages (or not on a case-by-case basis, if we implement limiting experiments by URI regex). It's just that a given client's binning wouldn't be consistent across them. They might be in the experimental feature bin for wikipedia.org, in the control group for wiktionary.org, and in neither group on wikiversity.org. The analytics headers would still map it out correctly though on a per-request basis, and you'd still have the right percentages.

Where we get into trouble on this, potentially, is project-shared URIs, like ResourceLoader (which is special, the caching for it is shared cross-project/domain). I'm not sure how we solve that problem for experiments that affect RL content. There's probably a way.

What's a realistic timeline to get such a thing in place?

I don't know yet. We have a lot of things going on in parallel right now in general. I can push some other things aside and try to get it ready "soon", but I can't quantify "soon" yet.

In T135762#2312939, @dr0ptp4kt wrote:

Quick question: how does this guarantee bucketing across browser restart?

In T135762#2312967, @Nuria wrote:

Non session cookies are kept after browser restarts, with an expiration set of 30 days (like last access cookie) the cookie is available.

Since the binning is done independently of actual experiments (the binning is live all the time for all cookie-enabled agents), this actually is a problem, I think. We probably want the binning cookies to persist for a very long time, or else agents will move in and out of a binned experimental set during even short-term experiments (within browser-cache-life of loaded resources?).

> Since the binning is done independently of actual experiments (the binning is live all the time for all cookie-enabled agents), this actually is a problem, I think. We probably want the binning cookies to persist for a very long time, or else agents will move in and out of a binned experimental set during even short-term experiments (within browser-cache-life of loaded resources?).

It seems to me that any methodology we use (and this includes cookies, local storage, and even unique tokens) is subject to dropping some users due to:

cookies getting deleted

users using incognito mode (expires cookies early, restricts usage of local storage)

cookies expiring

I think that statistically (given the size of our samples) this should not be a problem as long as you have a large enough sample size to measure an effect. Keep in mind that we also have data on how many of the "so-called-user-requests" come without any cookies whatsoever.

Right, but if the user deletes cookies or goes incognito, that's probably a rare event for most, and possibly associated with not re-using browser cache across the barrier, either. If we set 30-day cookies every time we see a browser that lacks the cookie, the expiries (which result in assigning a new random bin) across all users will be constantly cascading. Even in a 1-day experiment window, we'll see lots of users shifting bins randomly even in the far future.

In the nearer-term, we'd see significant re-binning right on the 30 day boundary from when we first turned this on. It would take quite some time before the 30-day renewals naturally spread themselves evenly and we get down to only the sort of problem described above. We could randomize the expiries to lessen that initial effect, but it wouldn't solve the longer-term issue.

Long-term cookie strategies we could take that would mitigate that effect:

1. We could send insanely-long expiries (many years). With only 100 bin slots, I don't think we're really causing any tracking/privacy harm here.
2. We could set pseudo-randomized shorter expiries (say 15-45 day window) and try to randomly refresh the cookie for 1/N requests from clients with existing cookies (so we're not constantly re-setting it, but active clients have a good chance of persisting the cookie so long as they don't go dormant from our site for a long period). It's more complicated than (1) though.
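
The second strategy (randomized expiries plus an occasional refresh) could look roughly like this in Python. The value of N and the function names are assumptions for illustration only.

```python
import random

MIN_DAYS, MAX_DAYS = 15, 45   # expiry window from the comment above
REFRESH_ONE_IN = 20           # hypothetical N for the 1/N refresh


def cookie_max_age(rng=random):
    """Pick a pseudo-randomized cookie lifetime, in seconds, within the window."""
    return rng.randint(MIN_DAYS, MAX_DAYS) * 86400


def plan_set_cookie(existing_bin, rng=random, total_bins=100):
    """Return (bin, max_age) if a Set-Cookie should be sent, else None.

    New clients always get a cookie; clients that already have one keep their
    bin and only get the expiry refreshed on roughly 1 in N requests.
    """
    if existing_bin is None:
        return rng.randrange(total_bins), cookie_max_age(rng)
    if rng.randrange(REFRESH_ONE_IN) == 0:
        return existing_bin, cookie_max_age(rng)
    return None
```

The refresh never changes the bin, so an active client stays in the same group for as long as it keeps visiting.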

I think 1-100 is too small. Especially considering the amount of traffic we have, and considering most of our experiments will not have been load tested very much, and tend to have instrumentation of some kind.

For initial phases of a feature, 0.1% is already more than enough. In fact, I would like to recommend an experiment never going above 1%. If we're ready to cross 1%, we might as well flip the default and go 100% through normal cache rollover.

For comparison, our entire Navigation Timing data used to be based on only 0.01% sampling. It is now turned up to 0.1% (1:1000 sample; $wgNavigationTimingSamplingFactor). And that's not even experimental. That's the comfortable upper limit of what we can responsibly handle given current load capacity of EventLogging/statsd/Graphite.

To address @BBlack's cache hit concern, I'd like us to consider applying the cache buckets to all users (as proposed) but generally only ever using a small portion of it for experiments (e.g. only 1-100 out of 0-10,000, so only 1%).

An experiment could start in 1 bucket (0.01%) and work its way up to 10 (0.1%). And if the experiment has load concerns, then once any verbose logging is disabled, we could continue the experiment and bump it to cover all of the 100 experiment buckets (1%; 100 out of 10,000), which would make it overlap with other experiments potentially and is a good field trial before enabling by default (which would also naturally make it apply to other experiments).

In this proposal, the other 99% (i.e. 101-10,000 or something) would always be left alone and get normal cache behaviour.

In T135762#2313020, @Krinkle wrote:

[...]
An experiment could start at 1 bucket (0.01%) and work its way up to 10 (0.1%). And if the experiment no longer has verbose logging and has load concerns, we can continue an experiment and bump it to cover all of the 100 experiment buckets (out of 10,000), which would make it overlap with other experiments potentially.

In this proposal, the other 99% (101-10,000) would always be left alone and get normal cache behaviour.

Well, we could do this sort of thing without expanding the bin number range (avoiding its potential use for tracking-correlation), I think. For instance, when assigning a fresh binning cookie, we could do rand(0,10000), and then set the cookie as an actual number for 0-99, and simply set the value '100' for the range 100-10000 (and we ignore the 100 value for cache-hashing and don't send it to analytics/application, etc). So only 1% of the userbase ever gets a unique bin number, and the rest all get the same value just to suppress assigning them a new one.
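
A small Python sketch of that sentinel idea, assuming a draw over 10,000 slots of which only 0-99 become real bin numbers. The constant and function names are hypothetical.

```python
import random

EXPERIMENT_BINS = 100   # bins 0-99 are usable for experiments
TOTAL_SLOTS = 10000     # size of the random draw (assumed: 0..9999)
SENTINEL = 100          # shared value meaning "not in any experiment bin"


def new_cookie_value(rng=random):
    """Draw a slot; about 1% of clients get a real bin, the rest get the sentinel."""
    draw = rng.randrange(TOTAL_SLOTS)
    return draw if draw < EXPERIMENT_BINS else SENTINEL


def hash_component(cookie_value):
    """Return what the bin adds to the cache hash; the sentinel adds nothing."""
    return "" if cookie_value == SENTINEL else str(cookie_value)
```

Since the sentinel contributes an empty hash component, the roughly 99% of clients holding it keep sharing the normal cache, and the cookie only serves to stop them being re-binned.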

Keeping the sample sizes smaller (1% or less of total users) really does help a lot not only with backend load concerns, but also cache invalidation concerns at the varnish level.

> For comparison, our entire Navigation Timing data used to be based on 0.01% sampling. It is now tuned up to 0.1% (1:1000 sample; $wgNavigationTimingSamplingFactor). And that's not even experimental. That's the comfortable upper limit of what we can responsibly handle given current load capacity of EventLogging/statsd/Graphite.

I know you know this, but just to clarify: we do not have these restrictions here; the restrictions come from TTLs on varnish, not EL in any way.

I'd also like to note that 30 days of experimenting is a fair interval. Note that you couldn't see an effect of an experiment on pageviews after 60 days, because we discard distinct requests data and x-analytics headers after that period.

In T135762#2313083, @Nuria wrote:

I know you know this but just clarifying that we do not have these restrictions here though, the restrictions come from TTLs on varnish not EL in any way.

Yes, but some features may be less-cacheable than others, and even the cacheable ones' rare cache misses could cause horrible perf fallout elsewhere in the clusters. For something generic to all experiments in the future, we really don't know how big the impact will be per experimental user.

I also like to note that 30 days of experimenting is a fair interval. Note that you couldn't see an effect on pageviews of an experiment after 60 days cause we discard distinct requests data and x-analytics headers after that period.

I agree that experiments falling within 30-60 days max seems like a reasonable limit. The problem is that in the current design the binning cookies are independent of the stop/start of any given experiment. They are set and persist all the time, even when no experiments are running. So even during a short experiment, there's a constant cascade of fixed cookie expirations happening mid-experiment, re-binning users (edited to add: if we don't use very-long-term cookies).

It's possible we can come up with a way to turn them on dynamically as each experiment rolls in and keep them non-overlapping (and keep the percentages fairly accurate), but I don't yet see how without introducing other problems off-hand. Would have to think harder about that one first.

• Nuria moved this task from Incoming to Radar on the Analytics board. May 23 2016, 4:28 PM

I have a quick suggestion to make this play nice with client-side sampling. First, the problem:

Some Event Logging instrumentation randomly only sends back data for some percentage of its users. If this random sample ends up overlapping with a bucket that has a certain experiment applied, that experiment might disproportionately influence the Event Logging data.

To fix this, we could make a bucket for each sampling rate we want to instrument at (10%, 25%, etc.) and set a flag that the client could use to turn data collection on. This way, we could keep data collection and experiments separate if we want, or not, but we'd have more control.

Jdlrobson added a parent task: T127883: Enable lazy loaded images for 50% of users in production. May 26 2016, 1:21 AM

ori moved this task from Inbox, needs triage to Radar on the Performance-Team board. Jun 6 2016, 5:37 PM

Jan_Dittrich awarded a token. Jun 29 2016, 6:19 AM

MoritzMuehlenhoff triaged this task as Medium priority. Jul 8 2016, 10:12 AM

elukey subscribed. Jul 12 2016, 4:31 PM

• ema subscribed. Jul 22 2016, 3:16 PM

Noting from last meeting about this: We've tentatively said we'll try to make this (implementing a robust A/B test infrastructure at the Varnish level) an Ops/Traffic goal next quarter (Q2 16-17), alongside our obvious likely primary goal of text cluster conversion to Varnish 4. This has been pending for quite a while in the background, and it's important to get some real time dedicated to this soon. As the planning for next quarter approaches of course we'll have to evaluate this versus other competing desired goals.

Seconding @BBlack. We will make this a shared goal between the traffic and analytics teams.

What's the rationale for prioritizing it?

It's a seasonal issue that's come up every few months for the past couple of years. Every time we need to run an A/B test, we go back through the same conversation about how hard (or not) it will be to implement which hacky solution for a single feature's testing directly in VCL, and how those tradeoffs work. Things like long-term vs short-term, whether it affects initial site visit from a new browser/user, whether it's per-request or per-(device? user? IP?), whether it will be statistically valid, whether it will cause cache fragmentation performance issues when it's turned on or off, etc. My rationale for prioritizing this is we shouldn't be asking all these basic questions and hacking directly on VCL every time we want to test a feature with unknown impact - it wastes a lot of time and energy, and impacts the ability of feature developers to get an accurate estimate of the impact of their changes in a reasonable amount of time and effort (and blockage on ops).

@BBlack: I volunteer to write a design doc with use cases /high level design ideas and issues by the end of this quarter so we can use it to scope the work we will need to put towards this project next quarter.

@Nuria - Thanks, sounds awesome :)

• ellery subscribed. Jul 26 2016, 7:26 PM

@BBlack, @Nuria
In order to run a randomized controlled experiment, you need to ensure that users are randomly assigned to treatment conditions at the start of every experiment and that they remain in their treatment group for the entire duration of the experiment.

You called out the second point earlier in the thread, noting that the proposed scheme of using cookies with a short expiry time will lead to users moving between treatment conditions within an experiment. For experiments on logged-in users, taking a hash of the user id and a "per-experiment salt" for treatment assignment is probably best. For experiments on "readers", we need to think carefully about how to minimize this problem and talk to other organizations that do AB testing without a user log-in.
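
For the logged-in case, hashing the user id with a per-experiment salt could be sketched as follows in Python. The function name, the salt format, and the choice of SHA-256 are assumptions; nothing in this thread specifies them.

```python
import hashlib


def assign_treatment(user_id, experiment_salt, treatments=("control", "treatment")):
    """Deterministically map a user to a treatment for one experiment.

    The same user always lands in the same group for a given salt, while a
    different salt reshuffles users independently of earlier experiments.
    """
    key = f"{experiment_salt}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(treatments)
    return treatments[index]
```

No state needs to be stored per user: the assignment can be recomputed on every request, and it stays fixed for the duration of the experiment.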

As far as I can tell, the proposed method also violates the more important property that users need to be randomly assigned to treatment conditions for each experiment. Instead, experiments are assigned to groups of users, which were partitioned at random. This means that past experiments are likely to interfere with future experiments.

For example, say we assign users at random to groups X and Y. In experiment 1, we assign X to treatment A and Y to treatment B. X and Y are statistically indistinguishable, so we can trust the results of this experiment. But if we want to run another experiment, we have the problem that group X is systematically different from Y due to the impact of experiment 1. This means we cannot trust the results from experiments 2, 3, etc.

In T135762#2497082, @ellery wrote:

As far as I can tell, the proposed method also violates the more important property that users need to be randomly assigned to treatment conditions for each experiment. Instead, experiments are assigned to groups of users, which were partitioned at random. This means that past experiments are likely to interfere with future experiments.

For example, say we assign users at random to groups X and Y. In experiment 1, we assign X to treatment A and Y to treatment B. X and Y are statistically indistinguishable, so we can trust the results of this experiment. But if we want to run another experiment, we have the problem that group X is systematically different from Y due to the impact of experiment 1. This means we cannot trust the results from experiments 2, 3, etc.

I think we could minimize that effect by introducing another layer of abstraction above the basic binning. For instance, we could do long-term stable binning of all users into, say, 1000 groups (0.1% each), and then select a random subset of available (not in use by other concurrent testing) small bins to make up the set of users for a given test. E.g. a test that needs 5% of the anonymous userbase (2.5% affected + 2.5% control) randomly picks 50/1000 of the persistent bins at the start, to be statically assigned to it and no other test for the duration of that test.
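
A rough Python sketch of that extra layer, assuming 1000 persistent bins and an even treatment/control split. `allocate_bins` is a hypothetical helper, not existing code.

```python
import random

TOTAL_BINS = 1000  # long-term stable bins, 0.1% of users each


def allocate_bins(in_use, needed, rng=random):
    """Randomly pick `needed` free bins and split them into treatment and control.

    `in_use` is the set of bins already held by concurrent tests.
    """
    available = [b for b in range(TOTAL_BINS) if b not in in_use]
    if needed > len(available):
        raise ValueError("not enough free bins")
    chosen = rng.sample(available, needed)
    half = needed // 2
    return set(chosen[:half]), set(chosen[half:])
```

For example, a test needing 5% of users would call `allocate_bins(in_use, 50)` and receive 25 treatment bins and 25 control bins, none shared with any running test.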

For experiments on "readers", we need to think carefully about how to minimize this problem and talk to other organizations
that do AB testing without a user log-in.

@ellery, in our case we have to work with 'regular' cookies: unlike other organizations, we have neither logged-in users nor sessions (in the case of readers), so the comments thus far refer to the "readers" use case. Once you have logged-in users, things are a lot easier because the code can be more deterministic.

> For example, say we assign users at random to groups X and Y. In experiment 1, we assign X to treatment A and Y to treatment B. X and Y are statistically indistinguishable, so we can trust the results of this experiment. But if we want to run another experiment, we have the problem that group X is systematically different from Y due to the impact of experiment 1. This means we cannot trust the results from experiments 2, 3, etc.

Hmm... I do not think this happens in practice in our case if a bucket represents control and treatment for 1 experiment. Let me explain: we do not have many concurrent experiments. In our proposed scheme we partition the user base into, for example, 100 buckets, which means that we can run 100 experiments at any one time (a bucket will have control and treatment for 1 experiment), which is about 1 order of magnitude more than what we really need.
So it will take us a while to recycle buckets: longer than it takes users to move across buckets due to cookie expirations/removals, which is an unavoidable issue whose impact we need to quantify.

In T135762#2497291, @BBlack wrote:

In T135762#2497082, @ellery wrote:

As far as I can tell, the proposed method also violates the more important property that users need to be randomly assigned to treatment conditions for each experiment. Instead, experiments are assigned to groups of users, which were partitioned at random. This means that past experiments are likely to interfere with future experiments.

For example, say we assign users at random to groups X and Y. In experiment 1, we assign X to treatment A and Y to treatment B. X and Y are statistically indistinguishable, so we can trust the results of this experiment. But if we want to run another experiment, we have the problem that group X is systematically different from Y due to the impact of experiment 1. This means we cannot trust the results from experiments 2, 3, etc.

I think we could minimize that effect by introducing another layer of abstraction above the basic binning. For instance, we could do long-term stable binning of all users into, say, 1000 groups (0.1% each), and then select a random subset of available (not in use by other concurrent testing) small bins to make up the set of users for a given test. E.g. a test that needs 5% of the anonymous userbase (2.5% affected + 2.5% control) randomly picks 50/1000 of the persistent bins at the start, to be statically assigned to it and no other test for the duration of that test.

I agree with @ellery that this is a problem, and I like this solution. @Nuria may be right that with the current way we run experiments this problem may not happen often, but I think that's very likely to change once we have this framework in place.

• AlexMonk-WMF mentioned this in T135478: Provide mechanism to A/B test VE vs. WT as default editor for a proportion of IPs. Jul 28 2016, 9:56 PM

greg subscribed. Jul 28 2016, 10:01 PM

@Nuria, @BBlack
I need to clarify that in the example that I gave above, the experiments were not run concurrently, but in sequence.

@Nuria
I'm confused by your statement "a bucket will have control and treatment for 1 experiment". I thought that a bucket represents a group of users that get assigned to either the treatment or the control.

Another issue, independent of proper randomization, is that for most use cases the data produced by the system cannot be used for statistical testing. Let me give an example:

Say we want to change the edit button and see if the change results in more clicks. Whenever the button is shown or clicked, there will be a logging call to that effect, which includes the condition the user is in.

If you want to test whether the new button leads to more clicks, you might compare the ratios of clicks/impressions for the two groups. The problem is, however, that your "observations", the impressions, are not independent, because subsets of them are generated by the same users, and so you cannot do proper statistical hypothesis testing, which assumes independence of your observations.

In order to do statistical testing, you would need to compare the fraction of users who clicked the button at least once or the average number of clicks per user between the groups. In other words, you need to be able to group impressions by user. I can think of only 2 ways to do this:

use a unique token per user or a unique token per user, per experiment.

ensure that a user is in an experiment for only a single pageview, which is very restrictive.
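
To make the per-user comparison concrete, here is a small Python sketch that computes the fraction of users who clicked at least once in each group. It assumes every logged event carries some per-user token, which is exactly the mechanism under discussion; the event shape and names are hypothetical.

```python
from collections import defaultdict


def per_user_click_rate(events):
    """Return {group: fraction of users with at least one click}.

    `events` is an iterable of (token, group, kind) tuples, where kind is
    "impression" or "click" and token identifies the user.
    """
    clicked = defaultdict(dict)  # group -> {token: has_clicked}
    for token, group, kind in events:
        users = clicked[group]
        users[token] = users.get(token, False) or kind == "click"
    return {group: sum(users.values()) / len(users) for group, users in clicked.items()}
```

Each user counts once per group regardless of how many impressions or clicks they generate, so the observations being compared are users rather than correlated events.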

• Elitre subscribed. Aug 2 2016, 9:34 AM

• Nuria created subtask T143694: Preliminary Design document for A/B testing. Aug 23 2016, 4:24 PM

The problem is, however, that your "observations", the impressions, are not independent, because subsets of them are generated by the same users, and so
you cannot do proper statistical hypothesis testing, which assumes independence of your observations.

@ellery: I think there might be several incorrect assumptions here, and perhaps it is best to clarify these over a design doc rather than phabricator comments.

(edited)

Differences can be attributed to a change in parameters if treatments are assigned to users completely at random. And in this case, they are, buckets are randomly assigned. Let's talk about this some more if you still see an issue.

I think you are speaking of events not being truly independent, as we are measuring an action by a user more than once, so events are not "random". Even in this case you can empirically calculate the variability of your metrics and measure experiment results. I think that is standard industry practice, as experiments are most commonly segmented by user to provide a consistent UI experience.

Some info on this, see 5.2 section:
http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/36500.pdf

Let's take further discussion to the design doc and come back to ticket once we have a more refined design:

https://docs.google.com/document/d/1jRGjVAthJXoCovxyvXWyg07R1POb8zvD_n8IlJXrPVM/edit#

mpopov subscribed. Sep 15 2016, 4:10 PM

• HJiang-WMF subscribed. Sep 16 2016, 11:38 PM

In T135762#2506770, @ellery wrote:

In order to do statistical testing, you would need to compare the fraction of users who clicked the button at least once or the average number of clicks per user between the groups. In other words, you need to be able to group impressions by user. I can think of only 2 ways to do this:

use a unique token per user or a unique token per user, per experiment.

ensure that a user is in an experiment for only a single pageview, which is very restrictive.

@ellery, I talked about this with @mpopov Friday, and he told me that Discovery uses unique tokens as standard practice in their experiments. (They set an experiment-specific cookie and log the token to Event Logging along with the rest of the data.) Are you concerned that unique tokens aren't acceptable under our privacy practices (apparently they are), or do you mean that the A/B testing framework should provide unique tokens out of the box?

@Neil_P._Quinn_WMF I'm saying that for any online A/B test you need to be able to group the experimental data by user. The proposed framework does not provide a mechanism to do this. It is great that Discovery uses a per-experiment unique user token to do user-level grouping. The system that fundraising uses does not do this, leading to many false positive test results.

@Nuria I certainly don't disagree that segmentation must be done at the user level. I'm saying that the test statistics (or metrics as you are calling them) also need to be computed at a user level (i.e. compare the average number of clicks per user between treatment and control instead of just comparing the number of clicks across all users between treatment and control). To do this, there needs to be some way of grouping data by user in each experiment. The current proposal is missing a mechanism to achieve this.
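To make the distinction concrete, here is a minimal sketch of user-level aggregation. It assumes each logged event carries a per-experiment user token, a bucket name, and a clicked flag; these field names and the functions `clicks_per_user` and `summarize` are hypothetical and do not correspond to any existing EventLogging schema. It is written in Python for brevity only.

```python
from collections import defaultdict
from statistics import mean


def clicks_per_user(events):
    """Aggregate raw events into click counts per user, grouped by bucket.

    `events` is an iterable of (user_token, bucket, clicked) tuples.
    These fields are hypothetical, not an existing logging schema.
    """
    per_bucket = defaultdict(lambda: defaultdict(int))
    for token, bucket, clicked in events:
        # Every impression registers the user, even if it produced no click.
        per_bucket[bucket][token] += 1 if clicked else 0
    return {bucket: dict(users) for bucket, users in per_bucket.items()}


def summarize(events):
    """Per bucket: number of users, mean clicks per user, and the
    fraction of users who clicked at least once."""
    summary = {}
    for bucket, users in clicks_per_user(events).items():
        counts = list(users.values())
        summary[bucket] = {
            "users": len(counts),
            "mean_clicks_per_user": mean(counts),
            "fraction_clicked": sum(1 for c in counts if c > 0) / len(counts),
        }
    return summary


if __name__ == "__main__":
    events = [
        ("u1", "control", True),
        ("u1", "control", True),
        ("u1", "control", False),
        ("u2", "control", False),
        ("u3", "treatment", True),
        ("u4", "treatment", True),
    ]
    print(summarize(events))
```

In the sample data both buckets have two clicks in total, but in control both come from a single heavy user. The per-user values are the observations a test would compare, which is only possible when events can be grouped by token.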

@ellery, I talked about this with @mpopov Friday, and he told me that Discovery uses unique tokens as standard practice in their experiments. (They set an experiment-specific cookie and log the token to Event Logging along with the rest of the data.) Are you concerned that unique tokens aren't acceptable under our privacy practices (apparently they are

There are many issues with unique tokens, and tons of documents have been written in this regard. We should not use them if possible. In this case they are confined to the experiment and to web users with JS enabled and localStorage support. That is a narrow range, but it is still somewhat concerning, as they can be used to break users' privacy, which we are bound to protect. These tokens are also decoupled from pageviews (i.e. there is no way for you to correlate other data to these tokens).

Let's please keep comments of this nature to the design doc, otherwise this ticket is going to be impossible to read and organize.
https://docs.google.com/document/d/1jRGjVAthJXoCovxyvXWyg07R1POb8zvD_n8IlJXrPVM/edit#

@Neil_P._Quinn_WMF, @ellery Please keep in mind that in any of Discovery's tests there is no knowledge as to whether the user is part of another test (e.g. hovercards), and that is one of the reasons why we need a centralized system.

• ema moved this task from Backlog to Caching on the Traffic board.Sep 30 2016, 3:18 PM

• Gilles removed a project: Performance-Team.Dec 6 2016, 4:06 PM

Jan_Dittrich subscribed.Dec 8 2016, 4:24 PM

Is that testing framework also planned to work with central notice/banners, or is that a separate infrastructure?

Is that testing framework also planned to work with central notice/banners, or is that a separate infrastructure?

Couldn't say without knowing how the central banner infrastructure works. I hope to have a design document out soon, but while we were trying to come up with a better scheme, our design is pretty much a formalization of what we have: an EventLogging client plus some bucketing infrastructure in Varnish.

Design document available in meta:
https://meta.wikimedia.org/wiki/Research:PrivacyConsciousABTestingAtWikimediaFoundation

• Nuria closed subtask T143694: Preliminary Design document for A/B testing as Resolved.Mar 22 2017, 7:48 PM

Is there some progress on this issue? The last experiments I was in contact with (and responsible for thinking up) were still hack-ish.

I wonder if packaging one of the hacks or a small A/B library would help, so that when we fix problems we at least fix them in the same hack/small thing. (PlanOut might be even nicer, but I assume it has not had a code review.)

@Addshore

We will not be doing more work towards this, as statistically we did not find a more privacy-conscious way to do A/B testing than the one EventLogging-based experiments offer. You can use eventlogging and wikimediaevents code at this time , there are quite a bit of examples of how to run ab tests on discovery's code.

You can use eventlogging and wikimediaevents code at this time , there are quite
a bit of examples of how to run ab tests on discovery's code.

My concern is mainly with the bucketing mechanism for which no standard (but many self cooked solutions) exists. This is what I would like to see standardized, since it seems to be programmed again and again.

You can use eventlogging and wikimediaevents code at this time , there are quite
a bit of examples of how to run ab tests on discovery's code.

My concern is mainly with the bucketing mechanism for which no standard (but many self cooked solutions) exists. This is what I would like to see standardized, since it seems to be programmed again and again.

@Jan_Dittrich : deterministic bucketing is available as part of wikimedia events, see an example of usage as part of search code: https://github.com/wikimedia/mediawiki-extensions-WikimediaEvents/blob/master/modules/ext.wikimediaEvents.searchSatisfaction.js#L100-L158

In T135762#3617324, @Jan_Dittrich wrote:

You can use eventlogging and wikimediaevents code at this time , there are quite
a bit of examples of how to run ab tests on discovery's code.

My concern is mainly with the bucketing mechanism for which no standard (but many self cooked solutions) exists. This is what I would like to see standardized, since it seems to be programmed again and again.

See also the notes by the web team at T168380: Explore an API for logging events sampled by session

Krinkle unsubscribed.Sep 19 2017, 8:33 PM

Deterministic bucketing is also available in MediaWiki core via the mediawiki.experiments RL module.
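For readers unfamiliar with the idea, deterministic bucketing can be sketched roughly as follows: hash the experiment name together with a user or session token, scale the hash to a number between 0 and 1, and walk the cumulative bucket weights. This is an illustration only, written in Python; the function name `get_bucket`, the SHA-256 hash, and the weight format are assumptions and not the actual mediawiki.experiments or WikimediaEvents implementation.

```python
import hashlib


def get_bucket(experiment_name, buckets, token):
    """Deterministically map a token to one of the weighted buckets.

    `buckets` maps bucket name -> weight; weights are assumed to sum to 1.
    The same (experiment_name, token) pair always yields the same bucket.
    """
    if not buckets:
        raise ValueError("at least one bucket is required")

    # Including the experiment name in the hash means that assignments
    # are not tied together across different experiments.
    key = f"{experiment_name}:{token}".encode("utf-8")
    digest = hashlib.sha256(key).digest()

    # Scale the first 8 bytes of the hash to a roughly uniform value in [0, 1].
    point = int.from_bytes(digest[:8], "big") / 2**64

    cumulative = 0.0
    for name, weight in buckets.items():
        cumulative += weight
        if point < cumulative:
            return name

    # Fallback for floating-point rounding when the weights sum to just under 1.
    return list(buckets)[-1]


if __name__ == "__main__":
    weights = {"control": 0.5, "A": 0.25, "B": 0.25}
    print(get_bucket("my-experiment", weights, "session-token-123"))
```

Because the bucket is a pure function of the experiment name and the token, nothing beyond the token itself needs to be stored on the client to keep a user in the same bucket across pageviews.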

We abandoned the original intent of this ticket, I think?

Ticket can be closed.

BBlack closed this task as Resolved.Oct 23 2017, 3:50 PM

BBlack claimed this task.

Jdlrobson mentioned this in T233609: [SPIKE 4hrs] What is technically feasible in terms of logged-in/logged-out users?.Sep 25 2019, 4:28 PM

ovasileva subscribed.Oct 16 2019, 5:25 PM

Aklapper edited projects, added Analytics-Radar; removed Analytics.Jun 10 2020, 6:44 AM

This was declined rather than resolved.