Create A/B test strategy for en- and dewiki tests
Closed, Resolved · Public · 5 Estimated Story Points

Description

Background

Prior to deploying the Page Previews (PP) feature on en- and dewiki, we would like to perform an A/B test on those wikis to gauge both user behavior and the effects of PP on fundraising.

Currently, given the user's session ID, the client:

  • Divides anonymous users into two buckets, "control" and "on", to determine whether PP should be enabled; we refer to this process as bucketing.
  • Divides all users into two buckets, "control" and "on", to determine whether the EventLogging instrumentation should be enabled; we refer to this process as sampling.

By bucketing and sampling users separately, we end up silently dropping a lot of data in both cases. This shouldn't happen.

To solve this we'll not sample users, i.e. we'll collect all data from users in the on and control buckets. In order not to overwhelm the EventLogging pipeline, we'll introduce a third "off" bucket for which we're not collecting data. The sizes of the control and on buckets should remain equal and will be considerably smaller than the size of the off bucket: 0.98:0.01:0.01 (off:control:on).
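
As a rough illustration of the three-bucket split described above, here is a minimal Python sketch. It is not the Popups implementation: the hashing scheme and the names `session_fraction`, `assign_bucket`, and `BUCKET_WEIGHTS` are assumptions made for this example, and only the 0.98:0.01:0.01 ratio comes from the description.

```python
import hashlib

# Weights taken from the ratio in the description (off:control:on).
BUCKET_WEIGHTS = [("off", 0.98), ("control", 0.01), ("on", 0.01)]


def session_fraction(session_id: str) -> float:
    """Map a session ID to a stable number in [0, 1).

    Assumption: SHA-256 is used here only for illustration; any stable,
    well-mixed hash would do.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # 48 bits fit exactly in a float, so the result is always below 1.0.
    return int.from_bytes(digest[:6], "big") / 2**48


def assign_bucket(session_id: str, weights=BUCKET_WEIGHTS) -> str:
    """Pick the bucket whose cumulative weight range contains the hash point."""
    point = session_fraction(session_id)
    upper = 0.0
    for name, weight in weights:
        upper += weight
        if point < upper:
            return name
    # Guard against floating-point rounding in the cumulative sum.
    return weights[-1][0]
```

Because the bucket depends only on the session ID, the same session always lands in the same bucket, and there is a single bucketing step rather than separate bucketing and sampling steps.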

Acceptance criteria

  • On en- and dewiki, anonymous users will be split into three buckets using their session ID (mw.user.sessionId()):
    • experiment (preview on, gathering data)
    • control (previews off, gathering data)
    • off (previews off, not gathering data)
  • If the user falls into a bucket that should be gathering data, then all data is sent to the server.
  • The instrumentation still respects DNT.
  • For all other wikis that the feature is deployed to, 100% of anonymous users should still receive the PP code and the EventLogging instrumentation is disabled.
  • The $wgPopupsAnonsEnabledSamplingRate and $wgPopupsSchemaSamplingRate config variables are removed.
  • The $wgPopupsAnonsExperimentalGroupSize config variable defines the on/control bucket size.
  • Events are not logged for logged in users.
  • There should be a kill switch for EventLogging. This allows us to disable EventLogging in the event we want to enable for a larger bucket size or too many events are being logged.
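
The criteria above combine into two small decisions: whether previews are on by default, and whether events are logged. The sketch below is one possible reading, not the extension's code. The function and parameter names are hypothetical; `event_logging_enabled` stands in for the kill switch, and the bucket name "on" is used for the experiment bucket.

```python
def previews_on_by_default(bucket: str) -> bool:
    """Only the "on" (experiment) bucket gets previews by default.

    Users in any bucket can still enable the feature themselves,
    which this sketch does not model.
    """
    return bucket == "on"


def should_log(bucket: str, is_anon: bool, do_not_track: bool,
               event_logging_enabled: bool) -> bool:
    """Decide whether EventLogging events are sent for this user."""
    if not event_logging_enabled:
        return False  # kill switch: nothing is logged at all
    if not is_anon:
        return False  # events are not logged for logged-in users
    if do_not_track:
        return False  # the instrumentation respects DNT
    # No sampling: every event from a data-gathering bucket is sent.
    return bucket in ("on", "control")
```

For example, `should_log("control", True, False, True)` is true, while the same call with the kill switch off or with DNT set is false.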

Closed Questions

  • Should the existing behavior (described in the Background section above) be unaffected on those wikis that the client is currently deployed to?

@phuedx: According to note #2 below, we're going to stop collecting data from other wikis (presumably so that we can collect as much data as possible from en- and dewiki).

Sign off steps

  • Clean up config relating to $wgPopupsAnonsEnabledSamplingRate and $wgPopupsSchemaSamplingRate (or create a task to do so)

Notes

  1. Users in all groups will be able to enable the feature using the footer link (for the on group, the feature will be enabled by default).
  2. Sampling for other wikis should be zero prior to deployment. Tests on other wikis will be turned off.
    • @phuedx: This should be moved into the deploy task.

Event Timeline

phuedx renamed this task from Create a/b test infrastructure for enwiki and dewiki tests for page previews to Create A/B test strategy for en- and dewiki tests. Jul 28 2017, 10:37 AM
phuedx updated the task description.
phuedx subscribed.

I'm confused.. I thought we've run A/B tests for Popups before? What was different about those and this that need additional infrastructure?

Thanks for asking for clarification, @Jdlrobson. I've tried to add a little more context in the Background section.

I'd also note that the existing A/B testing strategy that we've used for Page Previews has been described as "unusual". This work then has two advantages: we make the A/B testing strategy more usual and get rid of the pesky "bucketing/sampling" nomenclature problem as we just have bucketing!

Ping @ovasileva. I think this is ready for @Jdlrobson and the other Readers Web engineers to take a look at /cc @Tbayer

MBinder_WMF set the point value for this task to 5. Aug 1 2017, 4:50 PM
phuedx removed the point value for this task.
phuedx set the point value for this task to 5.

Change 371129 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/extensions/Popups@master] WIP: A/B test infrastructure

https://gerrit.wikimedia.org/r/371129

Here's what I'd propose we do (pausing to check I'm not barking up the wrong tree):

  • Update https://meta.wikimedia.org/wiki/Schema:Popups to have a bucket field.
  • getBaseData in src/reducers/eventLogging.js should return the user's current bucket for all events.
  • We continue to use wgPopupsAnonsEnabledSamplingRate to work out the buckets where possible. I've documented how I would expect the behaviour to work here: https://gerrit.wikimedia.org/r/371129
  • When deciding whether to event log we'll look at the buckets

Awaiting feedback before continuing...

Here's what I'd propose we do (pausing to check I'm not barking up the wrong tree):

Can we instead deduce it from the data? If the feature is on then the user is in the 'control' group, if it's off, then the user is in the 'off' group according to the task description.

  • getBaseData in src/reducers/eventLogging.js should return the user's current bucket for all events.

Only if the user is anonymous. Logged in users should not be bucketed, right?

  • We continue to use wgPopupsAnonsEnabledSamplingRate to work out the buckets where possible. I've documented how I would expect the behaviour to work here: https://gerrit.wikimedia.org/r/371129

It wasn't clear from the patch why 50% was used. Can we simplify this by introducing a new config variable, where we can specify the on, off, and control rates?

Here's what I'd propose we do (pausing to check I'm not barking up the wrong tree):

I'm not sure what you're trying to solve here. Could you expand on why this additional field is necessary?

AFAICT it might make visible the situation where an anonymous user disables or enables PP, which would otherwise look like a slight skew of the population /cc @Tbayer

  • We continue to use wgPopupsAnonsEnabledSamplingRate to work out the buckets where possible. I've documented how I would expect the behaviour to work here: https://gerrit.wikimedia.org/r/371129

Ideally the control and on buckets will be the same size so maybe we could use $wgPopupsAnonsEnabledSamplingRate but the name is now very confusing because of the "enabled". It also specifies the size of a bucket and not a sampling rate 😉

Only if the user is anonymous. Logged in users should not be bucketed, right?

No. See AC 1.

Only if the user is anonymous. Logged in users should not be bucketed, right?

No. See AC 1.

I didn't see any mention of logged in users in the A/C. Are you saying logged in users should also be bucketed?

Missed the "not". You're correct. Logged in users shouldn't be bucketed.

I'm not sure what you're trying to solve here. Could you expand on why this additional field is necessary?

I may have misunderstood the purpose of this but the spec says you will have 2 buckets that gather data:
experiment (preview on, gathering data)
control (previews off, gathering data)

I assume we'll want to distinguish between them?

Can we instead deduce it from the data? If the feature is on then the user is in the 'control' group, if it's off, then the user is in the 'off' group according to the task description.

Given a bucketed user can turn the feature off and still log events, an additional field will give us more certainty. In the early stages, it will also allow us easily to check the buckets are the same size and that there are no events being logged for "off".

Only if the user is anonymous. Logged in users should not be bucketed, right?

To clear this up, I was thinking we could either omit the bucket field for logged in users, or set a bucket "loggedin". I should note, we could do neither. The bucket is just an indicator. Logic for isEnabled and deciding whether to EventLog would be informed by it. Whether this matters and what to do really hinges on the answer to the first question: do we need a field for buckets?

It wasn't clear from the patch why 50% was used. Can we simplify this by introducing a new config variable, where we can specify the on, off, and control rates?

I'm not sure what you mean with regards to 50%, but yes we could use yet another config variable, but if we do so I'd rather we got rid of wgPopupsAnonsEnabledSamplingRate. We have way too many config variables in Popups with similar names

Ideally the control and on buckets will be the same size so maybe we could use $wgPopupsAnonsEnabledSamplingRate but the name is now very confusing because of the "enabled". It also specifies the size of a bucket and not a sampling rate 😉

We could rename the config again as part of this e.g. $wgPopupsAnonsEnabledGroupSize. I'm not sure why we have SamplingRate in the name - it was very confusing to me when I started this task as I thought it had something to do with EventLogging but doesn't.

Anyway, it sounds like this may warrant a discussion over standup when everyone is back..? I'm a little confused with what the goal is now.

I assume we'll want to distinguish between [the buckets]?

We capture whether or not previews are enabled in event.popupsEnabled. If we didn't log the bucket, then this could be used to derive the bucket they are in.

Given a bucketed user can turn the feature off and still log events, an additional field will give us more certainty.

You're right. I acknowledge this situation in T171853#3517905.

My position is to not log more data unless we have to. IIRC the rate at which folk disable the feature is low – and a disabled event is logged when they do! – so we should only expect a minor change in the bucket size.

In the early stages, it will also allow us easily to check the buckets are the same size and that there are no events being logged for "off".

I accept your point but I would say that we should be checking these things – the latter, definitely – during QA.

I'm not sure what you mean with regards to 50%, but yes we could use yet another config variable, but if we do so I'd rather we got rid of wgPopupsAnonsEnabledSamplingRate. We have way too many config variables in Popups with similar names.

Per AC 4, $wgPopupsAnonsEnabledSamplingRate should be removed if you're not going to use it to define bucket sizes.

Anyway, it sounds like this may warrant a discussion over standup when everyone is back..? I'm a little confused with what the goal is now.

Edit

@Jdlrobson rightly points out that we should have a killswitch for the EventLogging instrumentation. If we remove the $wgPopupsSchemaEnabledSamplingRate config variable, then this will be lost.

@Jdlrobson to add an AC about adding an EventLogging instrumentation killswitch.

Chatted with Sam and have a better idea of the goal. Will have something ready for review Monday :)

Okay patch is up: https://gerrit.wikimedia.org/r/371129 Popups A/B test infrastructure

At this point, to set expectations, I'd appreciate some code review around:

  • test cases/edge cases I might be missing
  • whether I'm meeting the spec as written

I've been a bit wrapped up in this today and it's quite possible some fresh eyes will find something I've missed. Reviews around code style/refactoring/config names/format are not useful to me at this time and can come later.

Thank you in advance!

This comment was removed by Jdlrobson.

Change 371129 merged by jenkins-bot:
[mediawiki/extensions/Popups@master] Popups A/B test infrastructure

https://gerrit.wikimedia.org/r/371129

Jdlrobson added a subscriber: ABorbaWMF.

We should probably get some general QA to verify that there has been no change in service for page previews users as a result of our changes.
@ABorbaWMF can you do a general test of the feature?

I took a quick look at beta and I did not notice any new issues. Should I check on a production site (non-english I'm guessing)?

Nope that was the right thing to do. Sounds like we can be confident the feature is still operating as normal.

The next question is how do we QA the infrastructure.

I did some more looking on a few browsers. So far it looks good.

Change 373171 had a related patch set uploaded (by Niedzielski; owner: Sniedzielski):
[operations/mediawiki-config@master] WIP (DO NOT MERGE): pagePreviews: remove invalidated popup sampling rate variables

https://gerrit.wikimedia.org/r/373171

I believe that signing off on this means a couple things:

  • Remove $wgPopupsAnonsEnabledSamplingRate and $wgPopupsSchemaSamplingRate in mediawiki-config. That's here. It's almost certainly wrong and hasn't been tested in any fashion.
  • Verify each acceptance criterion.

I'm a bit lost on both but especially on verification. Here's what I think I know:

  1. I have my EventLogging configuration working and I can fish out results with something like select * from Popups_16364296 where webhost like 'es.wikipedia.beta.wmflabs.org'; (the schema revision matches master) or maybe grepping for my session ID in all-events.log.
  2. Page previews are not present on mobile Wikipedia because popups are disabled(?) so I only have to test desktop.
  3. Page previews are not available on even the English BC yet so I better check something like the Spanish BC.
  4. Use [[ https://gerrit.wikimedia.org/r/#/c/371129/8/src/getUserBucket.js,unified | getUserBucket() ]] to determine which of three buckets I'm in and verify the state characteristics mentioned in the AC (+ login state, DNT).

Here's what I'm blocked on:

  1. wgPopupsOnControlBucketSize doesn't seem to exist so I'm not sure what I should test.
  2. Testing the kill switch. Should I just set wgPopupsSchemaEnabledSamplingRate to 0 in the browser and verify no popup events are sent when wgPopupsEventLogging is true?

Circling back to the bottom line of this task, if all this stuff works, then when page previews _does_ roll out to English and German wikis, our logging will be bucketed and we can collect the metrics we always wanted.

/cc @Jdlrobson @phuedx

I believe that signing off on this means a couple things:

  • Remove $wgPopupsAnonsEnabledSamplingRate and $wgPopupsSchemaSamplingRate in mediawiki-config. That's here. It's almost certainly wrong and hasn't been tested in any fashion.

This can be done as part of enabling the A/B test itself (see T172291: Launch page previews A/B test on enwiki and dewiki) but thanks for submitting a change!

  1. I have my EventLogging configuration working and I can fish out results with something like select * from Popups_16364296 where webhost like 'es.wikipedia.beta.wmflabs.org'; (the schema revision matches master) or maybe grepping for my session ID in all-events.log.

For those reading along, [[ https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/TestingOnBetaCluster#How_to_verify_events | if you're going to be testing on the BC, you'll need to be grepping all-events.log on deployment-eventlog02]].

  2. Page previews are not present on mobile Wikipedia because popups are disabled(?) so I only have to test desktop.

👍

Here's what I'm blocked on:

  1. wgPopupsOnControlBucketSize doesn't seem to exist so I'm not sure what I should test.
  2. Testing the kill switch. Should I just set wgPopupsSchemaEnabledSamplingRate to 0 in the browser and verify no popup events are sent when wgPopupsEventLogging is true?

Let's set up the BC such that enwiki and eswiki have $wgPopupsOnControlBucketSize set but only the former has $wgPopupsEventLogging = true

Change 373264 had a related patch set uploaded (by Phuedx; owner: Phuedx):
[operations/mediawiki-config@master] pagePreviews: Enable A/B test (BC-only)

https://gerrit.wikimedia.org/r/373264

@phuedx, thanks for the comments and video reply!

  1. So now we just wait until someone merges your labs patch before we can test and sign off? Does a merge to master = instant deploy?
  2. Is wgPopupsEventLogging now the kill switch? So I just have to visit Spanish BC to verify popups are still enabled but logging is disabled?
  3. Am I supposed to verify the wgPopupsAnonsExperimentalGroupSize group distribution is as intended (25% pp and logging / 25% no pp and logging / 50% no pp and no logging) or just that each of three groups exists?

This can be done as part of enabling the A/B test itself

Feel free to abandon this patch if it makes sense. I only did it because it was marked as a sign-off requirement on this ticket.

@phuedx, thanks for the comments and video reply!

  1. So now we just wait until someone merges your labs patch before we can test and sign off? Does a merge to master = instant deploy?

Yes. We can do that ourselves. If you'd like, then I can walk you through it on a Google Hangout.

  2. Is wgPopupsEventLogging now the kill switch? So I just have to visit Spanish BC to verify popups are still enabled but logging is disabled?

👍

  3. Am I supposed to verify the wgPopupsAnonsExperimentalGroupSize group distribution is as intended (25% pp / 25% no pp and no logging / 50% no pp and no logging) or just that each of three groups exists?

Both. Except that the on and control buckets have logging enabled, provided that $wgPopupsEventLogging is true.

@phuedx, thanks for the comments and video reply!

For posterity, the video reply is here: https://www.youtube.com/watch?v=Ttf_J2jMqnc

Change 373264 merged by jenkins-bot:
[operations/mediawiki-config@master] pagePreviews: Enable A/B test (BC-only)

https://gerrit.wikimedia.org/r/373264

I tried again. Please see my notes below but this doesn't seem to be functioning quite as expected.

On en- and dewiki, anonymous users will be split into three buckets using their session ID (mw.user.sessionId()):

  • 25% on (preview on, gathering data)
  • 25% control (previews off, gathering data)
  • 50% off (previews off, not gathering data)
  • This change is not deployed on BC en or dewikis. With debug=true, logging should occur on all BC wikis but doesn't on en or dewiki.
  • The config variable, wgPopupsAnonsExperimentalGroupSize, appears to have the wrong configuration. It's set to .25 but, if I read the code right, this corresponds to 12.5% on, 12.5% control, and 75% off. I believe .5 is the wanted value. In my experiments on zhwiki, I saw the following distribution: 4 on, 21 control, 0 off. However, now I think that logging is always enabled when debug=true so this would give me a distribution of 12.5% on, 87.5% control, 0% off. I will retest the distribution when the config is updated.
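
The arithmetic behind that reading can be sketched as follows. This assumes, as the comment above does without having verified it, that the configured group size is the combined share of the on and control buckets, split evenly between them. The function name `bucket_shares` is hypothetical.

```python
def bucket_shares(experimental_group_size: float) -> dict:
    """Split the experimental share evenly between "on" and "control".

    Assumption: the configured value covers both data-gathering buckets
    together; everything else falls into "off".
    """
    if not 0.0 <= experimental_group_size <= 1.0:
        raise ValueError("group size must be between 0 and 1")
    half = experimental_group_size / 2
    return {"on": half, "control": half, "off": 1.0 - experimental_group_size}


print(bucket_shares(0.25))  # {'on': 0.125, 'control': 0.125, 'off': 0.75}
print(bucket_shares(0.5))   # {'on': 0.25, 'control': 0.25, 'off': 0.5}
```

Under this assumption a value of 0.5, not 0.25, yields the intended 25% / 25% / 50% split.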

If the user falls into a bucket that should be gathering data, then all data is sent to the server.

  • This works. I verified by monitoring network activity and tail -f /srv/log/eventlogging/all-events.log | grep --line-buffered -E '(ar|es|he|zh)\.wikipedia\.beta\.wmflabs\.org' on the EventLogging server. I will retest the off bucket when I retest the distribution.
  • The schema used is notably outdated.

The instrumentation still respects DNT.

This works.

For all other wikis that the feature is deployed to, 100% of anonymous users should still receive the PP code and the EventLogging instrumentation is disabled.

I did not observe a change on Spanish or English production Wikipedias. You can see that getUserBucket.js isn't even in the bundle on these wikis.

The $wgPopupsAnonsEnabledSamplingRate and PopupsSchemaSamplingRate config variables are removed.

These variables have been eliminated from the Popups repo.

The $wgPopupsAnonsExperimentalGroupSize? config variable defines the on/control bucket size

The Popups repo implementation appears correct, the BC wikis return the expected value for mw.config.get('wgPopupsAnonsExperimentalGroupSize'), and the test distribution offers reasonable support.

Events are not logged for logged in users.

I will reevaluate this when retesting distributions.

There should be a kill switch for EventLogging. This allows us to disable EventLogging in the event we want to enable for a larger bucket size or too many events are being logged.

I'm not sure if this is working. wgPopupsEventLogging is false on Chinese, Spanish, Arabic, and Hebrew and null on English and German BC wikis. I never witnessed logging without debug=true so I'm not sure.

Change 373531 had a related patch set uploaded (by Phuedx; owner: Phuedx):
[operations/mediawiki-config@master] pagePreviews: Re-enable on Beta Cluster

https://gerrit.wikimedia.org/r/373531

Change 373531 merged by jenkins-bot:
[operations/mediawiki-config@master] pagePreviews: Re-enable on enwiki and dewiki (BC-only)

https://gerrit.wikimedia.org/r/373531

Mentioned in SAL (#wikimedia-operations) [2017-08-24T13:19:20Z] <phuedx@tin> Synchronized wmf-config/InitialiseSettings-labs.php: T171853: Re-enable Page Previews for enwiki and dewiki on the Beta Cluster (duration: 00m 47s)

Change 373541 had a related patch set uploaded (by Phuedx; owner: Phuedx):
[operations/mediawiki-config@master] pagePreviews: Bump on/control group size to 25% (BC-only)

https://gerrit.wikimedia.org/r/373541

Change 373541 merged by jenkins-bot:
[operations/mediawiki-config@master] pagePreviews: Bump on/control group size to 25% (BC-only)

https://gerrit.wikimedia.org/r/373541

I tried again. Please see my notes below but this doesn't seem to be functioning quite as expected.

On en- and dewiki, anonymous users will be split into three buckets using their session ID (mw.user.sessionId()):

  • 25% on (preview on, gathering data)
  • 25% control (previews off, gathering data)
  • 50% off (previews off, not gathering data)
  • This change is not deployed on BC en or dewikis. With debug=true, logging should occur on all BC wikis but doesn't on en or dewiki.
  • The config variable, wgPopupsAnonsExperimentalGroupSize, appears to have the wrong configuration. It's set to .25 but, if I read the code right, this corresponds to 12.5% on, 12.5% control, and 75% off. I believe .5 is the wanted value. In my experiments on zhwiki, I saw the following distribution: 4 on, 21 control, 0 off. However, now I think that logging is always enabled when debug=true so this would give me a distribution of 12.5% on, 87.5% control, 0% off. I will retest the distribution when the config is updated.

Both of these points were addressed by the changes above (373531 and 373541).

works but please note that the schema is outdated.

On en- and dewiki, anonymous users will be split into three buckets using their session ID (mw.user.sessionId()):

  • 25% on (preview on, gathering data)
  • 25% control (previews off, gathering data)
  • 50% off (previews off, not gathering data)

I identified buckets by behavior. The following distribution was seen on en BC: 3 on, 5 control, 11 off. The killswitch is enabled on de BC so the following distribution was noted: 7 on (without logging), 17 off or control.
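
One way to sanity-check the en BC counts against the intended 25% / 25% / 50% split is a chi-square goodness-of-fit statistic. The sketch below computes it by hand. It is only a rough check: with 19 observations the expected counts are small, so the chi-square approximation is weak. The function name is hypothetical, and 5.991 is the standard 5% critical value for two degrees of freedom.

```python
def chi_square(observed: dict, expected_shares: dict) -> float:
    """Sum of (observed - expected)^2 / expected over all buckets."""
    total = sum(observed.values())
    stat = 0.0
    for name, share in expected_shares.items():
        expected = total * share
        stat += (observed.get(name, 0) - expected) ** 2 / expected
    return stat


observed = {"on": 3, "control": 5, "off": 11}          # counts reported above
expected = {"on": 0.25, "control": 0.25, "off": 0.5}   # intended split

CRITICAL_5PCT_DF2 = 5.991  # chi-square critical value, 2 degrees of freedom

stat = chi_square(observed, expected)
print(round(stat, 3), stat < CRITICAL_5PCT_DF2)  # prints: 0.895 True
```

The statistic is far below the critical value, so these counts give no reason to doubt the intended split, although a sample this small cannot confirm it either.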

If the user falls into a bucket that should be gathering data, then all data is sent to the server.

en logs when not off. The server receives events like:

{"event": {"action": "dismissed", "hovercardsSuppressedByGadget": false, "isAnon": true, "linkInteractionToken": "404214c9a17b06e8", "namespaceIdHover": 0, "namespaceIdSource": -1, "pageIdSource": 0, "pageTitleHover": "Ictonyx", "pageTitleSource": "Search", "pageToken": "0d32eb3c43a46c27", "perceivedWait": 735, "popupEnabled": true, "previewCountBucket": "5-20 previews", "previewType": "page", "sessionToken": "c2cec9acc1260a11", "totalInteractionTime": 1127}, "recvFrom": "deployment-cache-text04.deployment-prep.eqiad.wmflabs", "revision": 16364296, "schema": "Popups", "seqId": 3464915, "timestamp": 1503597182, "userAgent": "{\"os_minor\": null, \"is_bot\": false, \"os_major\": null, \"device_family\": \"Other\", \"os_family\": \"Ubuntu\", \"browser_minor\": \"0\", \"wmf_app_version\": \"-\", \"browser_major\": \"60\", \"browser_family\": \"Chromium\", \"is_mediawiki\": false}", "uuid": "5d42d7dcfccb539a9716205d53bd7a64", "webHost": "en.wikipedia.beta.wmflabs.org", "wiki": "enwiki"}
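
For checking events like the one above without grep, a small filter can parse each log line as JSON. This is a sketch under the assumption that the log holds one JSON object per line; the field names `schema`, `webHost`, and `event` are taken from the sample event, and the function name `matching_events` is hypothetical.

```python
import json


def matching_events(lines, web_host, schema="Popups"):
    """Yield parsed records for the given schema and web host.

    Lines that are blank or not valid JSON are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if record.get("schema") == schema and record.get("webHost") == web_host:
            yield record


# Usage with a trimmed-down version of the event shown above.
sample_lines = [
    '{"event": {"action": "dismissed", "isAnon": true, "popupEnabled": true}, '
    '"schema": "Popups", "webHost": "en.wikipedia.beta.wmflabs.org"}',
]
for record in matching_events(sample_lines, "en.wikipedia.beta.wmflabs.org"):
    print(record["event"]["action"], record["event"]["isAnon"])
```

Filtering on parsed fields rather than raw text avoids matching a host name that merely appears inside another field.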

The instrumentation still respects DNT.

This still works.

For all other wikis that the feature is deployed to, 100% of anonymous users should still receive the PP code and the EventLogging instrumentation is disabled.

I still did not observe a change on Spanish or English production Wikipedias.

The $wgPopupsAnonsEnabledSamplingRate and PopupsSchemaSamplingRate config variables are removed.

These variables have been eliminated from the Popups repo.

The $wgPopupsAnonsExperimentalGroupSize? config variable defines the on/control bucket size

The Popups repo implementation appears correct, the BC wikis universally return the expected value for mw.config.get('wgPopupsAnonsExperimentalGroupSize'), and the test distribution still offers reasonable support.

Events are not logged for logged in users.

The initial state of popups for logged in users is disabled. No logging occurs whether popups are disabled or enabled.

There should be a kill switch for EventLogging. This allows us to disable EventLogging in the event we want to enable for a larger bucket size or too many events are being logged.

ar, de, es, he, and zh BC wikis do not log and mw.config.get('wgPopupsEventLogging') is false. mw.config.get('wgPopupsEventLogging') is true on en BC and logging occurs in the on and control groups.

Clean up config relating to $wgPopupsAnonsEnabledSamplingRate and $PopupsSchemaSamplingRate (or create a task to do so)

A WIP patch exists.

the schema is outdated.

My mistake. The changes to the schema are just description edits.

Follow up work is captured in T174075. Thanks for checking this so thoroughly, @Niedzielski!