
[SPIKE] Identify cause for duplicate A/B test bucketing
Closed, Resolved · Public



During the analysis of the sticky header A/B test (T298873: Analyze results of sticky header A/B test), we identified that 809 sessions were assigned to both the control and test groups. We would like to look into why this happened, in order to avoid it in future tests.

Acceptance criteria

  • Determine reasons for duplicate A/B test bucketing

Event Timeline

ovasileva triaged this task as High priority.
ovasileva created this task.
ovasileva added a subscriber: jwang.

809 sessions were assigned to both the control and test groups. We excluded them from the analysis.

For the sticky header experiment we bucketed users by their user ID: mw.user.getId().toString().
The web_session_id relates to their session and is randomly generated at the start of a browsing session.

So if we're seeing the same web_session_id in both groups, it suggests that the user ID was different for the same browsing session, resulting in two different bucket assignments.

This situation can happen when a user has two accounts, for example a sock puppet, or a staff member switching to their volunteer account. Both accounts are used in the same browser, so they share a session ID; however, since we're bucketing by user ID, the two accounts can be shown different treatments.
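A minimal sketch of how this plays out, assuming buckets are assigned deterministically from the user ID (`bucketFor` is a hypothetical stand-in, using numeric parity in place of whatever hashing the real test did):

```javascript
// Hypothetical stand-in for the real bucketing: the sticky header test
// bucketed on mw.user.getId().toString(); here numeric parity plays the
// role of the deterministic hash.
function bucketFor(userId) {
  return userId % 2 === 0 ? 'stickyHeaderEnabled' : 'stickyHeaderDisabled';
}

// One browsing session, two accounts: the session ID never changes,
// but each account's user ID can map to a different bucket.
const sessionId = 'A';
const enrollments = [7, 8].map((userId) => ({
  web_session_id: sessionId,
  user_id: userId,
  group: bucketFor(userId),
}));

console.log(enrollments);
// Both rows share web_session_id 'A' but land in different groups.
```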

To support this hypothesis, I ran the query from the analysis:

SELECT web_session_id, wiki, COUNT(DISTINCT `group`) AS groups, MIN(meta.dt) AS session_dt
FROM event.mediawiki_web_ab_test_enrollment
WHERE wiki NOT IN ('testwiki', 'test2wiki')
  AND year = 2022 AND month = 1 AND day BETWEEN 6 AND 30
GROUP BY web_session_id, wiki
HAVING groups > 1
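The detection logic in that query (count distinct groups per session, keep sessions with more than one) can be mirrored in a few lines of JavaScript; the sample rows below are made up for illustration:

```javascript
// Find session IDs that were enrolled in more than one group,
// mirroring the GROUP BY ... HAVING groups > 1 query above.
function sessionsWithMultipleGroups(events) {
  const groupsBySession = new Map();
  for (const { web_session_id, group } of events) {
    if (!groupsBySession.has(web_session_id)) {
      groupsBySession.set(web_session_id, new Set());
    }
    groupsBySession.get(web_session_id).add(group);
  }
  return [...groupsBySession]
    .filter(([, groups]) => groups.size > 1)
    .map(([sessionId]) => sessionId);
}

// Made-up sample data: session A was bucketed twice, session B once.
const sample = [
  { web_session_id: 'A', group: 'stickyHeaderDisabled' },
  { web_session_id: 'A', group: 'stickyHeaderEnabled' },
  { web_session_id: 'B', group: 'stickyHeaderEnabled' },
];
console.log(sessionsWithMultipleGroups(sample)); // [ 'A' ]
```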

I took some of those session IDs and ran this query:

SELECT *
FROM event.mediawiki_web_ab_test_enrollment
WHERE web_session_id = (redacted)
  AND year = 2022 AND month = 1 AND day = 16
LIMIT 16;

I looked at those duplicate rows and verified that they have similar timestamps. There was a pattern, though: viewed in chronological order, the group changes are not random, which suggests someone switching accounts in the same browser.

This hypothesis is strengthened by looking at the editCountBucket field for events that occurred around the same timestamps and seeing that it is different:

SELECT *
FROM event.desktopwebuiactionstracking
WHERE event.token = <redacted>
  AND year = 2022 AND month = 1 AND day = 16
LIMIT 8;

It works as follows:

  • Jon opens English Wikipedia in his browser. He's assigned session ID A, and he has user ID Z. User ID Z is used to bucket Jon, and he is given the bucket "stickyHeaderDisabled".
  • Jon logs out. As an anonymous user he is no longer bucketed. His session ID remains A, as he hasn't closed the browser.
  • Jon logs in as Jon (WMF) within the same browser window. He's assigned user ID Y. His session ID still remains A, as he hasn't closed the browser. User ID Y, however, results in a different bucket: "stickyHeaderEnabled".

So what this data is telling us is that the 809 sessions came from users with multiple accounts.

cc @ovasileva @jwang

That's _very_ interesting. Thanks for doing this! Would it make sense to just test this theory a bit more?

  1. If I understand correctly, 809 is the number of sessions we examined in the context of the sticky header experiment that were used by users with multiple accounts. What percent of the total sessions in the experiment is that?
  2. Could we verify that this ratio is roughly the same for users outside of the experiment, as a kind of sanity check?
  3. Assuming points 1 and 2 line up, should we bucket based on session ID + user ID, so we have a consistent experience per user?
  1. From the analysis, my understanding is that it's 809 out of 435756 user sessions, so roughly 0.19%.

All sessions would be the sum of these fields:

Screen Shot 2022-03-28 at 8.56.01 AM.png (288×756 px, 24 KB)

  2. I don't think so. We usually avoid logging user IDs in events because they identify users. The only reason we know here that the user ID is different is that the A/B test leaked that information by using the user ID for bucketing. To verify this we'd likely need to run another A/B test using the user ID.
  3. For logged-in users, yes, as long as we do it consistently. However, there may be a privacy cost to doing this if the session ID is derived from the user ID. One thing we could do that may be far simpler is invalidating the session ID on logout. What do you think?
Jdlrobson assigned this task to ovasileva.
Jdlrobson added a subscriber: Jdlrobson.

Seems like we've identified the cause, and the ratio is small enough that we can exclude these sessions from the analysis without larger concerns. Resolving.