Page MenuHomePhabricator

Schema QA: mediawiki_web_ab_test_enrollment
Closed, ResolvedPublic

Description

As one of the instrumentations for ToC ( table of content), Web team will deploy the instrumentation to enable AB test on the table of content. (T302046)

Instrumentation note

The test group enrollment events are stored in mediawiki_web_ab_test_enrollment schema.
As of 2022-04-27, user bucketing is based on what page user lands on first, and then persistent for that session. The page_id of the landing page is the salt to decide which test group the session will be in. Session_id is used to identify user session, stored in field web_session_id.

Test population: logged in users and anonymous users
Group split: Sample entire population, 50% to control group, 50% to test group.

QA summary

What has been checkedStatusNote
wiki✅passEnabled on euwiki and hewiki since 2022-04-26
Check experiment group assignment✅passSchema captured two groups: control , treatment , with experiment_name='skin-vector-toc-experiment'
Check daily pageviews and sessions by groups✅passThe schema runs on every page view for users that are inside the A/B test while the A/B test is running
Check bucketing✅passThe number of unique sessions in control and treatment is close to 1:1
Check whether mediawiki_web_ab_test_enrollment can join with click events on stiky header✅passTwo schemas can be joined by condition DesktopWebUIActionsTracking.event.token =mediawiki_web_ab_test_enrollment.web_session_id and event.name = 'ui.toc'
Check whether mediawiki_web_ab_test_enrollment can join with scroll events✅passTwo schemas can be joined by web_session_id .
Check whether mediawiki_web_ab_test_enrollment can join with reading depth events✅passTwo schemas can be joined by mediawiki_reading_depth.session_token = mediawiki_web_ab_test_enrollment.web_session_id
Check the sessions in both treatment and control groups26% of sessions are in two groups on euwiki, 35% of sessions are in tow groups on hewiki
Events by spiders✅passIdentified 31 sessions are from Spider on hewiki, 0 from euwiki.

Bugs/Potential Issues

IssueStatusNote
web_pageview_id is NULLIn current AB test instrument, we randomize sessions by the page_id of the landing page. Should we also record the page_id for the purpose of data sanity check and analysis?
Check the sessions in both treatment and control groups26% sessions on euwiki are in two groups , 35% sessions on hewiki are in two groups

Event Timeline

jwang created this task.
jwang added a subscriber: Jdrewniak.

@cjming , @Jdlrobson
Not sure whether you have already knew about it. Here is the percentage of the sessions in both control group and treatment group on euwiki and hewiki.

What is tested:
Check duplicated assignment. How many sessions are assigned into both control group and treatment group?

Bugs or potential issues
On euwiki, 26% of sessions are assigned to both control group and treatment group.
On hewiki, 35% of sessions are assigned to both control group and treatment group.

wikigrouptotal sessionsdup_sessionsdup_rate
euwikicontrol25464673326.44%
euwikitreatment25622673326.28%
hewikicontrol48071517059835.49%
hewikitreatment48298117059835.32%
jwang updated the task description. (Show Details)

As part of investigating the bucketing for the TOC A/B test, I was able to replicate locally the multiple enrollment events that get logged within a single session using the sampling rates that were configured on 4/26/22 for hewiki + euwiki. When navigating to different pages in the same session, whether in a new window or a new tab, an A/B test enrollment event gets triggered with varying group assignments. And as a reader, I see the new sticky TOC on one page and if I navigate to another page by NOT linking from the previous page, I can see the legacy TOC.

This is also the case on euwiki wherein the same session, I can see sticky TOC on one page and if I open another page by just pasting a new URL in a new tab, I can see the legacy TOC:

Screen Shot 2022-05-05 at 11.35.53 PM.png (2×4 px, 2 MB)

Apparently there is a workaround using query parameters that overrides this behavior so that if you navigate to new pages by clicking on links from within the current page, whatever TOC is visible on the current page will be carried over to the next pages using query params. As confirmed by @nray, we are trying to preserve a user's bucket by highjacking clicked links within the current page using the tableofcontents query param. Notably though, the same web_session_id having different buckets is a drawback of bucketing based on article id in order to test anonymous users. This bucket preservation effort can be easily bypassed by navigating to urls directly, clicking links via organic search, etc. As noted in the ACs and QA steps of T302046, the lack of robustness of this current implementation is expected and its limitations based on a week's worth of data is confirmed.

I'm unclear whether the expected behavior is that only a single enrollment event should be logged per web_session_id - if this is the case, then the TOC experiment needs to be adjusted accordingly.

Seeing multiple A/B test enrollment events for a single web_session_id on my local is corroborated by the following queries that I've run in Hue to try to get a sense of how many duplicate records are being ascribed to the same web_session_id on a given day for a particular wiki (i.e. hewiki since it returns numbers of records many orders of magnitude over euwiki).

For instance, looking at events for hewiki on 5/5/22, there is a significant difference between counts of distinct vs cumulative web_session_ids for the treatment group:

SELECT count(distinct web_session_id)
FROM event.mediawiki_web_ab_test_enrollment
WHERE wiki = 'hewiki' AND `group` = 'treatment' AND year = 2022 AND month = 5 AND day = 5

returns 47,883 rows
while

SELECT count(web_session_id)
FROM event.mediawiki_web_ab_test_enrollment
WHERE wiki = 'hewiki' AND `group` = 'treatment' AND year = 2022 AND month = 5 AND day = 5

returns 106,168 rows (more than double)

This is also the case with the control group --

SELECT count(distinct web_session_id)
FROM event.mediawiki_web_ab_test_enrollment
WHERE wiki = 'hewiki' AND `group` = 'control' AND year = 2022 AND month = 5 AND day = 5

returns 62,541 rows
while

SELECT count(web_session_id)
FROM event.mediawiki_web_ab_test_enrollment
WHERE wiki = 'hewiki' AND `group` = 'control' AND year = 2022 AND month = 5 AND day = 5

returns 140,076 rows (again more than double)


When looking at the numbers of unique web_session_ids that have both control and treatment group rows, the same pattern persists.

SELECT count(distinct web_session_id)
FROM event.mediawiki_web_ab_test_enrollment
WHERE year = 2022 AND month = 5 AND day = 5 and wiki = 'hewiki' AND `group` IN ('control', 'treatment')

returns 100,581 rows
while

SELECT count(web_session_id)
FROM event.mediawiki_web_ab_test_enrollment
WHERE year = 2022 AND month = 5 AND day = 5 and wiki = 'hewiki' AND `group` IN ('control', 'treatment')

returns 277,218 rows (significantly more than double -- meaning many web_session_ids are logging multiple events for both control and treatment groups)

By comparison the same 2 queries above but limited to 5/4/2022

SELECT count(distinct web_session_id)
FROM event.mediawiki_web_ab_test_enrollment
WHERE year = 2022 AND month = 5 AND day = 4 and wiki = 'hewiki' AND `group` IN ('control', 'treatment')

returns 129,965 rows
while

SELECT count(web_session_id)
FROM event.mediawiki_web_ab_test_enrollment
WHERE year = 2022 AND month = 5 AND day = 4 and wiki = 'hewiki' AND `group` IN ('control', 'treatment')

returns 325,102 rows (again more than double)


To examine more granularly how many group assignments a single web_session_id might have, the following queries and ensuing screenshots of results for hewiki on 5/4/22 reveal that depending on the user's browsing activity, a single web_session_id can have anywhere from a few to over 50 group assignments.

For reference, the following query was what @jwang used to identify duplicate sessions:

SELECT  web_session_id, wiki, count(distinct `group` ) AS groups
FROM event.mediawiki_web_ab_test_enrollment
WHERE wiki NOT IN ('testwiki','test2wiki')  and year=2022 and month IN (4,5)
GROUP BY  web_session_id, wiki
HAVING groups>1

I adapted the above query that @jwang used to determine duplicate groups per web_session_id in order to get cumulative row counts per wiki per day:

SELECT COUNT(*) FROM (SELECT web_session_id, count(distinct `group` ) AS groups
FROM event.mediawiki_web_ab_test_enrollment
WHERE wiki = 'hewiki'  and year=2022 and month = 5 and day = 4
GROUP BY web_session_id
HAVING groups > 1) t2

returns 25,264 rows

Here's a graph of the above query over the past week:

Screen Shot 2022-05-05 at 9.38.27 PM.png (1×1 px, 192 KB)

Same query without distinct keyword:

SELECT COUNT(*) FROM (SELECT web_session_id, count(`group` ) AS groups
FROM event.mediawiki_web_ab_test_enrollment
WHERE wiki = 'hewiki'  and year=2022 and month = 5 and day = 4
GROUP BY web_session_id
HAVING groups > 1) t2

returns 51,429 rows

Here's a graph of the above query (without distinct keyword) over the past week:

Screen Shot 2022-05-05 at 9.49.55 PM.png (1×1 px, 192 KB)

To see a sampling of how many different groups a single web_session_id can have, here's another variation of the query with the cumulative group numbers per web_session_id:

SELECT web_session_id, count(`group`) AS groups
FROM event.mediawiki_web_ab_test_enrollment
WHERE wiki = 'hewiki'  and year=2022 and month = 5 and day = 4
GROUP BY web_session_id
HAVING groups > 1

returns several rows where groups range from 2 to 50+ (I'm not sure what the highest number is - this is where my SQL query chops peter out):

Screen Shot 2022-05-05 at 9.08.29 PM.png (2×1 px, 236 KB)

Screen Shot 2022-05-05 at 9.08.46 PM.png (2×1 px, 121 KB)


To drill down further, I ran the following query to get a random sampling from 5/5/22:

SELECT * FROM (SELECT web_session_id, count(`group` ) AS groups
FROM event.mediawiki_web_ab_test_enrollment
WHERE wiki = 'hewiki'  and year = 2022 and month = 5 and day = 5
GROUP BY web_session_id
HAVING groups > 1) t2 LIMIT 10

which returned:

Screen Shot 2022-05-05 at 10.27.50 PM.png (1×1 px, 182 KB)

I queried for the single web_session_id that had 24 groups and here is an example of the data returned:

Screen Shot 2022-05-05 at 10.35.18 PM.png (1×4 px, 870 KB)


TL;DR the TOC instrument is logging multiple rows with varying group assignments for TOC A/B test enrollment per single web_session_id (i.e. different buckets per page view per single/same web_session_id) - not for every single session but for a significant portion of them. I'm not sure we can avoid this given how bucketing is currently implemented. And I'm not sure what next steps can be taken to try to mitigate this after analyzing a week's worth of data for hewiki + euwiki (above queries are pulled from hewiki -- euwiki does show similar patterns though on a much smaller scale).

FWIW, the fix we backported yesterday (T307019) was to prevent PHP undefined offset errors during bucketing which leverages the PageSplitter instrument. That fix was rolled out to hewiki as a group1 wiki at UTC 20:00 on 5/4/22. I included queries for hewiki from both 5/4/22 and 5/5/22 above to indicate that the duplicate sessions were still happening even after the backport. Numbers being lower from 5/4 to 5/5 notwithstanding, there's clearly still many multiple TOC enrollment events with different buckets being logged per web_session_id.

@cjming, I raised the same question regarding duplicated records for the same session_id and group assignment in sticky header instrumentation QA T297852. Please find @Jdlrobson 's answer at T292587#7576512.

Thanks @jwang for the reminder.

If we're getting roughly 1/4-1/3 of sessions in both buckets (control + treatment), does that mean the data is still useful in your analysis if you can join with other schemas on the web_session_id?

Also @jwang re: web_pageview_id returning NULL in the data sets -- I believe we can add page ids by populating the value for web_pageview_id. I'm not sure though if there's a privacy issue with including them - presumably if they're hashed it's ok? I'll bring it up at our next team meeting on Monday and follow up here with next steps on that piece.

cc @ovasileva @Jdlrobson @nray

@ovasileva to read over discussion and summarize next steps

Apparently there is a workaround using query parameters that overrides this behavior so that if you navigate to new pages by clicking on links from within the current page, whatever TOC is visible on the current page will be carried over to the next pages using query params. As confirmed by @nray, we are trying to preserve a user's bucket by highjacking clicked links within the current page using the tableofcontents query param. Notably though, the same web_session_id having different buckets is a drawback of bucketing based on article id in order to test anonymous users. This bucket preservation effort can be easily bypassed by navigating to urls directly, clicking links via organic search,

As mentioned on Slack, another way this query param can be bypassed currently is by clicking on a search result in the typeahead search component. Should there be a ticket for this?

jwang updated the task description. (Show Details)

Discussed in standup and due to the observational character of the analysis here, I think we can proceed without adding the page ids. cc @jwang