Create harness for the language switcher A/B test
Closed, Declined (Public)

Description

Readers Web will be A/B testing the existing treatment of the language switcher and a new treatment being worked on as part of Desktop Improvements (Vector 2022). The initial cohort will be all users (not just logged-in users).

AC

  • When I visit a wiki with the A/B test enabled, I receive either the existing treatment or the new one
  • When I visit a wiki with the A/B test enabled and I have the magic query string parameter set then it should have the following effects:
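+-----------+-----------------------------------------------------+
| Value     | Effect                                              |
+-----------+-----------------------------------------------------+
| undefined | I'm entered into the A/B test and bucketed as usual |
| control   | I see the existing treatment                        |
| A         | I see the new treatment                             |
+-----------+-----------------------------------------------------+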
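
A minimal sketch of how the override described above could be resolved on the client. The parameter name `languageSwitcherABTest` and the `bucketAsUsual` callback are hypothetical: the task names neither the query string parameter nor the normal bucketing mechanism. Treating an unrecognised value like an absent parameter is also an assumption.

```javascript
// Hypothetical parameter name; the task only calls it a "magic query string parameter".
const OVERRIDE_PARAM = 'languageSwitcherABTest';

/**
 * Decide which treatment to show.
 *
 * @param {string} search Query string, e.g. location.search
 * @param {Function} bucketAsUsual Hypothetical fallback returning 'control' or 'A'
 * @return {string} 'control' or 'A'
 */
function resolveTreatment( search, bucketAsUsual ) {
  const value = new URLSearchParams( search ).get( OVERRIDE_PARAM );
  if ( value === 'control' || value === 'A' ) {
    // Explicit override: skip bucketing entirely.
    return value;
  }
  // Parameter absent (or, by assumption, unrecognised): bucket as usual.
  return bucketAsUsual();
}
```

In a browser this could be called as `resolveTreatment( location.search, bucketAsUsual )`.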

Developer Notes

We can't bucket logged-out users on the server, which complicates things:

We'll have to deliver the page in an "undefined" state until we can make a determination, i.e. we'll have to ship both features in a working state and then reveal one of them to the user. It's worth noting again that, since all JavaScript is loaded asynchronously, there's a small risk that a user who has previously expanded the sidebar might see a list of 300+ languages suddenly appear.

Further, we'll have to ensure that users with JavaScript disabled still receive a working experience, e.g. by only hiding the widgets when JavaScript is available:

.client-js.vector-language-ab-test {
  .mw-portlet-lang, .vector-language-switcher {
    /* Initial state for all users in the A/B test. */
    display: none;
  }
}
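
Once the bucket is known, the reveal step could be as small as swapping classes on the root element. This is only a sketch: the per-bucket class names are hypothetical, and it assumes extra CSS (not shown) that hides whichever treatment the user was not bucketed into.

```javascript
// Only "vector-language-ab-test" appears in the CSS above; the per-bucket
// class names are hypothetical.
const PENDING_CLASS = 'vector-language-ab-test';
const BUCKET_CLASSES = {
  control: 'vector-language-ab-test-control',
  A: 'vector-language-ab-test-A'
};

/**
 * Mark the bucket on the root element and lift the initial "hide both" state.
 *
 * @param {Object} classList Anything with add()/remove(), e.g.
 *   document.documentElement.classList
 * @param {string} bucket 'control' or 'A'
 */
function revealTreatment( classList, bucket ) {
  if ( !Object.prototype.hasOwnProperty.call( BUCKET_CLASSES, bucket ) ) {
    throw new Error( 'Unknown bucket: ' + bucket );
  }
  // Add the bucket class first so there is no moment where both widgets show.
  classList.add( BUCKET_CLASSES[ bucket ] );
  classList.remove( PENDING_CLASS );
}
```

In a browser this could be called as `revealTreatment( document.documentElement.classList, bucket )`.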

Finally, the ULS will have to be changed to handle the UI element not being present when it loads. At the time of writing, the ULS binds click event handlers to specific UI elements. If we made the ULS rely on those click events bubbling to the body element, then it would be agnostic of when the UI element was added to the page, i.e.

UniversalLanguageSelector/resources/js/ext.uls.interface.js
// L382
$( document.body ).on(
  'click',
  '.uls-trigger',
  clickHandler // Defined further up in initInterface()
);

Event Timeline

Is old treatment languages in the sidebar? Is new treatment language in the header?

Readers Web will be A/B testing the existing treatment of the language switcher and a new treatment being worked on as part of Desktop Improvements. The initial cohort will be all users (not just logged-in users).

Note, given the old and new treatment require HTML changes I think an A/B test could only feasibly be run here for logged in users. Is that a problem? Is that an explicit requirement? Is the A/B test bucketing planned based on page title?

i.e. we'll have to ship both features in a working state and then reveal one of them to the user.

I don't think shipping both is possible, as the language list makes heavy use of IDs which gadgets may or may not use, and rendering both would create elements with duplicate IDs...

If we're prepared to break the non-JS version of the button, we could adapt the existing implementation to always render in the sidebar and just render the button. This would require a considerable rearchitecture of the existing code and our feature flags, but it might be workable (I wish I'd known about an anon A/B test earlier!).

Is old treatment languages in the sidebar? Is new treatment language in the header?

Yes.

Readers Web will be A/B testing the existing treatment of the language switcher and a new treatment being worked on as part of Desktop Improvements. The initial cohort will be all users (not just logged-in users).

Note, given the old and new treatment require HTML changes I think an A/B test could only feasibly be run here for logged in users. Is that a problem? Is that an explicit requirement? Is the A/B test bucketing planned based on page title?

It would be preferable to run the test across all users, as we currently have little information on language switching among anons.

i.e. we'll have to ship both features in a working state and then reveal one of them to the user.

I don't think shipping both is possible, as the language list makes heavy use of IDs which gadgets may or may not use, and rendering both would create elements with duplicate IDs...

If we're prepared to break the non-JS version of the button, we could adapt the existing implementation to always render in the sidebar and just render the button. This would require a considerable rearchitecture of the existing code and our feature flags, but it might be workable (I wish I'd known about an anon A/B test earlier!).

Would it be possible to estimate how difficult this approach would be? Also, to confirm, this would allow us to test for both treatments as indicated above?

Definitely feasible, provided we're okay with a performance penalty (a delay) in showing the languages, as demonstrated below:

[Image: langab.gif, 440×883 px, 147 KB]

POC: https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/667716

The A/B test will also need to consider cached HTML and the fact that the language button is tied to the new indicators placement.

Is old treatment languages in the sidebar? Is new treatment language in the header?

Readers Web will be A/B testing the existing treatment of the language switcher and a new treatment being worked on as part of Desktop Improvements. The initial cohort will be all users (not just logged-in users).

Note, given the old and new treatment require HTML changes I think an A/B test could only feasibly be run here for logged in users.

Not necessarily. We can, for example, ship both features in a working state but in the control state until the A/B test is enabled. We trade off a higher degree of accuracy in our analysis (including being able to draw conclusions about the behaviour of anons in this context) against performance and complexity for 3 weeks (2 weeks for the A/B test + 1 week for tidy up).

Is the A/B test bucketing planned based on page title?

Great question. Using page random as the unit of variance would mean that we don't have to ship both treatments at the same time, thereby reducing the performance cost of the A/B test, but at the cost of a slightly more fraught rollout as you have less granular control over when the A/B test is enabled for everyone (the A/B test is enabled on the page when it's invalidated in our edge cache(s)).
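
To illustrate bucketing by page rather than by user, here is a sketch that assumes the page's random value is a float in [0, 1) and that the test is split between two buckets. The function name and the default 50/50 split are hypothetical, not something decided in this task.

```javascript
/**
 * Bucket a page (not a user) from its random value.
 *
 * @param {number} pageRandom Assumed to be a float in [0, 1)
 * @param {number} [treatmentShare=0.5] Hypothetical share of pages shown treatment A
 * @return {string} 'A' or 'control'
 */
function bucketFromPageRandom( pageRandom, treatmentShare = 0.5 ) {
  if ( !( pageRandom >= 0 && pageRandom < 1 ) ) {
    throw new RangeError( 'pageRandom must be in [0, 1)' );
  }
  return pageRandom < treatmentShare ? 'A' : 'control';
}
```

Because the bucket depends only on the page, it can be baked into the cached HTML, which is why only one treatment needs to be shipped per page.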

i.e. we'll have to ship both features in a working state and then reveal one of them to the user.

I don't think shipping both is possible, as the language list makes heavy use of IDs which gadgets may or may not use, and rendering both would create elements with duplicate IDs...

This seems worthy of investigation. Should this be moved back into Needs Analysis?

Nice work on the proof of concept!

<snip /> and the fact that the language button is tied to the new indicators placement.

Could you expand on this?

Not necessarily. We can, for example, ship both features in a working state but in the control state until the A/B test is enabled. We trade off a higher degree of accuracy in our analysis (including being able to draw conclusions about the behaviour of anons in this context) against performance and complexity for 3 weeks (2 weeks for the A/B test + 1 week for tidy up).

Yes, we can ship both features in a working state, but we would need to consider how this is enabled with potential performance implications and complexity as highlighted in T275807#6872967

This seems worthy of investigation. Should this be moved back into Needs Analysis?

Provided we build one of the experiences in JavaScript we can avoid this issue. If we want to make the change in HTML this would delay things as we'd need to transition the new language button to use unique IDs and not be compatible with existing gadgets.

<snip /> and the fact that the language button is tied to the new indicators placement.
Could you expand on this?

There is a config flag, $wgVectorLanguageInHeader, which currently moves the language button to the header; however, it also changes the position of indicators in the DOM. I don't think this should impact the A/B test. If it does, we will want to disentangle the indicator change from the button, preferably by shipping it prior to the A/B test and leaving some time for users to get used to it.

ovasileva moved this task from Upcoming to Needs Analysis on the Readers-Web-Backlog (Kanbanana-FY-2020-21) board.

Olga will set up a special meeting to talk through the options here. As far as I can see, we can do 1 of 3 things, and we need to decide which approach to take before estimating:

  1. A/B test includes anonymous users, with an accepted performance trade-off (trade-off: delayed display of languages)
  2. A/B test includes anonymous users, without performance trade-offs. This would make use of inline JavaScript (added complexity) and require rendering both the old and new treatment in the HTML, which will require several tasks to modify the new language HTML so it doesn't duplicate element IDs. This may also result in showing and hiding the language button to avoid performance trade-offs (trade-off: requires more time and effort, possibly confusing UX)
  3. We do an A/B test for logged-in users only (trade-off: no anons in the A/B test, smaller audience)

Thanks for the summary, @Jdlrobson.

I think there's another option between #1 and #2, which involves shipping both treatments in a visible loading state, which is a tried-and-tested method to give quick feedback to the user that something is happening thereby improving perceived performance.

See also T268504#6796185 for some discussion.

We talked today about implementing option 1 and adding timing metrics to get a sense of the impact before considering 2.

We talked today about implementing option 1 and adding timing metrics to get a sense of the impact before considering 2.

Next steps: putting option 1 on the beta cluster and testwiki to test the delay

Change 667716 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/skins/Vector@master] A/B test for language button for anonymous users

https://gerrit.wikimedia.org/r/667716

Change 673151 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/skins/Vector@master] Inform anonymous A/B test by tracking time from DOM interactive

https://gerrit.wikimedia.org/r/673151

Following up on this now we're actively investigating how to run this A/B test on logged out users:

i.e. we'll have to ship both features in a working state and then reveal one of them to the user.

I don't think shipping both is possible, as the language list makes heavy use of IDs which gadgets may or may not use, and rendering both would create elements with duplicate IDs...

A quick global search in the gadget, gadget talk, and gadget definition namespaces yields no results for 'p-lang'. However, a code search for the same ID yields a host of codebases.

Is there a list of IDs that you had in mind?

Change 673151 merged by jenkins-bot:
[mediawiki/skins/Vector@master] Inform anonymous A/B test by tracking time from navigationStart

https://gerrit.wikimedia.org/r/673151

Following up on this now we're actively investigating how to run this A/B test on logged out users:

i.e. we'll have to ship both features in a working state and then reveal one of them to the user.

I don't think shipping both is possible, as the language list makes heavy use of IDs which gadgets may or may not use, and rendering both would create elements with duplicate IDs...

A quick global search in the gadget, gadget talk, and gadget definition namespaces yields no results for 'p-lang'. However, a code search for the same ID yields a host of codebases.

Is there a list of IDs that you had in mind?

#p-lang, #p-lang-label

This might not be a problem in the wild, but it's also possible for arbitrary IDs to be added via a hook, e.g.

// Example: a hook handler can add a language entry with an arbitrary ID.
$wgHooks['SidebarBeforeOutput'][] = function ( $a, &$portlets ) {
	$portlets['LANGUAGES'][] = [ 'id' => 'klingon', 'text' => 'klingon', 'lang' => 'kling' ];
};

Moved task to upcoming.
Next steps:

  • @phuedx will backport on 24th March
  • After offsite we will review data and make a decision about this task.
  • Estimate if we decide to go ahead

Change 674382 had a related patch set uploaded (by Phuedx; owner: Jdlrobson):
[mediawiki/skins/Vector@wmf/1.36.0-wmf.36] Inform anonymous A/B test by tracking time from navigationStart

https://gerrit.wikimedia.org/r/674382

Change 674382 merged by jenkins-bot:
[mediawiki/skins/Vector@wmf/1.36.0-wmf.36] Inform anonymous A/B test by tracking time from navigationStart

https://gerrit.wikimedia.org/r/674382

Mentioned in SAL (#wikimedia-operations) [2021-03-25T11:44:08Z] <ladsgroup@deploy1002> Synchronized php-1.36.0-wmf.36/skins/Vector/resources: [[gerrit:674382|Inform anonymous A/B test by tracking time from navigationStart (T275807)]] (duration: 01m 09s)

Here's a graph of the lower bound, p50, and p75 of the Vector.ready timer in Graphite: https://graphite.wikimedia.org/S/r

Option 1 and its disadvantages

For clarity, I'm defining option 1 as having the server make the html for one of the language widgets but that widget would be hidden until our javascript either reveals it or moves it/transforms it into the other language widget based on the A/B bucket determined by our javascript.

One of the main downsides to this approach is that the language links in the sidebar would no longer be immediately visible to the user when that part of the page is painted by the browser (the status quo). From my anecdotal experience, modern Vector's first paint includes the rendering of the language links most of the time. With option 1, the languages would only be visible after the relevant javascript executes.

What our instrumentation measured

Given this, our instrumentation measured the number of milliseconds from the start of navigation (e.g. after the user enters en.wikipedia.org into their address bar and presses enter) to the point at which the JS in Vector starts to execute. The duration represents the earliest point at which the JavaScript would execute and could make the widget visible to the user.
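
Roughly, a measurement like this can be taken with `performance.now()`, which in a browser returns the milliseconds elapsed since the page's time origin (approximately the start of navigation). The sketch below is not the merged patch; the `track` callback and the `timing.Vector.ready` topic name are assumptions for illustration.

```javascript
/**
 * Report how long after the start of navigation this code began executing.
 *
 * @param {Object} perf Anything with now(), e.g. window.performance
 * @param {Function} track Hypothetical reporter taking ( topic, milliseconds )
 * @return {number} Elapsed whole milliseconds
 */
function reportReadyTiming( perf, track ) {
  // performance.now() is fractional; round to whole milliseconds for reporting.
  const elapsed = Math.round( perf.now() );
  // Hypothetical topic name, chosen to resemble the timer discussed in this task.
  track( 'timing.Vector.ready', elapsed );
  return elapsed;
}
```

Calling this as the first statement of the skin's JavaScript gives the earliest moment at which a hidden widget could be revealed.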

What our instrumentation did not measure

It's important to keep in mind that our instrumentation did not collect data about the point at which the user first sees part of the site (first paint) or the point the user starts to interact with the site or would want to try to change languages (which would be much harder to measure). In other words, an unknown amount of the time represented by these graphs is time the user is waiting for the part of the page containing the language widget to render. That unknown amount of time is going to be a factor regardless of the option we choose for this ticket.

p50, p75 results

Having said that, the results suggest that 50% of users would see the language treatment within ~1.7 seconds after the start of navigation, and 75% of users would see the language treatment within ~2.5 seconds after the start of navigation. IMO, because these times are pretty fast, this population of people won't be affected as much by the delay caused by option 1.

p95, p99 results

The p95 is more variable than the others, ranging anywhere from 5 to 45 seconds but staying in the 5-10 second range most of the time. The p99 is variable as well, with even larger extremes. I'm much more concerned about the p95 and p99 population, as they could go a significant amount of time without being able to see or interact with the language widget. However, I'm very curious what the first paint time is for this population and how it compares to the start-of-JS-execution time we measured in our instrumentation. If the first paint time is within several seconds of the JS time, the delay may not matter as much as it would if the delta were more than 10 or 20 seconds.

Conclusion

Given these results, I'm most concerned about the p95, p99 crowd of people and I think some more data (e.g. first paint info or some other reference point) could be helpful to make a more informed call, but I understand we have time constraints. We have several options. Perhaps we can schedule another meeting to talk about the next steps as a group?

cc @ovasileva, @Jdlrobson

Conclusion

Given these results, I'm most concerned about the p95, p99 crowd of people and I think some more data (e.g. first paint info or some other reference point) could be helpful to make a more informed call, but I understand we have time constraints. We have several options. Perhaps we can schedule another meeting to talk about the next steps as a group?

cc @ovasileva, @Jdlrobson

Thank you for reviewing this @nray! I agree that the p95 and p99 would be of largest concern here. Scheduled a meeting early next week to discuss in more detail.

What our instrumentation did not measure

It's important to keep in mind that our instrumentation did not collect data about the point at which the user first sees part of the site (first paint)

See https://grafana.wikimedia.org/d/4SfKp_XGk/t275807?viewPanel=2&orgId=1 for a plot of NavigationTiming's firstPaint metric for the desktop platform (i.e. Vector) for all users alongside our vector.ready metric. Note well that you can select p50, p75, p95, and p99 for both metrics.

… or the point the user starts to interact with the site or would want to try to change languages (which would be much harder to measure).

The UniversalLanguageSelector instrument does record the number of milliseconds from when the instrument begins executing to when the user changes language. Note well that this is not directly comparable to the above because they don't share the same time origin. That said, the p50, p75, and p95 percentiles of this metric are all typically an order of magnitude larger than the above (and than, e.g., the same percentiles of NavigationTiming's loadEventEnd metric). Values are in milliseconds:

+-----+---------+----------+-----------+-----------+
| day | median  |   p_75   |   p_95    |   p_99    |
+-----+---------+----------+-----------+-----------+
|   1 |   15475 | 34628.25 | 112324.25 | 215615.18 |
|   2 |   16468 |    34298 | 102582.55 | 295264.75 |
|   3 |   16072 |  33065.5 |  153204.1 | 550102.43 |
|   4 |   17584 |    37809 |    172180 | 1162738.4 |
|   5 | 17095.5 |    34890 |  209532.5 |  930779.7 |
|   6 |   16449 |  28576.5 |  100583.1 |  512025.9 |
|   7 |   15384 |  29142.5 |   91348.9 | 342737.72 |
|   8 |   15606 |  30855.5 |  141465.6 | 837037.16 |
|   9 |   17354 | 33196.25 |  148581.5 | 421404.14 |
|  10 |   16922 |  33505.5 |  150478.4 | 572283.76 |
|  11 | 17637.5 | 34359.25 |    112454 | 338986.52 |
|  12 |   16955 |    31440 |  123101.8 | 596427.12 |
|  13 |   17379 |  31495.5 |  108310.4 | 523139.82 |
|  14 |   17050 | 32645.25 | 152746.15 | 893564.69 |
|  15 |   16016 |    30851 |  125581.2 | 493262.36 |
|  16 | 16031.5 |  33114.5 |  117720.2 | 477142.57 |
|  17 |   17020 | 35059.75 | 125467.15 | 326368.08 |
|  18 | 16237.5 |  32233.5 | 118852.15 | 470346.86 |
|  19 | 15490.5 |  31041.5 | 103469.75 | 513156.09 |
|  20 | 18681.5 | 33026.75 |  96995.05 |  356272.1 |
+-----+---------+----------+-----------+-----------+

select
  day,
  round( percentile( cast( event.timetochangelanguage as bigint ), 0.5 ), 2 ) as median,
  round( percentile( cast( event.timetochangelanguage as bigint ), 0.75 ), 2 ) as p_75,
  round( percentile( cast( event.timetochangelanguage as bigint ), 0.95 ), 2 ) as p_95,
  round( percentile( cast( event.timetochangelanguage as bigint ), 0.99 ), 2 ) as p_99
from
  event.universallanguageselector
where
  -- https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/672742, included in the -wmf.37 branch was deployed on 1st April 2021.
  year = 2021
  and month = 04
  and event.action = 'language-change'
group by
  day
order by
  day asc
limit 10000
;
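
For anyone who wants to sanity-check percentiles like these outside of Hive, here is a small sketch that computes a percentile by linear interpolation between the closest ranks. That this matches how `percentile()` produced the figures above is an assumption, and the sample values in the usage note are made up.

```javascript
/**
 * Percentile by linear interpolation between closest ranks.
 *
 * @param {number[]} samples Non-empty list of durations
 * @param {number} p Fraction in [0, 1], e.g. 0.95 for p95
 * @return {number}
 */
function percentile( samples, p ) {
  if ( samples.length === 0 ) {
    throw new Error( 'No samples' );
  }
  if ( !( p >= 0 && p <= 1 ) ) {
    throw new RangeError( 'p must be in [0, 1]' );
  }
  // Copy before sorting so the caller's array is left untouched.
  const sorted = samples.slice().sort( ( a, b ) => a - b );
  const position = ( sorted.length - 1 ) * p;
  const lower = Math.floor( position );
  const upper = Math.ceil( position );
  const weight = position - lower;
  return sorted[ lower ] + ( sorted[ upper ] - sorted[ lower ] ) * weight;
}
```

For example, `percentile( [ 10, 20, 30, 40 ], 0.75 )` gives 32.5: the position is 2.25, a quarter of the way from 30 to 40.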

Jdlrobson added a subscriber: nray.

We met today to talk about this and we are going to decline it. Assigning Olga to summarize.

Thank you @nray, @phuedx, @Jdlrobson for all your work here. We met yesterday and discussed the timing analysis as well as our options:

  • Running the A/B test as outlined in the plan given the times to load from above
  • Running a script that delays page load until the language button is loaded
  • Performing a before and after analysis in which we collect data from the current implementation for a week prior to

Due to significant concerns raised about the accuracy of the data if the button loads too late or flashes, we decided to move forward with the before/after analysis.

Next steps and clarification:

  • No A/B test for logged-out users. Here, we will perform a before-and-after comparison of usage
  • We will perform an A/B test for logged-in users. This test will allow us to see how the changes affect logged-in-specific use cases (e.g. switching languages for purposes of translation or reference during editing)

What this means for deployments:

  • Once T275762: Instrument clicks to links in the Languages list in the sidebar is complete, we will enable the instrumentation and collect data for 2 weeks.
  • During this time, we will deploy the new functionality to all logged-in users outside of the pilot wikis and perform QA
  • Once the 2 week data collection period is over, we will deploy the new interface to the pilot wikis as follows:
    • 100% for all logged-out users
    • Bucket all logged in users and perform an A/B test
  • After two weeks, enable to 100% for all users on pilot wikis and begin analyzing the data
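
For the logged-in A/B test, bucketing can be deterministic because the user is known. The sketch below illustrates one such approach, splitting on the parity of the user ID. The function name and the even/odd split are assumptions, not the agreed implementation, and it is written in JavaScript purely for illustration.

```javascript
/**
 * Deterministically bucket a logged-in user.
 *
 * @param {number} userId Positive integer for logged-in users; by assumption,
 *   anything else is treated as logged out
 * @return {string|null} 'control', 'A', or null when the user is not in the test
 */
function bucketLoggedInUser( userId ) {
  if ( !Number.isInteger( userId ) || userId <= 0 ) {
    // Logged-out users are not part of the A/B test under the plan above.
    return null;
  }
  // Hypothetical split: even IDs see the existing treatment, odd IDs the new one.
  return userId % 2 === 0 ? 'control' : 'A';
}
```

The same user always lands in the same bucket, so no per-user state needs to be stored for the duration of the test.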

Re-opening this task so it can be repurposed for the harness required for A/B testing for logged-in users only

Change 667716 abandoned by Jdlrobson:

[mediawiki/skins/Vector@master] A/B test for language button for anonymous users

Reason:

https://gerrit.wikimedia.org/r/667716

I have opened T280825 to avoid confusion between the requirements and the discussion here, and to preserve this discussion for posterity.