[SPIKE] Estimate New Discussion Tool A/B Test Feasibility
Closed, Resolved, Public

Description

This task is about estimating what would be required to run an A/B test to evaluate the impact the New Discussion Tool is having on Junior Contributors' ability to successfully start new discussions without causing a significant increase in disruption to other volunteers.

Open question(s)

  • What would be required (see Requirements below) to run an A/B test to determine the extent to which the New Discussion Tool is causing a change in the metrics listed below?
    • Metrics [ii]
      • I. How likely Junior Contributors who open a new discussion workflow are to publish a new discussion to a talk page and
      • II. The percentage of published new topic edits that are reverted.
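As an illustration only, the two metrics could be computed from per-attempt records along the following lines. This is a minimal sketch: the `Attempt` record and its `published` and `reverted` fields are hypothetical and do not correspond to any actual logging schema, and for Metric I the caller is assumed to have already filtered the records down to Junior Contributors.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Attempt:
    """One opened new discussion workflow (hypothetical record, not a real schema)."""
    published: bool  # the workflow ended in a published new discussion
    reverted: bool   # the published edit was later reverted


def completion_rate(attempts: List[Attempt]) -> Optional[float]:
    """Metric I: share of opened workflows that ended in a published new discussion."""
    if not attempts:
        return None  # no data, so the rate is undefined
    return sum(a.published for a in attempts) / len(attempts)


def revert_rate(attempts: List[Attempt]) -> Optional[float]:
    """Metric II: share of published new topic edits that were reverted."""
    published = [a for a in attempts if a.published]
    if not published:
        return None  # nothing was published, so the rate is undefined
    return sum(a.reverted for a in published) / len(published)


# Made-up sample data, for illustration only.
sample = [
    Attempt(published=True, reverted=False),
    Attempt(published=True, reverted=True),
    Attempt(published=False, reverted=False),
    Attempt(published=True, reverted=False),
]
print(completion_rate(sample))  # 3 of 4 attempts were published
print(revert_rate(sample))      # 1 of 3 published edits was reverted
```

In an A/B test, both functions would be evaluated separately for the test and control groups and the results compared.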

Requirements

The answer to the question above should, at a minimum, include estimates of the following:

  • The number of wikis the A/B test would need to be run on
  • How long the A/B test would need to run
Done
  • The Open question(s) above are answered and documented on this ticket

i. Talk pages project/New Discussion/Measurement Plan

Event Timeline

ppelberg moved this task from Backlog to Triaged on the DiscussionTools board.
ppelberg moved this task from Backlog to Analytics on the Editing-team (Tracking) board.

Details of the Reply Tool AB test set-up for reference:

  • Run from 11 February 2021 through 10 March 2021 (4 weeks)
  • Included logged-in users that have not previously interacted with the reply tool (defined as users whose discussiontools-editmode preference is empty)
  • Included 22 Wikipedias
  • During this test, 50% of users included in the test had the Reply tool automatically enabled, and 50% did not. Users at these Wikipedias were still able to turn the tool on or off in Special:Preferences.
  • Data logged during that timeframe:
    • About 1300 edit attempts by Junior Contributors in each test group
    • A little over 2500 edit attempts across all experience levels in each test group

I will use this information to help determine the AB test set-up requirements for the New Discussion Tool.

Here are my current recommended estimates:

The number of wikis the A/B test would need to be run on

I'd recommend running the test on at least 15 wikis, including a combination of large and medium-sized wikis. Small wikis can also be included, but these should be in addition to the 15 large and medium-sized wikis to ensure we have a sufficient number of wikis with a representative sample of data.

Rationale
(1) Size of the wikis included: In this proposed AB test, each user would serve as our observational unit. As a result, we would need a sufficient number of distinct users across all participating wikis to calculate the metrics above. If we included larger wikis in this test, such as English Wikipedia or Spanish Wikipedia, we'd need fewer wikis in total to obtain a sufficient sample of users. If we included only smaller wikis, we'd likely need more wikis to get a sufficient number of users.
(2) Accounting for the effect of the wiki on the metrics: In the Reply Tool AB test, we accounted for the effect of the wiki on the success probability of a Junior Contributor completing an edit, which allowed us to more accurately understand the effect of the reply tool. I'd recommend we do the same in this test. The more distinct wikis we are able to include in the test, the more data we will have to account for the effect of each wiki.
(3) Obtaining a representative sample for each wiki: In the Reply Tool test, there were 22 participating wikis, but only about 15 of them (the larger and mid-sized wikis) logged enough events to be included in the analysis. Smaller wikis such as Swahili and Afrikaans had only a couple of events logged during the Reply Tool test and, as a result, we were not able to draw any conclusions about these wikis.

How long the A/B test would need to run

I'd recommend running the test for a minimum of 4 weeks, similar to the Reply Tool test. After the 4-week period, I'd recommend reviewing the number of events logged to confirm it is sufficient for the analysis before ending the test. Note: This assumes we are running the test on 15+ large and small wikis. The test might need to run longer if a smaller set of wikis is included.

Rationale:
A review of daily edit attempts indicates that there are roughly twice as many daily reply tool edit attempts as new discussion tool edit attempts (looking at wikis where both are deployed as opt-in features). If we assume the same ratio, a 4-week test on a similar set of participating wikis would give us around 600 edit attempts by junior contributors and 1250 edit attempts across all experience levels. While this is less than the Reply Tool test, it would be sufficient to complete the analysis and provide data on weekday/weekend trends, though more data is always helpful if possible.
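The arithmetic behind these figures can be sketched as follows. This is a rough illustration, not the analysis actually used: the function name and constants are made up for this example, the Reply Tool counts are the approximate per-test-group figures reported earlier in this comment, and scaling linearly with the number of weeks is an added assumption.

```python
# Approximate Reply Tool A/B test counts per test group over 4 weeks (from this ticket).
REPLY_JUNIOR_ATTEMPTS = 1300
REPLY_ALL_ATTEMPTS = 2500

# Roughly twice as many Reply Tool attempts as New Discussion Tool attempts.
REPLY_TO_NEW_DISCUSSION_RATIO = 2.0


def project_attempts(reply_attempts, ratio=REPLY_TO_NEW_DISCUSSION_RATIO,
                     weeks=4, baseline_weeks=4):
    """Project New Discussion Tool edit attempts from Reply Tool counts.

    Assumes the same set of wikis and that attempts scale linearly with
    test duration (an assumption made for this sketch only).
    """
    return reply_attempts / ratio * (weeks / baseline_weeks)


print(project_attempts(REPLY_JUNIOR_ATTEMPTS))  # 650.0, i.e. "around 600"
print(project_attempts(REPLY_ALL_ATTEMPTS))     # 1250.0
```

Passing a larger `weeks` value gives a quick sense of how much extra data a longer test might yield under the same assumptions.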

@MNeisler the information you shared in T290204#7356556 contains all that @Whatamidoing-WMF and I needed to draft a list of wikis we think would be valuable to include in the New Discussion Tool A/B test (T277825), as well as the approximate duration of the test. Thank you.

I've documented the above in the following places:

Next steps
I'm thinking the next step will be for you, @MNeisler, to verify that the list of potential wikis in T291306 satisfies the criteria you shared in T290204#7356556.

As such, I'm going to resolve this task.