
HostBot-AI study coordination with NotConfusing
Closed, Resolved (Public)

Description

Research page: https://meta.wikimedia.org/wiki/Research:ORES-powered_TeaHouse_Invites

Currently scheduled for deployment in early/mid Jan.

  • review research page
  • calculate average response rate to invites over past 12 months (it's about 2.5%)
  • implement "odd numbered user ID" check in HostBot code
  • securely share HostBot credentials with Max
  • log bot request on User:HostBot
  • remove odd-number check after experiment concludes
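
The odd/even split in the checklist above can be sketched as follows. This is a minimal illustration, not the actual HostBot code: it assumes (based on the "evens-only" note later in this task) that even user IDs go to HostBot-AI and odd user IDs stay with the existing HostBot, and the function and arm names are hypothetical.

```python
def assign_arm(user_id: int) -> str:
    """Assign a user to an experiment arm by user-ID parity.

    Assumed mapping (not confirmed by the task): even -> HostBot-AI,
    odd -> the existing heuristic HostBot.
    """
    if user_id <= 0:
        raise ValueError("user_id must be a positive integer")
    return "hostbot-ai" if user_id % 2 == 0 else "hostbot-heuristic"


def eligible_for(arm: str, user_ids: list[int]) -> list[int]:
    """Keep only the users that the given arm is allowed to invite."""
    return [uid for uid in user_ids if assign_arm(uid) == arm]
```

Because the split depends only on the user ID, both bots can apply it independently without sharing state, and removing the check after the experiment is a one-line change.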

Event Timeline

Capt_Swing triaged this task as Normal priority. Dec 4 2018, 11:35 PM
Capt_Swing created this task.

@notconfusing @Halfak please tag and inherit as appropriate. Not sure where else you're tracking this work.

notconfusing added a comment. Edited Dec 5 2018, 3:16 PM

Did some power analysis:
https://egap.shinyapps.io/power-app/

With the binary variable: did the user post on Wikipedia talk:Teahouse?
With the baseline proportion 0.04
To see an increase to 0.05
with an 80% chance
requires: 14000 samples per group

28000 samples ≈ 94 days * 300 invites per day.


To see an increase to 0.06
requires: 3721 samples per group

7442 samples = 25 days * 300 invites per day.


So the ultra-conservative route is to go for 12 weeks. The medium road is to go for 4 weeks.
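
For a rough cross-check of figures like these, the textbook normal-approximation formula for comparing two proportions can be sketched as below. This is not the calculation the EGAP app performs; the two-sided alpha of 0.05 is an assumption, and `n_per_group` is a hypothetical helper name.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(p1: float, p2: float, alpha: float = 0.05,
                power: float = 0.80) -> int:
    """Approximate sample size per group to detect a change from p1 to p2.

    Uses the unpooled normal approximation for a two-sided test:
        n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p2-p1)^2
    """
    if p1 == p2:
        raise ValueError("p1 and p2 must differ")
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(z ** 2 * variance / (p2 - p1) ** 2)


if __name__ == "__main__":
    for baseline, target in [(0.04, 0.05), (0.04, 0.06)]:
        print(baseline, target, n_per_group(baseline, target))
```

Under these assumptions the helper returns per-group sizes close to half of the figures quoted in this comment. That may mean the app reports the total N across both arms rather than N per group, which is worth confirming in the app before fixing the run length.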

Capt_Swing moved this task from Staged to In Progress on the Research board. Dec 10 2018, 6:05 PM
Capt_Swing updated the task description. Jan 3 2019, 10:08 PM

@Capt_Swing says the base rate is actually 2-3% so the updated calculations are:

With 80% chance of seeing an effect

from 2% to 3% we need 7,645 users per group
from 3% to 4% we need 11,000 users per group

At a rate of 150 users per group per day we need
between 51 and 74 days.
Since we've gone this far I think we may as well run for 75 days.

If we start Feb 1, then Feb 1 + 75 days is 17 Apr 2019.
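
The run-length and end-date arithmetic can be reproduced with a short sketch. The helper names are hypothetical; the 150 users per group per day and the 1 Feb 2019 start date come from the comment above.

```python
from datetime import date, timedelta
from math import ceil


def days_needed(users_per_group: int, per_group_per_day: int = 150) -> int:
    """Whole days required to reach the target group size, rounding up."""
    return ceil(users_per_group / per_group_per_day)


def end_date(start: date, run_days: int) -> date:
    """Calendar date that falls run_days after the start date."""
    return start + timedelta(days=run_days)


if __name__ == "__main__":
    print(days_needed(7645))                # 51
    print(days_needed(11000))               # 74
    print(end_date(date(2019, 2, 1), 75))   # 2019-04-17
```

Rounding up gives 51 and 74 days for the two scenarios, so a 75-day run covers both.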

Halfak added a comment. Jan 7 2019, 9:15 PM

+1 I think this sounds like a reasonable test. We should probably set aside some time to run a 2-3 day pilot as well. Is that in the plan?

Updates (ping @Halfak , @Capt_Swing )

+ HostBot-AI is live now!

+ Completed:
+ Login procedure as HostBot with OAuth
+ Extra checks for bot-exclusion templates and user warnings.
+ Inviting evens-only.
+ Lowered invite thresholds by 2 percentage points to compensate for an issue where live scores have lower average values than what I saw in training. (Drift?)
+ Completed manual tests triggering the bot from the command line. Now executing once per hour via cron on Wikimedia VPS.
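
The bot-exclusion check mentioned in the list above could look roughly like the sketch below. The actual HostBot-AI check is not shown in this task, so this is an assumption: a simplified reading of the `{{nobots}}` and `{{bots|allow=...}}` / `{{bots|deny=...}}` conventions that does not handle every documented variant (for example `optout`) and does not cover the user-warning check. The function name `allows_bot` is hypothetical.

```python
import re

# {{nobots}} excludes every bot.
_NOBOTS = re.compile(r"\{\{\s*nobots\s*\}\}", re.IGNORECASE)
# {{bots|allow=...}} or {{bots|deny=...}} with a comma-separated bot list.
_BOTS = re.compile(
    r"\{\{\s*bots\s*\|\s*(allow|deny)\s*=\s*([^}|]*)", re.IGNORECASE
)


def allows_bot(wikitext: str, bot_name: str) -> bool:
    """Return False if the page text asks this bot to stay away."""
    if _NOBOTS.search(wikitext):
        return False
    match = _BOTS.search(wikitext)
    if match is None:
        return True  # no exclusion template found
    mode = match.group(1).lower()
    names = {n.strip().lower() for n in match.group(2).split(",") if n.strip()}
    listed = "all" in names or bot_name.lower() in names
    # allow= permits only the listed bots; deny= blocks only the listed bots.
    return listed if mode == "allow" else not listed
```

For example, `allows_bot("{{bots|deny=HostBot}}", "HostBot")` returns `False`, while a talk page with no exclusion template returns `True`.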

+ Need to check in on the code tomorrow to ensure that:

+ Maximum invites per day was not exceeded.
+ Minimum invites (150) were sent.
+ Error logs check out.
+ No angry emails or Talk Pages.

Lab Notebook comments

  1. Things I'm noticing about the live AI run: lots of the people invited have only edited their own user pages. Not necessarily a bad sign, but it may be a different sort of user we are inviting (in some cases they don't even have any main-namespace edits, and are just futzing about in their own talk pages).
  2. The drift issue leading to lower predictions, as mentioned earlier.

Also, assuming we don't need to restart the experiment for some reason, I'm setting a calendar reminder for 29 Apr 2019 = 12 Feb 2019 + 76 days (75 full days, plus half of the 12th).

Thanks for the updates. It's exciting to see this moving forward. Would you mind posting on the Teahouse talk page your observations about newcomers who are only editing their user page? https://en.wikipedia.org/wiki/Wikipedia_talk:Teahouse#Research_about_new_users

I imagine there could be concerns about that. Presumably, we could filter out newcomers who never edit mainspace if that is desirable.

You know, I noticed that Jmo's HostBot was also inviting just as many users with no main-namespace edits, which put me at ease. So I don't know if mentioning it will be too alarming.
After running for one day, I noticed that Hostbot-AI is

  1. functioning smoothly (at least not crashing)
  2. not inviting
  3. had one bug revolving around inviting editors who are "re-predicted" (considering editors again if it's the same day and they have a higher edit count than last time).
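
One way to guard against the "re-predicted" bug described above is to remember who has already been invited and skip them on later scoring passes. The sketch below is an assumption about how such a guard might look, not the fix that was actually applied; the function name and the in-memory set are hypothetical (a real bot would more likely persist this state in a database).

```python
from typing import Iterable


def users_to_invite(candidates: Iterable[tuple[int, float]],
                    already_invited: set[int],
                    threshold: float) -> list[int]:
    """Pick users to invite from (user_id, score) pairs, never twice.

    Mutates already_invited so that later passes on the same day skip
    users who were re-predicted after making more edits.
    """
    invites = []
    for user_id, score in candidates:
        if user_id in already_invited:
            continue  # re-predicted user: already handled earlier
        if score >= threshold:
            invites.append(user_id)
            already_invited.add(user_id)
    return invites
```

Calling the function once per cron run with the same `already_invited` set means a user who is scored again after additional edits is ignored rather than invited a second time.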

I just want to run a few more days of live testing until I'm happy with the class thresholds; then I think we are set to really go live with our experiment.

Aha! If old hostbot was doing it, I don't see reason to raise alarm.

Capt_Swing updated the task description. Feb 21 2019, 7:40 PM

Update: the experiment is live!

Update: we've requested a ~3 week extension of the trial to make sure we gather sufficient data; see the discussion here: https://en.wikipedia.org/wiki/Wikipedia:Bots/Requests_for_approval#HostBot_9

Capt_Swing updated the task description. Jul 10 2019, 8:24 PM

The trial has concluded, and @Maximilianklein will perform the analysis of the data we collected.

leila closed this task as Resolved. Jul 11 2019, 12:00 AM

@notconfusing (After Wikimania lightning demo) The results look really nice, and it's great to see more success out of the Teahouse.

I was wondering if this experiment also accounted for a possible bias in ORES for choosing users that are more likely to be retained?

That is, did we distinguish between users that ORES chose to highlight and who were retained because of the Teahouse invitation vs. those who would have been retained regardless? One could rule that out by having a control group on both sides (e.g. invite half of the users that ORES selects to the Teahouse and not the other half, and compare that against a similar 50-50 split of users selected by the legacy heuristic bot). I hope the question makes sense. The goal is to have more users retained in total with the Teahouse than without it (as opposed to having more retained users participate in the Teahouse).

Thanks for working on this!

notconfusing added a comment. Edited Aug 26 2019, 4:27 PM

Hi @Krinkle ,

Thanks for writing. Yes, as you noticed, there are many confounding factors. The original Teahouse paper[1] showed a retention improvement for heuristics-vs-control, so that side is "proved" (although a replication would always be great). As for AI-vs-control (not AI-vs-heuristics), I did have a plan for how to do that. Since I was to invite no more than 150 users per day, I tried to tune the threshold of the model to invite about 150 users per day, but on days that had more positives than that I gave the extra users the status "overflow", that is, would-be-invited. I planned to use these as a control for AI-vs-control. I see that I had 4,281 overflows and 8,223 invited, so I'm hoping we will have enough statistical power.

I'll ping you when I get time to look over the results. Thanks for your interest and keen eye.
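
The "overflow" scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code that ran in the experiment: it assumes candidates are processed in arrival order, and the function name and the "not-selected" label are hypothetical (the "invited" and "overflow" statuses and the cap of 150 come from the comment).

```python
from typing import Iterable


def label_candidates(scored: Iterable[tuple[int, float]],
                     threshold: float,
                     daily_cap: int = 150) -> dict[int, str]:
    """Label one day's (user_id, score) pairs.

    Users scoring at or above the threshold are "invited" until the daily
    cap is reached; further positives become "overflow" (would-be-invited)
    and can later serve as a comparison group.
    """
    statuses: dict[int, str] = {}
    invited_today = 0
    for user_id, score in scored:
        if score < threshold:
            statuses[user_id] = "not-selected"
        elif invited_today < daily_cap:
            statuses[user_id] = "invited"
            invited_today += 1
        else:
            statuses[user_id] = "overflow"
    return statuses
```

With `daily_cap=2`, a day with three users above the threshold yields two "invited" users and one "overflow" user.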

(BTW I just used your SUL tool, which worked very well so thanks for that too.)

[1] http://www.opensym.org/wp-content/uploads/2018/07/OpenSym2018_paper_15.pdf