Launch page previews A/B test on enwiki and dewiki
Closed, Resolved · Public · 2 Story Points

Description

Background

We will be performing an A/B test on enwiki and dewiki to gauge the performance of the page previews feature, especially in relation to fundraising.

Note that we're removing EventLogging instrumentation sampling from Page Previews in T172291: Launch page previews A/B test on enwiki and dewiki, so previous sampling rates are not a reliable guide to an ideal bucket size.

Acceptance Criteria

  • Launch A/B test (launched at 2:23 PM on 28 August 2017)
  • Verify bucketing is working (is popupEnabled false for 50% of pageLoaded events?)

@phuedx: Tracked in T175377: [Spike 4hrs] Verify EventLogging instrumentation/bucketing for the enwiki/dewiki A/B test
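The second criterion boils down to a proportion check. A minimal Python sketch, assuming events are available as dicts with `action` and `popupEnabled` fields (the field name `action` and the helper names are assumptions for illustration, not the Popups schema definition), and assuming, as the criterion does, that an even split is expected:

```python
import math


def popup_disabled_share(events):
    """Return (share, n): the share of pageLoaded events with popupEnabled false."""
    # Assumed event shape: {"action": ..., "popupEnabled": bool}
    loaded = [e for e in events if e.get("action") == "pageLoaded"]
    if not loaded:
        raise ValueError("no pageLoaded events to check")
    disabled = sum(1 for e in loaded if e.get("popupEnabled") is False)
    return disabled / len(loaded), len(loaded)


def z_score_against_half(share, n):
    """Normal-approximation z-score of an observed share against an expected 0.5."""
    return (share - 0.5) / math.sqrt(0.25 / n)


# Synthetic example: 520 of 1,000 pageLoaded events have popupEnabled == False.
# "otherAction" is a hypothetical placeholder that the check should ignore.
sample = (
    [{"action": "pageLoaded", "popupEnabled": False}] * 520
    + [{"action": "pageLoaded", "popupEnabled": True}] * 480
    + [{"action": "otherAction", "popupEnabled": True}] * 50
)
share, n = popup_disabled_share(sample)
z = z_score_against_half(share, n)
print(f"share={share:.3f}, n={n}, z={z:.2f}")  # share=0.520, n=1000, z=1.26
```

As a rule of thumb, |z| below about 2 is consistent with an even split; a much larger value would suggest the bucketing is skewed.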

Notes

  1. Launching the A/B test requires the following configuration in InitialiseSettings.php:
/* ... */

  'wmgUsePopups' => [
    'enwiki' => true,
    'dewiki' => true,
  ],
  'wmgPopupsUseBetaFeature' => [
    'enwiki' => false,
    'dewiki' => false,
  ],
  'wgPopupsAnonsExperimentalGroupSize' => [

    // 0 means previews are enabled for all anonymous users by default, which is the current behaviour on all wikis except enwiki and dewiki (per T162672).
    'default' => 0,

    'enwiki' => 0.006,
    'dewiki' => 0.016,
  ],
  'wgPopupsEventLogging' => [
    'default' => false,
    'enwiki' => true,
    'dewiki' => true,
  ],

/* ... */

  2. Remove the 'wgPopupsAnonsEnabledSamplingRate' and 'wgPopupsSchemaSamplingRate' entries from InitialiseSettings.php.
@phuedx: Base your work on rOMWC313a2e189f10: pagePreviews: remove invalidated popup sampling rate variables as @Niedzielski has already done this!
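Each setting in step 1 maps wiki database names to values, with 'default' as the fallback for wikis that aren't listed. The Python sketch below mirrors that lookup for the settings above; `resolve_setting` is a hypothetical helper for illustration only, and it models just the exact-key-then-'default' rule (wmf-config's other lookup features, such as wiki-group keys, are ignored).

```python
# Per-wiki settings from step 1, expressed as Python dicts (illustrative only).
SETTINGS = {
    "wmgUsePopups": {"enwiki": True, "dewiki": True},
    "wmgPopupsUseBetaFeature": {"enwiki": False, "dewiki": False},
    "wgPopupsAnonsExperimentalGroupSize": {"default": 0, "enwiki": 0.006, "dewiki": 0.016},
    "wgPopupsEventLogging": {"default": False, "enwiki": True, "dewiki": True},
}


def resolve_setting(settings, name, wiki):
    """Return the value of `name` for `wiki`, falling back to 'default'.

    Returns None when neither the wiki nor 'default' is listed.
    """
    values = settings[name]
    if wiki in values:
        return values[wiki]
    return values.get("default")


for wiki in ("enwiki", "dewiki", "frwiki"):
    size = resolve_setting(SETTINGS, "wgPopupsAnonsExperimentalGroupSize", wiki)
    logging_enabled = resolve_setting(SETTINGS, "wgPopupsEventLogging", wiki)
    print(wiki, size, logging_enabled)
```

Here frwiki stands in for any wiki outside the test: it picks up the 'default' entries, i.e. a group size of 0 and no EventLogging.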

Questions

  • What are the bucket sizes for each group per project?

@phuedx:

wiki   | on:off:control    | $wgPopupsAnonsExperimentalGroupSize
enwiki | 0.003:0.003:0.994 | 0.006
dewiki | 0.008:0.008:0.984 | 0.016

Caveat: these numbers are based on peak rates rather than average rates.
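The table follows from splitting $wgPopupsAnonsExperimentalGroupSize evenly between the "on" and "off" buckets and leaving the remainder in "control". A small Python sketch of that arithmetic; `bucket_sizes` is a hypothetical helper, and the even split is inferred from the table rather than taken from the Popups code:

```python
def bucket_sizes(group_size):
    """Split an experimental group size into on/off/control fractions.

    Assumes the experimental group is divided evenly between "on" and "off"
    and everyone else is in "control".
    """
    if not 0 <= group_size <= 1:
        raise ValueError("group_size must be between 0 and 1")
    half = group_size / 2
    return {"on": half, "off": half, "control": 1 - group_size}


for wiki, size in {"enwiki": 0.006, "dewiki": 0.016}.items():
    sizes = bucket_sizes(size)
    print(wiki, ":".join(f"{sizes[k]:.3f}" for k in ("on", "off", "control")))
# enwiki 0.003:0.003:0.994
# dewiki 0.008:0.008:0.984
```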

  • What's the maximum rate of events that we can send?

@Tbayer: Depends on the overall amount of data generated in combination with current disk space concerns, see T172322: Calculate how much Popups events EL databases can host

phuedx updated the task description. Aug 4 2017, 12:48 PM
Tbayer updated the task description. Aug 4 2017, 9:10 PM
Tbayer added a comment. Aug 4 2017, 9:17 PM

> @phuedx: Unfortunately, it's still 600/minute (see T172322: Calculate how much Popups events EL databases can host)

That's not what we have learned from T172322. To summarize briefly:

  • The 600/minute limit was introduced here last month as a ballpark figure: above it, one should first ask Analytics Engineering (AE) whether a schema needs to be set to HDFS-only (without the usual MySQL tables).
  • It refers to average, not peak, rates, and as I understand it now, it does not refer to the maximum possible intake rate (we have successfully used schemas with a 6000 events/sec peak rate before: MobileWebSectionUsage). Rather, it's about the current disk space shortage.
Jdlrobson added a subscriber: Jdlrobson.

T171853 is in the sign-off column. It should be possible to do this now, but we should bear in mind that signing off that task will give us more confidence that this is ready to go!

ovasileva updated the task description. Aug 23 2017, 5:15 PM
phuedx updated the task description. Aug 23 2017, 5:30 PM
Tbayer updated the task description. Aug 23 2017, 11:16 PM
Tbayer updated the task description. Aug 24 2017, 7:25 PM

I updated the talk page: https://meta.wikimedia.org/wiki/Schema_talk:Popups#Sampling
I hope that clears up the new A/C.

phuedx updated the task description. Aug 25 2017, 8:57 AM
Restricted Application added a subscriber: jeblad. Aug 25 2017, 8:57 AM

Thanks for starting that off, @Jdlrobson! I've tweaked the section a little to group the definition of the buckets and their behaviours, added permalinks to the codebase, and converted markdown to wikitext where appropriate.

phuedx updated the task description. Aug 25 2017, 9:57 AM
phuedx updated the task description.
phuedx added a subscriber: gerritbot.

Change 373171 had a related patch set uploaded (by Phuedx; owner: Sniedzielski):
[operations/mediawiki-config@master] pagePreviews: remove invalidated popup sampling rate variables

https://gerrit.wikimedia.org/r/373171

Change 373920 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[operations/mediawiki-config@master] Enable an A/B test for page previews on EN and DE wikis

https://gerrit.wikimedia.org/r/373920

Jdlrobson added a subscriber: elukey.

The plan is to deploy https://gerrit.wikimedia.org/r/373171 and https://gerrit.wikimedia.org/r/373920 on Monday. Pinging @elukey and @Tbayer: it will be helpful if both of you are around then!

Change 373171 merged by jenkins-bot:
[operations/mediawiki-config@master] pagePreviews: remove invalidated popup sampling rate variables

https://gerrit.wikimedia.org/r/373171

Change 373920 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable an A/B test for page previews on EN and DE wikis

https://gerrit.wikimedia.org/r/373920

Change 374390 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[operations/mediawiki-config@master] Enable Popups on en and de wiki for A/B test

https://gerrit.wikimedia.org/r/374390

Change 374390 merged by Niharika29:
[operations/mediawiki-config@master] Enable Popups on en and de wiki for A/B test

https://gerrit.wikimedia.org/r/374390

Stashbot added a subscriber: Stashbot.

Mentioned in SAL (#wikimedia-operations) [2017-08-28T19:23:24Z] <niharika29@tin> Synchronized wmf-config/InitialiseSettings.php: Grant arbcomers abusefilter-view-private and abusefilter-log-private at cswiki T174357; Enable popups on en and de wiki for A/B test T172291 (duration: 00m 43s)

Jdlrobson updated the task description. Aug 28 2017, 7:36 PM
Jdlrobson updated the task description.
Jdlrobson removed a project: Patch-For-Review.

After the SWAT deploy, Popups is no longer listed in Beta Features, and an "Enable previews" link is shown in the footer of all page views for anons.
When logged in, I see the option to enable page previews in the Appearance tab: https://en.wikipedia.org/wiki/Special:Preferences#mw-prefsection-rendering but it is off by default.

As an anon, I managed to simulate being in each of the groups by bucketing myself like so:

mw.storage.session.set( 'mwuser-sessionId', 191)
session id | group   | mw.popups.isEnabled() | can I see popups?
191        | control | false                 | false
1          | off     | false                 | false
354        | on      | true                  | true
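Setting the session ID works because bucketing is a deterministic function of it: the same ID always lands in the same bucket. The Python sketch below illustrates that idea with a hash-based assignment; `assign_bucket`, the experiment name and the use of SHA-256 are all assumptions for illustration, not the Popups or mw.experiments implementation, so it will not reproduce the mapping of IDs 191, 1 and 354 shown in the table above.

```python
import hashlib
from collections import Counter


def assign_bucket(session_id, group_size, experiment="hypothetical-popups-test"):
    """Deterministically map a session ID to "on", "off" or "control".

    Hashes experiment name + session ID to a number in [0, 1) and compares it
    with cumulative boundaries: [0, size/2) -> on, [size/2, size) -> off,
    [size, 1) -> control.
    """
    digest = hashlib.sha256(f"{experiment}:{session_id}".encode("utf-8")).digest()
    # 48 bits keeps the float division strictly below 1.0.
    point = int.from_bytes(digest[:6], "big") / 2**48
    if point < group_size / 2:
        return "on"
    if point < group_size:
        return "off"
    return "control"


counts = Counter(assign_bucket(i, 0.016) for i in range(100_000))
print({bucket: counts[bucket] / 100_000 for bucket in ("on", "off", "control")})
```

With enough session IDs, the observed shares should approach the configured 0.008:0.008:0.984 split for dewiki.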

I'm seeing events in the console for the control and on groups, and a spike on the graph: https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?var-schema=Popups&refresh=5m&orgId=1
We'll still want to verify the bucketing.

We can consider the A/B test up and running.

srv.byte_free is around 790GB at the moment. Keeping an eye on it per the A/C.

phuedx updated the task description. Aug 29 2017, 5:35 PM

@Ottomata, @Marostegui: With the bucket sizes that we've defined for this A/B test we're seeing a rate of ~10 Popups events/s at peak time. We'd like your permission to start collecting 100 events/s.
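For scale, the rates in this request can be put in the same units as the 600/minute guideline from T172322. A Python sketch of the unit conversion (helper names are hypothetical); note that ~10 events/s is a peak rate while the guideline refers to average rates, so the daily totals are upper bounds rather than forecasts:

```python
SECONDS_PER_MINUTE = 60
SECONDS_PER_DAY = 24 * 60 * 60

# Ballpark guideline discussed in T172322 (an average rate, not a hard cap).
GUIDELINE_PER_MINUTE = 600


def per_minute(events_per_second):
    return events_per_second * SECONDS_PER_MINUTE


def per_day(events_per_second):
    return events_per_second * SECONDS_PER_DAY


for rate in (10, 100):
    print(
        f"{rate}/s = {per_minute(rate)}/min "
        f"({per_minute(rate) / GUIDELINE_PER_MINUTE:.0f}x the guideline), "
        f"up to {per_day(rate):,} events/day"
    )
# 10/s = 600/min (1x the guideline), up to 864,000 events/day
# 100/s = 6000/min (10x the guideline), up to 8,640,000 events/day
```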

Hi!

@elukey, do you know what that change from 10 to 100 events/s means in terms of disk space?
@phuedx: the main issue is that dbstore1002 is running out of space (see T168303#3562196, from yesterday), and today we are at:

root@dbstore1002:~# df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   6.4T  5.8T  606G  91% /srv

I am not sure if @elukey is aware of more tables ready to be dropped, but I would suggest we do not increase the collection rate too much, or dbstore1002 will definitely run out of space.

@elukey, @Marostegui: AFAIK we'll be running the A/B test until we have recorded a certain number of events, not indefinitely. @Tbayer: Can you confirm this? If this is the case, then we'd simply disable the A/B test sooner.

@phuedx - I believe we were planning on running the test for 2 weeks. Given the low sampling rates, I think we can do two weeks from the point that we increase the rate. Assuming we can do that today, we would be finishing the test on September 13th. @Tbayer - correct me if I'm wrong on any of the above.
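Two weeks from the day the rate is increased gives the September 13 end date, and 14 consecutive days cover every weekday exactly twice, which matters given the weekly seasonality of the data. A Python sketch, assuming the increase lands on 30 August 2017 (the date of the "Scale A/B test bucket sizes by 10" SAL entry):

```python
from collections import Counter
from datetime import date, timedelta

# Assumption: the higher rate takes effect on 30 August 2017.
rate_increase = date(2017, 8, 30)
planned_end = rate_increase + timedelta(weeks=2)

# The 14 full days of data collected before the planned end date.
test_days = [rate_increase + timedelta(days=i) for i in range(14)]
weekday_counts = Counter(d.strftime("%A") for d in test_days)

print(planned_end.isoformat())  # 2017-09-13
print(sorted(set(weekday_counts.values())))  # [2]
```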

Yes, as @phuedx and @ovasileva note, this instrumentation is not meant to run indefinitely.
As a reminder, the disk space issue has already been discussed extensively at T172322 (@Marostegui, I think you were CCed there at some point but the conversation was mainly handled by other people on the Ops side), which resulted in an assessment that this test can go ahead (T172322#3533459; also after the Readers team had put in some extra work to help free up space by dropping another table). It looks like the "Notify DBA and Analytics Engineering when launching" part of the present task was misunderstood a bit above as launching another assessment process essentially duplicating T172322. Rather, the Ops suggestion at T172322#3533459 had been to provide a notification so that disk space use can be monitored after the launch (also by ourselves - that's why the Grafana link is in the task description).


Thanks for providing some context, @Tbayer.
I am fine with this, but I do encourage everyone to check the disk space graph and stop as soon as the available space is less than 900-850G (right now we have 950G available). Otherwise, we will run into serious problems (/cc @elukey)
This host grows quite a lot daily so that's why I am being very careful here, as if we do not control it, we will have a read-only host because of no disk space available :-)

Change 374815 had a related patch set uploaded (by Phuedx; owner: Phuedx):
[operations/mediawiki-config@master] pagePreviews: Scale A/B test bucket sizes by 10

https://gerrit.wikimedia.org/r/374815

Change 374815 merged by jenkins-bot:
[operations/mediawiki-config@master] pagePreviews: Scale A/B test bucket sizes by 10

https://gerrit.wikimedia.org/r/374815

Mentioned in SAL (#wikimedia-operations) [2017-08-30T13:55:24Z] <hashar@tin> Synchronized wmf-config/InitialiseSettings.php: pagePreviews: Scale A/B test bucket sizes by 10 - T172291 (duration: 00m 46s)

phuedx updated the task description. Aug 30 2017, 2:00 PM

@Tbayer - can you confirm that the sample rates are giving us adequate amounts of data?

Tbayer added a comment. Sep 1 2017, 4:59 PM

> @Tbayer - can you confirm that the sample rates are giving us adequate amounts of data?

Yes. Note that (because of weekly seasonality) we will want to have two full weeks of valid data.

Tbayer added a comment. Sep 5 2017, 8:21 AM


> Thanks for providing some context, @Tbayer.
> I am fine with this, but I do encourage everyone to check the disk space graph and stop as soon as the available space is less than 900-850G (right now we have 950G available). Otherwise, we will run into serious problems (/cc @elukey)
> This host grows quite a lot daily so that's why I am being very careful here, as if we do not control it, we will have a read-only host because of no disk space available :-)

Hm, my takeaway from @elukey's linked comment had actually been that we expect the experiment to fit in the available space with the planned length and event rate.

And according to Grafana, the free space is actually at 872GB right now (down from 950 GB at the time of your comment six days ago). However, unless I'm misunderstanding something about the nature of replication, this can't really be the fault of our experiment, because the table is currently not replicating anyway, or only very slowly (until August 31 10am per T174815#3579120), and still takes only 17.6GB (uncompressed, I assume) per [1].

Does this mean we have to stop all active EventLogging schemas right now?
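The figures above allow a rough projection: free space fell from 950 GB to 872 GB over six days, about 13 GB/day for the whole host (not just Popups). A Python sketch that extrapolates linearly to the 850 GB lower bound of the stop range; the linear trend is an assumption, and the helper names are hypothetical:

```python
def daily_decline(free_before_gb, free_after_gb, days):
    """Average free-space decline per day between two readings."""
    return (free_before_gb - free_after_gb) / days


def days_until(free_now_gb, threshold_gb, decline_per_day_gb):
    """Days until free space reaches `threshold_gb`, assuming a linear trend.

    Returns 0.0 if already at or below the threshold, and infinity if free
    space is not declining.
    """
    if decline_per_day_gb <= 0:
        return float("inf")
    return max(0.0, (free_now_gb - threshold_gb) / decline_per_day_gb)


rate = daily_decline(950, 872, 6)
print(f"{rate:.1f} GB/day; about {days_until(872, 850, rate):.1f} days until 850 GB free")
# 13.0 GB/day; about 1.7 days until 850 GB free
```

By the same arithmetic, the upper 900 GB bound has already been crossed.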

[1]
SELECT table_name AS `Table`,
  ROUND((data_length + index_length) / 1024 / 1024, 2) AS `Size in MB`
FROM information_schema.TABLES
WHERE table_schema = 'log'
  AND table_name LIKE 'Pop%';

+--------------------------+------------+
| Table                    | Size in MB |
+--------------------------+------------+
| Popups_11625443          |    2033.50 |
| Popups_15597282          |    4528.22 |
| Popups_15777589          |    3626.17 |
| Popups_15906495          |   59453.78 |
| Popups_16208085_15423246 |      10.28 |
| Popups_16364296          |   17604.48 |
| Popups_16364296_15423246 |     353.45 |
| Popups_7536956           |    6679.04 |
+--------------------------+------------+
8 rows in set (0.00 sec)
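Summing the query output puts the 17.6GB table in context. A Python sketch using the sizes exactly as reported (the query divides bytes by 1024 * 1024, so these are 1024-based megabytes); treating Popups_16364296 as the table for this experiment follows the comment above:

```python
# Sizes in MB (1024-based) as reported by the information_schema query above.
POPUPS_TABLE_SIZES_MB = {
    "Popups_11625443": 2033.50,
    "Popups_15597282": 4528.22,
    "Popups_15777589": 3626.17,
    "Popups_15906495": 59453.78,
    "Popups_16208085_15423246": 10.28,
    "Popups_16364296": 17604.48,
    "Popups_16364296_15423246": 353.45,
    "Popups_7536956": 6679.04,
}

total_mb = sum(POPUPS_TABLE_SIZES_MB.values())
current_mb = POPUPS_TABLE_SIZES_MB["Popups_16364296"]

print(f"All Popups tables: {total_mb:.2f} MB (~{total_mb / 1024:.1f} GB)")
print(f"Popups_16364296: {current_mb:.2f} MB ({current_mb / total_mb:.1%} of the total)")
```

On these numbers, older Popups tables (Popups_15906495 alone is about 58 GB) take up far more space than the table for this test.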

This has been lingering in sign-off for some time, and I wonder if there is a better way to capture the remaining problems, e.g. in a "verify A/B test" card that lives in the Blocked column and outlines more clearly what the remaining problem is. The A/B test has been launched, so it's misleading to leave this open (cc @MBinder_WMF).

@Jdlrobson I'm biased towards making things more granular, so I like your plan. :)

I'm down for breaking out the "Verify the results of the Page Previews enwiki/dewiki A/B test" task.

@MBinder_WMF: How would you suggest that we track the important task of monitoring EL disk space?

@phuedx, would it make sense to resolve this and move T172322 into Blocked?

@Jdlrobson, @phuedx - before we resolve there was a question on email from @Pcoombe on whether our sample rate is matching the event rate we're seeing. I believe so, but would it be possible to double-check? After that I think we're good to close this.

phuedx added a comment. Sep 8 2017, 2:45 PM

Hopefully, @Pcoombe won't mind me quoting his email as it speaks to AC #2:

> We're seeing some odd results here, although I don't think it's related to the eventlogging issues you mention. Quick question: can you confirm what percentage of people should be seeing hovercards? And does that match what you're seeing in your logging?

phuedx updated the task description. Sep 8 2017, 3:11 PM
phuedx claimed this task. Sep 8 2017, 3:41 PM
phuedx updated the task description. Sep 8 2017, 3:55 PM
phuedx added a comment (edited). Sep 8 2017, 4:17 PM

I think this task could be closed now. We have a venue for discussing the instrumentation (T175377).

phuedx removed phuedx as the assignee of this task. Sep 8 2017, 4:17 PM

Staying true to our SOP: I worked on this, so while I'd really like to sign off on it, I'll leave that to @ovasileva.

phuedx assigned this task to ovasileva. Sep 8 2017, 4:18 PM

> @MBinder_WMF: How would you suggest that we track the important task of monitoring EL disk space?

@phuedx If it doesn't fit nicely as its own task, and this task is being deprecated, can it be attached to another task that is being broken off this one? Otherwise, it may just need to be a small recurring task, as needed.