Let's make sure the data coming in from this 2-way test matches what we expect.
|Status||Assignee||Task|
|Open||None||T174064 [FY 2017-18 Objective] Implement advanced search methodologies|
|Resolved||EBernhardson||T161632 [Epic] Improve search by researching and deploying machine learning to re-rank search results|
|Resolved||EBernhardson||T162369 Evaluate rescore windows for learning to rank|
|Resolved||None||T174066 [Q1 2017-18 Objective] Perform load and A/B tests on new models (interleaved search results)|
|Resolved||EBernhardson||T150032 Add support for interleaved results in 2-way A/B test|
|Resolved||debt||T171212 Interleaved results A/B test: turn on|
|Resolved||EBernhardson||T171213 Interleaved results A/B test: check that data is flowing the way we expect|
|Resolved||debt||T171214 Interleaved results A/B test: turn off test|
|Resolved||mpopov||T171215 Interleaved results A/B test: analysis of data|
|Declined||EBernhardson||T171984 Turn on test of LTR with standard AB buckets and an interleaved bucket.|
The data volume is actually much smaller than expected. At 1:2000 sampling we collect ~15k full-text sessions per day. Sampling was increased to 1:500 and 75% of sessions were directed into the test, but the result was 15k sessions per day for the dashboards and only ~600 sessions per day recording events into the test (when it should have been ~45k). It's not clear yet what happened.
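For reference, the expected-volume arithmetic behind that ~45k figure can be spelled out; the numbers below are copied from the comment above, not from production config:

```python
# Sketch of the expected session volume (numbers from the comment above,
# not from the deployed configuration).
baseline_sessions = 15_000      # full-text sessions/day observed at 1:2000
old_rate, new_rate = 2000, 500  # sampling changed from 1:2000 to 1:500
test_fraction = 0.75            # share of sampled sessions sent into the test

expected_total = baseline_sessions * (old_rate / new_rate)  # ~60k sessions/day
expected_in_test = expected_total * test_fraction           # ~45k sessions/day

print(int(expected_total), int(expected_in_test))  # 60000 45000
```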
It's not clear yet what's gone wrong here. I've poked at the raw event logging events in the eventlogging-client-side Kafka topic, and the same ratio of events by subTest is there. The webrequest table in Hive shows the same ratio of events by subTest as well. This suggests the events are either not being sent, or are being thrown away incredibly early in the pipeline (unlikely).
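The per-bucket ratio check described above amounts to grouping events by subTest at each pipeline stage and comparing shares. A minimal sketch (field names are illustrative stand-ins for rows pulled from the Kafka topic or the webrequest table):

```python
from collections import Counter

def subtest_ratios(events):
    """Count events per subTest bucket and return each bucket's share.

    `events` is any iterable of dicts with a 'subTest' key -- a stand-in
    for rows from eventlogging-client-side or webrequest; the real
    schemas differ, this only illustrates the ratio comparison.
    """
    counts = Counter(e.get("subTest") for e in events)
    total = sum(counts.values())
    return {bucket: n / total for bucket, n in counts.items()}

# Toy data: the same skewed ratio appearing at every stage points at the
# client never sending the events, rather than mid-pipeline loss.
sample = [{"subTest": "control"}] * 96 + [{"subTest": "interleaved"}] * 4
print(subtest_ratios(sample))  # {'control': 0.96, 'interleaved': 0.04}
```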
The breakdown of the events that did get logged, by either OS or browser, does not suggest we are failing on most browsers and only working in specific cases. Something else is going on, but it's really not clear what. Will continue investigating.
Some documentation from EventLogging is suspicious, but I also think this might no longer be the case, because I see events making it through with a payload > 1kB. And while our search result page events are > 1kB, other events like 'visitPage' are much smaller, so those should still have come through even if the search result page events were rejected. Also, based on the doc, the events should have been truncated, which would still be detectable, rather than disappearing completely:
There is a limitation of the size of individual EventLogging events due the underlying infrastructure (limited size of urls in Varnish's varnishncsa/ varnishlog, as well as Wikimedia UDP packets). For the purpose of size limitation, an "entry" is a /beacon request URL containing urlencoded JSON-stringified event data. Entries longer than 1014 bytes are truncated. When an entry is truncated, it will fail validation because of parsing (as the result is invalid JSON).
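The size constraint the docs describe can be sanity-checked offline by measuring the urlencoded JSON entry for a given event. A rough sketch; the exact beacon path prefix and encoding details here are assumptions, not the real EventLogging client code:

```python
import json
from urllib.parse import urlencode

BEACON_LIMIT = 1014  # truncation threshold quoted from the EventLogging docs

def beacon_entry_size(event):
    """Approximate the size of a /beacon request entry for an event payload.

    Mirrors the docs' description (a urlencoded JSON-stringified event);
    the '/beacon/event?' prefix is an illustrative assumption.
    """
    query = urlencode({"event": json.dumps(event, separators=(",", ":"))})
    return len("/beacon/event?" + query)

small = {"action": "visitPage", "pageViewId": "abc123"}
large = {"action": "searchResultPage", "hitsReturned": list(range(200))}
print(beacon_entry_size(small) < BEACON_LIMIT)  # True
print(beacon_entry_size(large) > BEACON_LIMIT)  # True
```

If the limit were still enforced, only the large search-result-page entries would be affected, which is why the small visitPage events surviving alongside them made this explanation unconvincing.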
What went wrong here is that I completely misestimated the event counts, by making the incorrect assumption that enwiki made up the majority of logged search sessions. Because we vary our sampling rate by wiki, enwiki makes up < 2% of the sessions we record.
Initial sampling rate: 1:2000
Sessions collected per day: ~250
Estimated sessions per day: 500,000
Desired sessions per bucket per day: 1000
Number of buckets: 6
Total sessions sampled: 250 + (6*1000) = 6250
New sampling rate = 500000/6250 = 1 in 80
% of sessions going into sub test: 6000/6250 = 0.96
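The bucket arithmetic above, spelled out (all values copied from the list above):

```python
# New sampling rate calculation (values from the comment above).
sessions_at_old_rate = 250      # sessions/day collected at the initial 1:2000
buckets, per_bucket = 6, 1_000  # desired sessions per bucket per day
estimated_daily = 500_000       # estimated full-text sessions/day overall

total_sampled = sessions_at_old_rate + buckets * per_bucket   # 6250
new_rate = estimated_daily / total_sampled                    # 1 in 80
subtest_share = (buckets * per_bucket) / total_sampled        # 0.96

print(total_sampled, new_rate, subtest_share)  # 6250 80.0 0.96
```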
I'll be deploying this update in a few minutes, after which we should collect ~1k sessions per day per bucket. I don't know yet how many we actually need, but it means analysis of the previously collected data could go forward if we decide it's enough.
Mentioned in SAL (#wikimedia-operations) [2017-08-18T00:48:42Z] <ebernhardson@tin> Synchronized php-1.30.0-wmf.14/extensions/WikimediaEvents/modules/ext.wikimediaEvents.searchSatisfaction.js: T171213: Increase sampling rate of cirrus satisfaction schema (again) to 1k per bucket per day (duration: 00m 44s)