Validate bucketing used for backend tests which report to CirrusSearchUserTesting log
Closed, Resolved · Public · 3 Story Points

Description

The last analysis @Ironholds ran showed that the balance of requests for web was fairly even, but that significantly more requests ended up in bucket a for the API.

This is most likely because we are using a consistent bucketing scheme. This scheme is:

  1. take the ip address + x-forwarded-for + user-agent, md5 them together : https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/master/includes/ElasticsearchIntermediary.php#L593
  2. convert that 128 bit number to a floating point probability between 0 and 1 : https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/master/includes/UserTesting.php#L182
  3. accept all users that meet 1/$sampleRate >= $probability : https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/master/includes/UserTesting.php#L207
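The three steps above can be sketched roughly as follows. This is a Python illustration, not the PHP in CirrusSearch: the function names, the `:` separator used when joining the fields, and the exact way the digest is scaled to a probability are all assumptions made for the example.

```python
import hashlib


def identity_hash(ip: str, xff: str, user_agent: str) -> str:
    """Step 1: md5 over ip + x-forwarded-for + user-agent.

    The ':' separator is an assumption; the real code may join differently.
    """
    raw = f"{ip}:{xff}:{user_agent}".encode("utf-8")
    return hashlib.md5(raw).hexdigest()


def hex_to_probability(hex_digest: str) -> float:
    """Step 2: scale the 128-bit digest to a float between 0 and 1."""
    return int(hex_digest, 16) / (1 << 128)


def is_sampled(probability: float, sample_rate: int) -> bool:
    """Step 3: accept the user when 1/sample_rate >= probability."""
    return 1.0 / sample_rate >= probability


if __name__ == "__main__":
    h = identity_hash("198.51.100.7", "", "ExampleBot/1.0")
    p = hex_to_probability(h)
    print(h, p, is_sampled(p, sample_rate=10))
```

Because the hash depends only on the identity fields, the same user lands in the same place on every request, which is what makes the scheme consistent.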

I haven't analysed the logs to fully verify this, but the most likely reason for the imbalance is that some users send 1 or 2 requests and some users send 100k requests. Those heavy users will bias whatever bucket they end up in.
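A small simulation makes the hypothesis concrete. The sketch below is illustrative only: the identities and request counts are made up, and assigning buckets by the parity of the md5 digest is an assumption, not the rule CirrusSearch uses.

```python
import hashlib
from typing import Dict, Tuple


def bucket_for(identity: str) -> str:
    """Assign a bucket from the identity hash (parity rule is an assumption)."""
    digest = hashlib.md5(identity.encode("utf-8")).hexdigest()
    return "a" if int(digest, 16) % 2 == 0 else "b"


def bucket_shares(request_counts: Dict[str, int]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Count users and requests per bucket under per-user bucketing."""
    users = {"a": 0, "b": 0}
    requests = {"a": 0, "b": 0}
    for identity, n_requests in request_counts.items():
        bucket = bucket_for(identity)
        users[bucket] += 1
        requests[bucket] += n_requests
    return users, requests


if __name__ == "__main__":
    # 1000 light users with 2 requests each, plus one heavy API client.
    counts = {f"light-user-{i}": 2 for i in range(1000)}
    counts["heavy-api-client"] = 100_000
    users, requests = bucket_shares(counts)
    print("users per bucket:   ", users)
    print("requests per bucket:", requests)
```

Users split roughly evenly between the buckets, but whichever bucket receives the heavy client ends up with nearly all of the requests.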

EBernhardson updated the task description.
EBernhardson raised the priority of this task from to Needs Triage.
Restricted Application added subscribers: StudiesWorld, Aklapper. Nov 25 2015, 5:44 PM
mpopov claimed this task. Nov 30 2015, 10:01 PM
mpopov set Security to None.
mpopov edited a custom field.
mpopov moved this task from Backlog to In progress on the Discovery-Analysis (Current work) board.

I'm starting to dig into the various log locations we have to figure out the uneven bucketing of API users.

  • I tried EBernhardson::CirrusSearchRequestSet but that doesn't have the necessary components needed to reconstruct that MD5 hash you use for bucketing. That is, the IP address there is the resolved IP address.
  • I tried the files in /a/mw-log/archive/CirrusSearchUserTesting but those have the same issue as ^
  • I tried wmf::webrequest but that includes everything but source so I can't compare web vs API.
  • I tried the files in /a/mw-log/archive/CirrusSearchRequests but all the identifying info is now an 'identity' hash that I'm not entirely sure how to unpack into an IP address + X-forwarded-for + User-Agent set.

It appears that https://wikitech.wikimedia.org/wiki/Analytics/Data/Cirrus has all the necessary components but it doesn't seem like it's up?

I am confused where to look for a dataset that would allow me to reconstruct the bucketing pipeline you described for Cirrus searches from web vs API. Any advice? Am I just being silly by not noticing something?

  • The resolved IP address in CirrusSearchRequestSet is the same one used to generate the MD5 hash. Generally the raw IP address isn't used anywhere in the application layer; it points to one of our internal Varnish servers. The identity field of CirrusSearchRequestSet is the same md5(ip + xff + user-agent) that is used for bucketing. I think this should be reasonable to work from.
  • CirrusSearchUserTesting won't be useful: the User-Agent there is mangled (at Oliver's request) to eliminate tabs, single quotes and double quotes, so it's not the same value that was used for bucketing. Additionally we don't have the xff header there.
  • For wmf.webrequest we can kinda/sorta determine if it was web or API, basically by checking if `uri_path = '/w/api.php'`. This will only work for most requests; there are always oddball things in wmf.webrequest.
  • CirrusSearchRequests again uses the same identity value as is used for bucketing.

I'd suggest using the identity fields in either CirrusSearchRequestSet or CirrusSearchRequests.

mpopov added a comment. Dec 1 2015, 1:44 PM

Awesome, thanks! :D

Report looks good, but what are the recommendations? That we just deal with it? That we switch to using executor ID for bucketing?

mpopov added a comment. Dec 3 2015, 2:54 PM

Done https://github.com/wikimedia-research/Validate-Cirrus-Bucketing/blob/master/T119639.pdf :-)

Summary: After the experimental analysis of our previous A/B test (Language Switch test), we became concerned about our procedure of selecting users for testing and assigning them to experimental/control groups. While the web queries were evenly bucketed, API queries were not as evenly bucketed. In this assessment we validated the technique and showed that the users were evenly bucketed from both sources, but that heavy users of the API skewed the bucketing proportions as hypothesized.

Recommendation: Going forward, if we are studying queries and regarding them as individual sampling units, then it is our highest recommendation to shift to per-query sampling rather than per-user sampling. In this report we show what bucketing looks like when the timestamp is included in the creation of the hex identity on which sampling and bucketing rely. That is, we should still include the user identity hash for grouping queries into sets if the analysis requires it, but we should add an additional field for a query identity hash, which is the one that we will use for sampling and bucketing. This is our recommendation until we switch to performing these kinds of tests through Relevance Labs.
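A minimal sketch of the recommended two-hash approach, assuming a per-request timestamp is available: the user identity hash stays stable for grouping, while a separate query identity hash (user hash plus timestamp) drives sampling and bucketing. The function names, the concatenation format, and the parity-based bucket rule are hypothetical and only serve the illustration.

```python
import hashlib


def user_identity(ip: str, xff: str, user_agent: str) -> str:
    """Stable per-user hash, kept for grouping queries into sets."""
    return hashlib.md5(f"{ip}:{xff}:{user_agent}".encode("utf-8")).hexdigest()


def query_identity(user_hash: str, timestamp: str) -> str:
    """Per-query hash: the user hash combined with the request timestamp."""
    return hashlib.md5(f"{user_hash}:{timestamp}".encode("utf-8")).hexdigest()


def bucket_of(hex_hash: str) -> str:
    """Hypothetical bucket rule: parity of the digest."""
    return "a" if int(hex_hash, 16) % 2 == 0 else "b"


if __name__ == "__main__":
    user = user_identity("198.51.100.7", "", "ExampleBot/1.0")
    # One heavy client sending many queries at different timestamps.
    buckets = [bucket_of(query_identity(user, str(1448900000 + i))) for i in range(1000)]
    print("user bucket under per-user scheme:", bucket_of(user))
    print("per-query split:", buckets.count("a"), buckets.count("b"))
```

With per-query hashing, a single heavy client's requests are spread across both buckets instead of all landing in one.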

Not until you've put it on Commons it isn't!

mpopov added a comment. Dec 3 2015, 3:07 PM
This comment was removed by mpopov.
Deskana triaged this task as Normal priority. Jan 21 2016, 1:34 AM
Deskana added a subscriber: Deskana.
Deskana closed this task as Resolved. Feb 4 2016, 6:20 AM