To understand the effects of an A/B test that changes search on Commons, we first need to establish baselines. Second, we hope this data can be used to build dashboards, as mentioned in the parent task.
A first pass at calculating these baselines has now been done. The numbers and the calculations can be found in this Jupyter notebook. It uses the past 7 days as the source of the data and was run on 2020-08-16, so it reflects the week from 2020-08-09 through 2020-08-15. It makes some assumptions and takes some shortcuts, and I'm happy to discuss those and modify the code as we see fit.
For quick reference, the results are as follows:
Number of searches per day:
Daily average over a 7-day period.
- Number of full-text searches with results: 97,826.0
- Number of full-text searches with zero results: 3,646.0
- Number of autocomplete searches: 52,473.6
Number of search sessions per day:
Again a daily average over a 7-day period, split into sessions with at least one full-text search and sessions with at least one autocomplete search. This mirrors the previous split and makes the next statistic, the number of searches per session, meaningful (autocomplete often runs multiple searches as the user types). Sessions with 50 or more searches are removed, as those were regarded as "non-human" in previous search analyses.
- Number of sessions with at least one full-text search: 21,099.6
- Number of sessions with at least one autocomplete search: 17,438.6
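As a rough illustration of the filtering and counting described above, here is a sketch using hypothetical session records; the field names are illustrative and not the actual event schema:

```python
# Sketch: drop "non-human" sessions (50+ searches), then count the
# remaining sessions by search type. Data and field names are made up.
from collections import Counter

sessions = [
    {"session_id": "a", "search_type": "fulltext", "n_searches": 3},
    {"session_id": "b", "search_type": "autocomplete", "n_searches": 7},
    {"session_id": "c", "search_type": "fulltext", "n_searches": 120},  # likely non-human
]

MAX_HUMAN_SEARCHES = 50  # sessions at or above this threshold are removed

human = [s for s in sessions if s["n_searches"] < MAX_HUMAN_SEARCHES]
counts = Counter(s["search_type"] for s in human)
print(counts)  # Counter({'fulltext': 1, 'autocomplete': 1})
```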
Average number of searches per session:
This uses all sessions across the entire week, again removing sessions with 50 or more searches. We use the median because the long-tailed distribution makes the mean misleading.
- Median number of full-text searches per session: 2
- Median number of autocomplete searches per session: 5
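The choice of median over mean can be sketched as follows, with hypothetical per-session search counts (not the real data):

```python
# Illustrative per-session medians; a few heavy sessions create the
# long tail that would distort the mean.
from statistics import median, mean

fulltext_counts = [1, 2, 2, 3, 40]      # one heavy session in the tail
autocomplete_counts = [3, 5, 5, 6, 48]

print(median(fulltext_counts))      # 2 -- robust to the tail
print(mean(fulltext_counts))        # 9.6 -- pulled up by the tail
print(median(autocomplete_counts))  # 5
```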
Search session length:
This is the difference between the first search result page event in a session and the last event recorded in that session (note that this includes check-in events on any visited page, so it also measures dwell time). It's measured for any session during the 7-day period, and sessions with 50 or more searches are again removed. We use the median length due to the non-Normal distribution of session length.
- Median search session length: 48.1 seconds
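The session-length definition above can be expressed as a small sketch; the event tuples and type names here are hypothetical stand-ins for the real event stream:

```python
# Session length: time from the first search result page event to the
# last event of any kind (including dwell-time check-ins) in the session.
events = [
    (0.0, "searchResultPage"),
    (12.4, "click"),
    (48.1, "checkin"),  # check-in events on visited pages count as activity
]

first_serp = min(t for t, kind in events if kind == "searchResultPage")
last_event = max(t for t, _ in events)
session_length = last_event - first_serp
print(session_length)  # 48.1
```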
Click-through rate:
Defined as the proportion of search sessions where the user clicked at least one of the results.
- Click-through rate: 71.6%
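The click-through rate definition reduces to a simple proportion; the session dicts below are hypothetical:

```python
# CTR: sessions with at least one result click, divided by all sessions.
sessions = [
    {"id": "a", "clicks": 2},
    {"id": "b", "clicks": 0},
    {"id": "c", "clicks": 1},
    {"id": "d", "clicks": 1},
]

ctr = sum(1 for s in sessions if s["clicks"] > 0) / len(sessions)
print(f"{ctr:.1%}")  # 75.0%
```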
Average position of clicked result in successful searches:
Defined as the median position of a clicked result in sessions where the user clicked at least one of the results; again, we use the median because this distribution is long-tailed.
- Median position of a clicked result: 4
@CBogen & @Ramsey-WMF : I think I'm mainly curious about whether we should not count full-text and autocomplete searches separately for search sessions, and whether the average position measurement should require some amount of dwell time to count it as a "successful search". Happy to learn about what other questions you have from these baselines too!
Thanks @nettrom_WMF, this is great!
A couple of questions:
Does the number of full-text searches with results (97,826.0) include the number of autocomplete searches?
Are autocomplete searches counted even if the user doesn't run them? (i.e., when an autocomplete term is populated, but the user doesn't select it?)
WRT dwell time, I think in previous discussions we determined that for images, dwell time isn't meaningful - but if I'm remembering incorrectly, please let me know.
@CBogen : the number of full-text searches does not include any autocomplete searches. It does include full-text searches that originate from an autocomplete search (e.g. the user clicks on the "contains …" part, or hits enter) because identifying those to separate them out is tricky.
Autocomplete searches count even if the user doesn't click on any of the results to go to the suggested page. However, we only count them once per page in the number of searches made.
You're right that for images in a grid layout (like the new Media Search), dwell time isn't meaningful. I'm not sure exactly how it works in a list layout like legacy search, since that layout conveys information differently. I think we can keep the definitions we have for now, and if it turns out that we run into issues with them down the road, we can adjust.