Adjust search satisfaction sampling rate
Closed, Resolved (Public)

Description

Background

The current sampling rate of 1 in 200 doesn't really work for wikis other than English Wikipedia, because they don't get as much traffic. As a result, certain wikis' metrics show a lot of day-to-day variability (as seen in the development version of the search metrics dashboard), simply because we don't record many search sessions for them on a daily basis.

Proposal

To get more stable and consistent estimates, we should raise the sampling rates. Instead of specifying a rate for each wiki in searchSatisfaction.js, @TJones recommended that we increase the overall sampling rate and manually specify a lower one for a few select wikis such as enwiki.
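
The "overall rate plus a few overrides" idea can be sketched as a small lookup. This is only an illustration: the names `DEFAULT_ONE_IN`, `ONE_IN_OVERRIDES`, `oneInFor`, and `isSampled` are hypothetical and are not taken from searchSatisfaction.js, and the numbers are the ones proposed in this task's description.

```javascript
// Hypothetical sketch; not the actual contents of searchSatisfaction.js.
// Rates are expressed as "one in N" sessions.
const DEFAULT_ONE_IN = 10; // overall rate proposed in this task

// Lower rates (larger N) for a few high-traffic wikis.
const ONE_IN_OVERRIDES = {
  enwiki: 2000,
  dewiki: 370,
  ruwiki: 250
};

// Return the "one in N" value for a wiki, falling back to the default.
function oneInFor(wiki) {
  return Object.prototype.hasOwnProperty.call(ONE_IN_OVERRIDES, wiki)
    ? ONE_IN_OVERRIDES[wiki]
    : DEFAULT_ONE_IN;
}

// Decide whether a session is sampled. `random` is injectable so the
// decision can be tested deterministically.
function isSampled(wiki, random = Math.random) {
  return random() < 1 / oneInFor(wiki);
}
```

With this shape, adding or tuning a wiki only means touching the override map; every wiki not listed there gets the overall rate.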

For example, consider the following table (source code at the bottom):

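| wiki | med sessions/day in April 2017 | total | target | target rate (numeric) | target rate (fraction) | target rate ("one in") | expected sample |
| --- | --- | --- | --- | --- | --- | --- | --- |
| enwiki | 10406 | ~2081200 | 1000 | 0.000480 | 3 / 6250 | 2084 | ~999 |
| dewiki | 1862 | ~372400 | 1000 | 0.002685 | 537 / 2e+05 | 373 | ~999 |
| ruwiki | 1231 | ~246200 | 1000 | 0.004062 | 2031 / 5e+05 | 247 | ~997 |
| eswiki | 924 | ~184800 | 1000 | 0.005411 | 5411 / 1e+06 | 185 | ~999 |
| frwiki | 789 | ~157800 | 1000 | 0.006337 | 6337 / 1e+06 | 158 | ~999 |
| jawiki | 522 | ~104400 | 1000 | 0.009579 | 9579 / 1e+06 | 105 | ~995 |
| itwiki | 451 | ~90200 | 1000 | 0.011086 | 5543 / 5e+05 | 91 | ~992 |
| zhwiki | 416 | ~83200 | 1000 | 0.012019 | 12019 / 1e+06 | 84 | ~991 |
| plwiki | 301 | ~60200 | 1000 | 0.016611 | 16611 / 1e+06 | 61 | ~987 |
| ptwiki | 296 | ~59200 | 1000 | 0.016892 | 4223 / 250000 | 60 | ~987 |
| cswiki | 229 | ~45800 | 1000 | 0.021834 | 10917 / 5e+05 | 46 | ~996 |
| nlwiki | 215 | ~43000 | 1000 | 0.023256 | 2907 / 125000 | 43 | ~1000 |
| enwiktionary | 194 | ~38800 | 1000 | 0.025773 | 25773 / 1e+06 | 39 | ~995 |
| commonswiki | 151 | ~30200 | 1000 | 0.033113 | 33113 / 1e+06 | 31 | ~975 |
| kowiki | 150 | ~30000 | 1000 | 0.033333 | 33333 / 1e+06 | 31 | ~968 |
| arwiki | 117 | ~23400 | 1000 | 0.042735 | 8547 / 2e+05 | 24 | ~975 |
| svwiki | 109 | ~21800 | 1000 | 0.045872 | 2867 / 62500 | 22 | ~991 |
| trwiki | 109 | ~21800 | 1000 | 0.045872 | 2867 / 62500 | 22 | ~991 |
| fiwiki | 88 | ~17600 | 1000 | 0.056818 | 28409 / 5e+05 | 18 | ~978 |
| fawiki | 62 | ~12400 | 1000 | 0.080645 | 16129 / 2e+05 | 13 | ~954 |
| viwiki | 61 | ~12200 | 1000 | 0.081967 | 81967 / 1e+06 | 13 | ~939 |
| hewiki | 55 | ~11000 | 1000 | 0.090909 | 90909 / 1e+06 | 12 | ~917 |
| huwiki | 51 | ~10200 | 1000 | 0.098039 | 98039 / 1e+06 | 11 | ~928 |
| frwiktionary | 46 | ~9200 | 1000 | 0.108696 | 13587 / 125000 | 10 | ~920 |
| idwiki | 44 | ~8800 | 1000 | 0.113636 | 28409 / 250000 | 9 | ~978 |
| ruwiktionary | 43 | ~8600 | 1000 | 0.116279 | 116279 / 1e+06 | 9 | ~956 |
| nowiki | 42 | ~8400 | 1000 | 0.119048 | 14881 / 125000 | 9 | ~934 |
| ukwiki | 40 | ~8000 | 1000 | 0.125000 | 1 / 8 | 8 | ~1000 |
| dewiktionary | 39 | ~7800 | 1000 | 0.128205 | 25641 / 2e+05 | 8 | ~975 |
| cawiki | 25 | ~5000 | 1000 | 0.200000 | 1 / 5 | 5 | ~1000 |
| rowiki | 22 | ~4400 | 1000 | 0.227273 | 227273 / 1e+06 | 5 | ~880 |
| dawiki | 19 | ~3800 | 1000 | 0.263158 | 131579 / 5e+05 | 4 | ~950 |
| skwiki | 17 | ~3400 | 1000 | 0.294118 | 147059 / 5e+05 | 4 | ~850 |
| thwiki | 17 | ~3400 | 1000 | 0.294118 | 147059 / 5e+05 | 4 | ~850 |
| elwiki | 16 | ~3200 | 1000 | 0.312500 | 5 / 16 | 4 | ~800 |
| wikimania2017wiki | 16 | ~3200 | 1000 | 0.312500 | 5 / 16 | 4 | ~800 |
| etwiki | 13 | ~2600 | 1000 | 0.384615 | 76923 / 2e+05 | 3 | ~867 |
| plwiktionary | 13 | ~2600 | 1000 | 0.384615 | 76923 / 2e+05 | 3 | ~867 |
| simplewiki | 13 | ~2600 | 1000 | 0.384615 | 76923 / 2e+05 | 3 | ~867 |
| srwiki | 12 | ~2400 | 1000 | 0.416667 | 416667 / 1e+06 | 3 | ~800 |
| bgwiki | 11 | ~2200 | 1000 | 0.454545 | 90909 / 2e+05 | 3 | ~734 |
| enwikibooks | 10 | ~2000 | 1000 | 0.500000 | 1 / 2 | 2 | ~1000 |
| enwikiquote | 10 | ~2000 | 1000 | 0.500000 | 1 / 2 | 2 | ~1000 |
| hrwiki | 10 | ~2000 | 1000 | 0.500000 | 1 / 2 | 2 | ~1000 |
| kawiki | 10 | ~2000 | 1000 | 0.500000 | 1 / 2 | 2 | ~1000 |
| wikidatawiki | 10 | ~2000 | 1000 | 0.500000 | 1 / 2 | 2 | ~1000 |
| cswiktionary | 9 | ~1800 | 1000 | 0.555556 | 138889 / 250000 | 2 | ~900 |
| ltwiki | 9 | ~1800 | 1000 | 0.555556 | 138889 / 250000 | 2 | ~900 |
| azwiki | 8 | ~1600 | 1000 | 0.625000 | 5 / 8 | 2 | ~800 |
| hywiki | 8 | ~1600 | 1000 | 0.625000 | 5 / 8 | 2 | ~800 |
| svwiktionary | 8 | ~1600 | 1000 | 0.625000 | 5 / 8 | 2 | ~800 |
| slwiki | 7 | ~1400 | 1000 | 0.714286 | 357143 / 5e+05 | 2 | ~700 |
| hiwiki | 6 | ~1200 | 1000 | 0.833333 | 833333 / 1e+06 | 2 | ~600 |
| mediawikiwiki | 6 | ~1200 | 1000 | 0.833333 | 833333 / 1e+06 | 2 | ~600 |
| mswiki | 6 | ~1200 | 1000 | 0.833333 | 833333 / 1e+06 | 2 | ~600 |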

This suggests that if we want to maintain a target sample of 1000 sessions per day to get reliable metrics, we should change the overall sampling rate to 1 in 10 and then specify custom rates (rounded from the table) for enwiki (1 in 2000), dewiki (1 in 370), ruwiki (1 in 250), eswiki (1 in 200), frwiki (1 in 160), ..., commons (1 in 30), and fawiki (1 in 15).
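
The arithmetic behind each table row can be reproduced in a few lines. This is a minimal sketch, not the code that generated the table (that is the R script in the Appendix); `deriveRate` and the constant names are hypothetical. It assumes, as the Appendix does, that the median is measured under the current 1 in 200 rate and that the numeric rate is rounded to six decimals before being turned into a "one in N" value.

```javascript
// Hypothetical helper mirroring the Appendix arithmetic.
const CURRENT_ONE_IN = 200;  // sampling rate in effect when the data was collected
const TARGET_SAMPLE = 1000;  // desired number of sampled sessions per day

function deriveRate(medianSampledPerDay) {
  // Estimated number of all sessions per day (sampled and unsampled).
  const total = medianSampledPerDay * CURRENT_ONE_IN;
  // Target rate, rounded to six decimals as in the Appendix.
  const rate = Math.round((TARGET_SAMPLE / total) * 1e6) / 1e6;
  // Round N up, so the expected sample never exceeds the target.
  const oneIn = Math.ceil(1 / rate);
  // Sessions we would expect to record per day at "one in N".
  const expected = Math.ceil(total / oneIn);
  return { total, rate, oneIn, expected };
}

console.log(deriveRate(10406)); // enwiki row
console.log(deriveRate(1862));  // dewiki row
```

Because of the six-decimal rounding, enwiki comes out at 1 in 2084 rather than the 1 in 2082 that dividing 2081200 by 1000 directly would give; the difference is negligible at this scale.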

Appendix

library(tidyverse)
counts <- wmf::mysql_read("SELECT
	wiki,
	DATE(LEFT(timestamp, 8)) AS `date`,
	COUNT(DISTINCT(event_mwSessionId)) AS sessions
FROM TestSearchSatisfaction2_16270835
WHERE
	LEFT(timestamp, 6) = '201704'
	AND (event_subTest IS NULL OR event_subTest = '')
GROUP BY wiki, `date`;", "log")

# Desired number of sampled sessions per wiki per day
target_sample <- 1000
counts %>%
    group_by(wiki) %>%
    # Median number of sampled sessions per day for each wiki
    summarize(`med sessions/day in April 2017` = ceiling(median(sessions))) %>%
    arrange(desc(`med sessions/day in April 2017`)) %>%
    mutate(
        # Scale up by the current 1-in-200 sampling rate to estimate all sessions per day
        total = `med sessions/day in April 2017` * 200,
        target = target_sample,
        `target rate (numeric)` = round(target / total, 6),
        `target rate (fraction)` = vapply(`target rate (numeric)`, FRACTION::fra, ""),
        # Turn the fraction "a / b" into "one in N", with N = ceiling(b / a)
        `target rate ("one in")` = vapply(`target rate (fraction)`, . %>% strsplit(" / ") %>% { .[[1]] } %>% as.numeric %>% { .[2]/.[1] } %>% ceiling, 0),
        `expected sample` = paste0("~", ceiling(total * (1/`target rate ("one in")`)))
    ) %>%
    # Drop wikis whose estimated total is at or below the target (rate >= 1)
    filter(`target rate (numeric)` < 1) %>%
    mutate(total = paste0("~",  total))

mpopov created this task. Apr 18 2017, 10:32 PM

Sounds like a plan!

I'd suggest a less precise tuning—so rather than 1 in 360 for dewiki, maybe ~10% more at 1 in 325. Better to have a little more data than you need than a little less.
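
The suggested padding can be written down as a small calculation. This is a sketch under assumptions: the helper name `padOneIn`, the 10% default, and rounding down to a multiple of 25 are all hypothetical choices made here to reproduce the 1 in 360 to 1 in 325 example; the comment itself does not prescribe a rounding rule.

```javascript
// Hypothetical helper: lower N in "one in N" so that roughly
// `extraFraction` more sessions are sampled, then round N down to a
// multiple of `step` (assumed here) for a tidier configuration value.
function padOneIn(oneIn, extraFraction = 0.10, step = 25) {
  const padded = oneIn / (1 + extraFraction);
  const rounded = Math.floor(padded / step) * step;
  // For small N the step would round to 0, so fall back to a plain floor.
  return Math.max(1, rounded > 0 ? rounded : Math.floor(padded));
}

console.log(padOneIn(360)); // dewiki example from the comment above
```

Rounding N down always errs on the side of collecting slightly more data, which matches the stated preference.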

mpopov updated the task description. Apr 19 2017, 4:41 PM

Change 348969 had a related patch set uploaded (by Bearloga):
[mediawiki/extensions/WikimediaEvents@master] Adjust search satisfaction sampling rates

https://gerrit.wikimedia.org/r/348969

Change 348969 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Adjust search satisfaction sampling rates

https://gerrit.wikimedia.org/r/348969

Change 348974 had a related patch set uploaded (by Bearloga):
[mediawiki/extensions/WikimediaEvents@master] Minor adjustment of adjusted sampling rates

https://gerrit.wikimedia.org/r/348974

Change 348974 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Minor adjustment of adjusted sampling rates

https://gerrit.wikimedia.org/r/348974

Deskana assigned this task to mpopov. Apr 20 2017, 5:06 PM
Deskana added a subscriber: Deskana.

This seems to have been completed.

Deskana closed this task as Resolved. Apr 20 2017, 5:06 PM