Background
=========
The current sampling rate of 1 in 200 does not work well for wikis other than English Wikipedia because they get much less traffic. As a result, some wikis' metrics vary a lot from day to day (from the [[ http://discovery-beta.wmflabs.org/metrics/#langproj_breakdown | development version of the search metrics dashboard ]]):
{F7619401}
because we simply don't have a lot of search sessions recorded for them on a daily basis:
{F7619423}
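To illustrate why small daily samples produce noisy metrics, here is a rough sketch. It assumes the metric is a proportion (for example, a clickthrough rate) estimated from independent sessions. The rate `p` and the sample sizes are hypothetical values chosen only for illustration, and `proportion_se` is not part of any existing code.

```lang=R
# Standard error of a proportion estimated from n independent sessions
proportion_se <- function(p, n) sqrt(p * (1 - p) / n)

p <- 0.3                   # hypothetical true rate, for illustration only
n <- c(20, 40, 200, 1000)  # hypothetical numbers of sessions recorded per day
se <- proportion_se(p, n)

# Approximate 95% margin of error (in percentage points) for each sample size
data.frame(sessions = n, margin_pct = round(100 * 1.96 * se, 1))
```

Under these assumptions, the margin of error shrinks from roughly 14 percentage points at 40 sessions per day to under 3 at 1000.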
Proposal
=======
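To get more stable estimates, we should increase the sampling rates. Instead of specifying a rate for each wiki in [[ https://github.com/wikimedia/mediawiki-extensions-WikimediaEvents/blame/master/modules/ext.wikimediaEvents.searchSatisfaction.js#L118--L133 | searchSatisfaction.js ]], @TJones recommended that we increase the overall sampling rate and manually specify a lower one for a few select wikis such as enwiki.
For example, consider the following table (source code at the bottom). "med sessions/day" is the median number of sessions recorded per day at the current 1 in 200 rate, "total" is that median multiplied by 200 (an estimate of all daily search sessions), and the target rates are those that would yield about 1000 sampled sessions per day. Because each "one in" value is rounded up to a whole number, the expected sample can fall short of 1000: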
|wiki | med sessions/day in April 2017|total | target| target rate (numeric)|target rate (fraction) | target rate ("one in")|expected sample |
|enwiki | 10406|~2081200 | 1000| 0.000480|3 / 6250 | 2084|~999 |
|dewiki | 1862|~372400 | 1000| 0.002685|537 / 2e+05 | 373|~999 |
|ruwiki | 1231|~246200 | 1000| 0.004062|2031 / 5e+05 | 247|~997 |
|eswiki | 924|~184800 | 1000| 0.005411|5411 / 1e+06 | 185|~999 |
|frwiki | 789|~157800 | 1000| 0.006337|6337 / 1e+06 | 158|~999 |
|jawiki | 522|~104400 | 1000| 0.009579|9579 / 1e+06 | 105|~995 |
|itwiki | 451|~90200 | 1000| 0.011086|5543 / 5e+05 | 91|~992 |
|zhwiki | 416|~83200 | 1000| 0.012019|12019 / 1e+06 | 84|~991 |
|plwiki | 301|~60200 | 1000| 0.016611|16611 / 1e+06 | 61|~987 |
|ptwiki | 296|~59200 | 1000| 0.016892|4223 / 250000 | 60|~987 |
|cswiki | 229|~45800 | 1000| 0.021834|10917 / 5e+05 | 46|~996 |
|nlwiki | 215|~43000 | 1000| 0.023256|2907 / 125000 | 43|~1000 |
|enwiktionary | 194|~38800 | 1000| 0.025773|25773 / 1e+06 | 39|~995 |
|commonswiki | 151|~30200 | 1000| 0.033113|33113 / 1e+06 | 31|~975 |
|kowiki | 150|~30000 | 1000| 0.033333|33333 / 1e+06 | 31|~968 |
|arwiki | 117|~23400 | 1000| 0.042735|8547 / 2e+05 | 24|~975 |
|svwiki | 109|~21800 | 1000| 0.045872|2867 / 62500 | 22|~991 |
|trwiki | 109|~21800 | 1000| 0.045872|2867 / 62500 | 22|~991 |
|fiwiki | 88|~17600 | 1000| 0.056818|28409 / 5e+05 | 18|~978 |
|fawiki | 62|~12400 | 1000| 0.080645|16129 / 2e+05 | 13|~954 |
|viwiki | 61|~12200 | 1000| 0.081967|81967 / 1e+06 | 13|~939 |
|hewiki | 55|~11000 | 1000| 0.090909|90909 / 1e+06 | 12|~917 |
|huwiki | 51|~10200 | 1000| 0.098039|98039 / 1e+06 | 11|~928 |
|frwiktionary | 46|~9200 | 1000| 0.108696|13587 / 125000 | 10|~920 |
|idwiki | 44|~8800 | 1000| 0.113636|28409 / 250000 | 9|~978 |
|ruwiktionary | 43|~8600 | 1000| 0.116279|116279 / 1e+06 | 9|~956 |
|nowiki | 42|~8400 | 1000| 0.119048|14881 / 125000 | 9|~934 |
|ukwiki | 40|~8000 | 1000| 0.125000|1 / 8 | 8|~1000 |
|dewiktionary | 39|~7800 | 1000| 0.128205|25641 / 2e+05 | 8|~975 |
|cawiki | 25|~5000 | 1000| 0.200000|1 / 5 | 5|~1000 |
|rowiki | 22|~4400 | 1000| 0.227273|227273 / 1e+06 | 5|~880 |
|dawiki | 19|~3800 | 1000| 0.263158|131579 / 5e+05 | 4|~950 |
|skwiki | 17|~3400 | 1000| 0.294118|147059 / 5e+05 | 4|~850 |
|thwiki | 17|~3400 | 1000| 0.294118|147059 / 5e+05 | 4|~850 |
|elwiki | 16|~3200 | 1000| 0.312500|5 / 16 | 4|~800 |
|wikimania2017wiki | 16|~3200 | 1000| 0.312500|5 / 16 | 4|~800 |
|etwiki | 13|~2600 | 1000| 0.384615|76923 / 2e+05 | 3|~867 |
|plwiktionary | 13|~2600 | 1000| 0.384615|76923 / 2e+05 | 3|~867 |
|simplewiki | 13|~2600 | 1000| 0.384615|76923 / 2e+05 | 3|~867 |
|srwiki | 12|~2400 | 1000| 0.416667|416667 / 1e+06 | 3|~800 |
|bgwiki | 11|~2200 | 1000| 0.454545|90909 / 2e+05 | 3|~734 |
|enwikibooks | 10|~2000 | 1000| 0.500000|1 / 2 | 2|~1000 |
|enwikiquote | 10|~2000 | 1000| 0.500000|1 / 2 | 2|~1000 |
|hrwiki | 10|~2000 | 1000| 0.500000|1 / 2 | 2|~1000 |
|kawiki | 10|~2000 | 1000| 0.500000|1 / 2 | 2|~1000 |
|wikidatawiki | 10|~2000 | 1000| 0.500000|1 / 2 | 2|~1000 |
|cswiktionary | 9|~1800 | 1000| 0.555556|138889 / 250000 | 2|~900 |
|ltwiki | 9|~1800 | 1000| 0.555556|138889 / 250000 | 2|~900 |
|azwiki | 8|~1600 | 1000| 0.625000|5 / 8 | 2|~800 |
|hywiki | 8|~1600 | 1000| 0.625000|5 / 8 | 2|~800 |
|svwiktionary | 8|~1600 | 1000| 0.625000|5 / 8 | 2|~800 |
|slwiki | 7|~1400 | 1000| 0.714286|357143 / 5e+05 | 2|~700 |
|hiwiki | 6|~1200 | 1000| 0.833333|833333 / 1e+06 | 2|~600 |
|mediawikiwiki | 6|~1200 | 1000| 0.833333|833333 / 1e+06 | 2|~600 |
|mswiki | 6|~1200 | 1000| 0.833333|833333 / 1e+06 | 2|~600 |
This suggests that if we want to maintain a target sample of about 1000 sessions per day to get reliable metrics, we should change the overall sampling rate to 1 in 10 and then specify custom, rounded rates (from the table) for enwiki (1 in 2000), dewiki (1 in 370), ruwiki (1 in 250), eswiki (1 in 200), frwiki (1 in 160), ..., commonswiki (1 in 30), and fawiki (1 in 15).
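As a rough check of this configuration, the sketch below applies the proposed default and the overrides listed above to the estimated daily totals of a few wikis from the table. The names `default_one_in`, `overrides`, `one_in_for`, and `expected_sample` are hypothetical and only for illustration. The actual change would be made in searchSatisfaction.js, and only the overrides spelled out above are included here.

```lang=R
# Proposed default rate ("one in N") and the per-wiki overrides named above
default_one_in <- 10
overrides <- c(enwiki = 2000, dewiki = 370, ruwiki = 250, eswiki = 200,
               frwiki = 160, commonswiki = 30, fawiki = 15)

# Hypothetical helper: use the override if one exists, else the default
one_in_for <- function(wiki) {
  unname(ifelse(wiki %in% names(overrides), overrides[wiki], default_one_in))
}

# Expected number of sampled sessions per day given an estimated daily total
expected_sample <- function(wiki, total) total / one_in_for(wiki)

# Estimated daily totals taken from the "total" column of the table
totals <- c(enwiki = 2081200, dewiki = 372400, fawiki = 12400,
            ukwiki = 8000, cawiki = 5000)
round(expected_sample(names(totals), totals))
```

With these inputs, enwiki and dewiki land near 1000 sessions per day and fawiki at roughly 827 because of the rounding to 1 in 15. Wikis left on the default rate, such as ukwiki and cawiki, get about 800 and 500.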
Appendix
=======
```lang=R
library(tidyverse)

# Daily distinct search sessions per wiki for April 2017,
# counting only events that are not assigned to a sub-test
counts <- wmf::mysql_read("SELECT
  wiki,
  DATE(LEFT(timestamp, 8)) AS `date`,
  COUNT(DISTINCT(event_mwSessionId)) AS sessions
FROM TestSearchSatisfaction2_16270835
WHERE
  LEFT(timestamp, 6) = '201704'
  AND (event_subTest IS NULL OR event_subTest = '')
GROUP BY wiki, `date`;", "log")

target_sample <- 1000 # desired number of sampled sessions per wiki per day

counts %>%
  group_by(wiki) %>%
  summarize(`med sessions/day in April 2017` = ceiling(median(sessions))) %>%
  arrange(desc(`med sessions/day in April 2017`)) %>%
  mutate(
    # Scale up by the current 1 in 200 sampling rate to estimate all daily sessions
    total = `med sessions/day in April 2017` * 200,
    target = target_sample,
    `target rate (numeric)` = round(target / total, 6),
    `target rate (fraction)` = vapply(`target rate (numeric)`, FRACTION::fra, ""),
    # Turn the fraction "a / b" into "one in ceiling(b / a)"
    `target rate ("one in")` = vapply(`target rate (fraction)`, . %>% strsplit(" / ") %>% { .[[1]] } %>% as.numeric %>% { .[2]/.[1] } %>% ceiling, 0),
    `expected sample` = paste0("~", ceiling(total * (1/`target rate ("one in")`)))
  ) %>%
  # Drop wikis whose estimated total does not exceed the target
  filter(`target rate (numeric)` < 1) %>%
  mutate(total = paste0("~", total))
```