
Make it possible to stop a survey after receiving a certain number of responses
Open, Medium, Public

Description

If you know that you need n responses to have statistical significance, then you should stop asking your question after recording that many responses.

This could be combined with T184752: Make it easy to schedule short-term microsurveys, to start and stop at pre-defined times to limit a survey by both a maximum number of responses received and a given date.

Event Timeline

Whatamidoing-WMF created this task.

I'd have to pull in someone with more stats background to be sure, but I think that cutting off a survey at a specific number of responses can be a bit too early. An example I've seen used:

  • You need 10k responses for a statistically significant survey
  • Your responses differ based on time of day and day of week (e.g., weekend users are different than weekday users, 12 UTC users are different than 23 UTC users)
  • You set some sampling rate and run the survey
  • After 7 days you check and have 8k responses. Keep the survey going
  • After another 7 days you check and have 17k responses. Stop the survey.
  • The survey data should be sliced to exactly 14 days prior to analysis

Cutting off a survey after, for example, 36 hours would mean some hours of the day are over-represented and the weekend vs weekday shift isn't recorded.
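The "slice to exactly 14 days" step above could be sketched like this (a hypothetical helper, assuming each response carries a UTC timestamp; the names and dates are illustrative, not from this task):

```python
from datetime import datetime, timedelta

def slice_to_window(timestamps, start, days=14):
    """Keep only responses recorded within `days` full days of `start`,
    so leftover partial days don't over-represent certain hours."""
    end = start + timedelta(days=days)
    return [t for t in timestamps if start <= t < end]

start = datetime(2018, 3, 1)
responses = [
    datetime(2018, 3, 1, 12),   # day 1: inside the window
    datetime(2018, 3, 14, 23),  # day 14: still inside
    datetime(2018, 3, 15, 1),   # day 15: dropped before analysis
]
kept = slice_to_window(responses, start)  # keeps the first two only
```

Slicing on full days (and full weeks) is what preserves the even hour-of-day and day-of-week coverage described above.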

@mpopov or @chelsyx might be able to say with a bit more authority where I mixed things up.

Seems like a useful feature to me.

My 2 cents (with some statistical knowledge, though it is not the main part of my job):

  • It is already quite good if you actually made the calculations to determine the needed sample size (this is often not done)
  • It is also quite good if we are concerned with sampling being dependent on day/time, but that is something I would see as good, not essential, for a first version. (It also requires some knowledge about the effects you can expect… at the latest, if your summer users are different than your winter users, it gets tricky.)

I don't think limiting a survey to a predetermined maximum of n responses is the way to go for the reasons @EBernhardson mentioned. The limiting factor should only be the start and end times/dates, ideally so that all timezones are equally represented.

That is, suppose I launch a survey over the weekend and my survey gets the max responses in 36 hours. Historically, desktop traffic is down on weekends and mobile traffic is up. Most likely I now have a biased dataset where mobile device users' opinions are way over-represented. (Note that the opposite situation would hold on weekdays.) So to correct for the imbalance I have to figure out weights for the responses that yield balanced estimates, and that's tricky. Then I realize that most of my responses came from US users because I started Saturday 9AM Pacific time, and the timezone issue makes things more complicated.

Compare that to just running a survey for 1 week, where you're going to get a more equal distribution of responses across devices and time zones / locations. Yes, which week of the month and which month of the year will have some effect, but it's probably minor compared to those two major ones.

Furthermore, another reason to stay away from count-based limits is if the survey is run on a per-article basis. Suppose you use the same sampling rate across all articles. Some articles are more popular than others, so their surveys finish faster. So you might have all the responses you need for half of your articles while still waiting on the rest. Either you start analyzing the data with half of the surveys still live and weight the responses post hoc, or you just wait until all surveys are done. If you wait, the populations from which samples are selected to respond to the surveys are going to differ dramatically. The people who answered a survey on a popular article within the first 6 hours are going to be very different from the people who answered a survey on an unpopular article within the first 30 days.

And finally, sample size calculations based on a priori power analysis (what @Jan_Dittrich was referring to) estimate the minimum sample size needed to achieve a specific power (the reliability of a test to detect an effect at a predetermined significance level). If a power analysis yields n, one should not stop after obtaining n data points. In fact, if you stop the survey right when n responses are reached but the actual effect size turns out to be smaller than what was guessed during the power analysis, you might not be able to detect that effect with statistical significance. (This is also why people should pay more attention to the practical significance of an effect rather than just the statistical significance, when they can. A tiny effect size that has no practical importance can become statistically significant with a large enough sample size. Although practical vs. statistical significance can be a weird issue if you're publishing in a medical or scientific journal with ramifications for health and public policy.)
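For illustration only, here is a minimal a priori sample-size sketch for estimating a single proportion, using the standard normal-approximation formula n = z²·p(1−p)/e². The inputs (worst-case p = 0.5, 5% margin, z = 1.96 for 95% confidence) are generic textbook values, not from this task:

```python
import math

def min_sample_size(p=0.5, margin=0.05, z=1.96):
    """Minimum n so that a proportion near p is estimated within
    +/- margin at the confidence level implied by z (1.96 ~ 95%)."""
    return math.ceil(z * z * p * (1 - p) / (margin * margin))

n = min_sample_size()  # worst-case p = 0.5 gives n = 385
```

As noted above, this n is a floor, not a stopping rule: if the true effect is smaller than assumed, n responses may not be enough to detect it.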

I hope that made things clearer, not foggier.

@mpopov—I agree 100% with everything you said about doing a survey right! (And it leads me to think we could use some good documentation (or a wikibook!) somewhere on best practices for running a survey, especially advice on a priori power analysis and the dangers of stopping your survey early, along with documentation on how to get the needed info (e.g., page view stats), etc.)

I think there's still a place for an absolute limit—not as part of proper survey procedure, but as a kind of safety valve. Someone could misplace a decimal point (e.g., entering a 0.74% rate as 0.74, which is 74%) or mix up milliseconds and seconds, and hit their target for the week in a couple of hours or even a few minutes.

It could be couched as an "expected number of survey responses": when you hit, say, 5x the expected number, your survey stops, because you misconfigured it, let it run too long, or something else went wrong.
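A safety valve like that might look like the sketch below (the function name and the 5x factor are illustrative assumptions, not a proposed API):

```python
def should_stop(responses_so_far, expected_responses, factor=5):
    """Kill switch: stop the survey once responses reach `factor`
    times the expected count, which suggests a misconfigured
    sampling rate (e.g., 0.74 entered instead of 0.0074)."""
    return responses_so_far >= factor * expected_responses
```

The point of the multiplier is that a normal survey running slightly hot never trips it; only a gross misconfiguration does, which keeps it out of the way of the time-based limits discussed above.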

Agree with @mpopov. And @TJones, I can't agree more on documenting the best practices!

@egalvezwmf , I think you have best practices re: surveys documented somewhere already?

@Elitre, @Aklapper, yes, that is where I've been documenting some things. We also have https://office.wikimedia.org/wiki/Surveys as a rough checklist for staff to follow when doing a survey. It's going to be hard to set up all the guidelines, policies, and checklists up front. In my opinion, starting small and seeing how things go is the best way. Also, a survey is not just software. There is a lot that goes into planning a survey, as well as learning from and using the results. That is something a tool cannot fully solve for, and we may want to think about the people processes as well as the tool. In general, asking people to take surveys whose results we don't use would be unethical and could harm response rates over time for some of our audiences. I'm happy to help advise as needed!

matmarex added a subscriber: matmarex.

Removing MediaWiki-Page-editing so that this doesn't clutter searches related to actually editing pages in MediaWiki. This project should only be on the parent task, probably (T89970). [batch edit]