We probably have some old tasks related to this, but I wanted to use this task as an ideas dumping ground separate from the parent task, for ideas that come up while I review research papers.
**The value of performance stability versus average performance**
We've floated the idea a few times in the past that we could study performance perception in the real world by making the site slower on purpose for a group of users and studying the effect on their behavior. I think a refinement of that idea would be to compare two conditions: high random variance in performance versus bad but consistent performance.
It's possible that a website that runs fast 99% of the time but has a very slow page load every now and then is more frustrating to the user than one whose performance is consistently average.
In the context of what our team is doing, if one of these scenarios is perceived as significantly worse than the other, this would dramatically shift our focus. If consistency is the most important thing, tackling high percentiles should be our main focus. If faster response is the most important factor, it confirms that our current approach of making things faster across the board is the right one.
This could be studied in a controlled environment or "in the wild" by intentionally slowing down page loads. The challenge lies in how we measure that users are more satisfied with one scenario than another. By asking them? By measuring session length?
This might work as an opt-in study, with a browser plugin that either leaves load time untouched, inserts randomly slow pageloads, or delays pageloads as needed so that page load time becomes very consistent. We would then measure time spent on wikis over a long period of time.
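A minimal sketch of what the per-pageload delay policy could look like, assuming three hypothetical arms (control, random spikes, forced-consistent) and an illustrative 2000ms consistency target; an actual plugin would be written in JavaScript, but the logic would be the same:

```lang=python
import random

# Hypothetical experiment arms; names and numbers are illustrative only.
TARGET_MS = 2000           # assumed "consistent" total load time
SPIKE_PROBABILITY = 0.02   # assumed chance of an artificially slow pageload
SPIKE_RANGE_MS = (3000, 8000)

def extra_delay_ms(arm: str, observed_load_ms: float) -> float:
    """Return how much artificial delay to add to this pageload."""
    if arm == "control":
        # Arm 1: plugin does not affect load time at all.
        return 0
    if arm == "random_spikes":
        # Arm 2: occasionally insert a very slow pageload.
        if random.random() < SPIKE_PROBABILITY:
            return random.uniform(*SPIKE_RANGE_MS)
        return 0
    if arm == "consistent":
        # Arm 3: pad every pageload up to the target so that total load
        # time is roughly constant (we can only slow down, never speed up).
        return max(0, TARGET_MS - observed_load_ms)
    raise ValueError(f"unknown arm: {arm}")
```

Note that the "consistent" arm can only pad pageloads that already finish below the target, so the target would have to sit above the participant's typical high percentile for that condition to actually feel consistent.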
//How results might be actionable:// if we find that stability is more important than average performance, it might encourage us to focus on improving high percentiles and extreme cases more than performance across the board. If average performance matters more, this would reinforce our current focus.
**Performance perception thresholds and granularity**
In a controlled environment, it would be interesting to identify the latency threshold for the core mechanics we want to study (e.g. reading, editing).
There is a limit to what humans can perceive as being instantaneous, and it seems to depend on context (since studies suggest that audio and haptic latency thresholds are different). It might also depend on age and background, again with studies suggesting that younger people have lower latency thresholds.
This would tell us the limit beyond which optimizing is pointless, because people can't tell the difference. We could also use this study to measure how satisfied people are at different latency levels once they start perceiving latency. Taking a pessimistic example, if moving the needle from, say, 100ms to 30ms response time increases user satisfaction only from 80% to 82%, it might not justify the budget allocated to a project aiming to achieve such performance improvements.
Studying this would require a lab setup and a big enough cohort of participants.
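As a rough illustration of what a lab trial could look like, here is a sketch of a simple adaptive (staircase) procedure for estimating a participant's latency detection threshold. The starting point, step sizes, and the `present_trial` callback are all assumptions for illustration, not a worked-out protocol:

```lang=python
import random

def estimate_threshold_ms(present_trial, start_ms=300, step_ms=50,
                          min_step_ms=10, reversals_needed=8):
    """Simple 1-up/1-down staircase: lower the added latency after each
    detected trial, raise it after each missed one, and halve the step
    size at every reversal. Returns the average latency at the reversal
    points, a rough estimate of the detection threshold.

    present_trial(latency_ms) must run one trial (e.g. a pageload with
    that much added latency) and return True if the participant reported
    noticing a delay.
    """
    latency = start_ms
    step = step_ms
    last_direction = None
    reversal_points = []

    while len(reversal_points) < reversals_needed:
        detected = present_trial(latency)
        direction = -1 if detected else +1  # go down if noticed, up if not
        if last_direction is not None and direction != last_direction:
            reversal_points.append(latency)
            step = max(min_step_ms, step / 2)
        last_direction = direction
        latency = max(0, latency + direction * step)

    return sum(reversal_points) / len(reversal_points)

# Usage example with a simulated participant whose true threshold is 120ms.
if __name__ == "__main__":
    def simulated_participant(latency_ms, true_threshold=120, noise=30):
        return latency_ms + random.gauss(0, noise) > true_threshold

    print(estimate_threshold_ms(simulated_participant))
```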
//How results might be actionable:// knowing what users consider to be an "instant wiki pageload" would tell us the point beyond which further optimization is futile. Furthermore, identifying the granularity of a perceivable performance difference would inform decisions about whether to pursue a given optimization: if the expected savings are below the threshold of what users can perceive as different, it is probably not worth pursuing.
**Gathering user performance scores and automatically guessing them**
We could run a continuous survey on a portion of pages where we capture RUM metrics, asking users after a pageload how fast/pleasant that pageload's performance was. From that trove of data, through statistical analysis and/or machine learning, we could then try to reproduce the average human-perceived performance score from the RUM metrics alone. I think this would be a great project: it would be tailored to our content, it would hopefully derive meaning from very fine-grained data that is currently disconnected from the real user experience, and it would keep human feedback in the loop. Beyond studying correlation (maybe there isn't enough of it with the set of RUM metrics we're dealing with), it would also give us a perceived performance thermometer over time, and very interesting data for researchers to work with.
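A minimal sketch of the modelling step, assuming a hypothetical table of survey responses joined with their pageload's RUM metrics (the file and column names here are invented for illustration):

```lang=python
# Sketch: predict the survey score from RUM metrics for the same pageload.
# "survey_with_rum.csv" and the column names are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

data = pd.read_csv("survey_with_rum.csv")
rum_features = ["ttfb_ms", "first_paint_ms", "dom_interactive_ms",
                "load_event_ms", "transfer_size_kb"]
X = data[rum_features]
y = data["survey_score"]  # e.g. 1-5 rating of how fast the pageload felt

model = GradientBoostingRegressor()
# Cross-validated R^2 indicates how much of the perceived-performance
# variance the RUM metrics alone can explain (the "maybe there isn't
# enough correlation" question above).
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("mean R^2:", scores.mean())

# If the fit is good enough, the trained model becomes the perceived
# performance estimator for pageloads that never got a survey response.
model.fit(X, y)
```

Even a negative result here would be useful: a consistently low cross-validated score would tell us that the current set of RUM metrics doesn't capture what users actually perceive.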
//How results might be actionable:// having a running score of user performance satisfaction would tell us how well we're doing, accounting for all external factors. For example, if people develop higher expectations of speed as devices and other websites get faster, this would allow us to realize that we're lagging behind if our user performance score drops while all other performance indicators don't. Having the ability to infer user scoring from low-level data would also allow us to detect problematic pageloads and possibly identify patterns in what causes poor satisfaction, which could then inform decisions to optimize specific areas.