We probably have some old tasks related to this, but I wanted to use this task as an ideas dumping ground, separate from the parent task, for ideas that come up while I review research papers.
We've floated the idea a few times in the past that we could study performance perception in the real world by making the site slower on purpose for a group of users and studying the effect on their behavior. I think a refinement of that idea would be to compare two kinds of degradation: high random variance in performance versus bad but consistent performance. It's possible that a website that runs fast 99% of the time but has a very slow page load every now and then is more frustrating to users than one whose performance is consistently average. In the context of what our team is doing, if one phenomenon is a lot worse than the other, this would dramatically shift our focus. If consistency is the most important thing, tackling high percentiles should be our main focus. If faster response is the most important factor, it confirms that our current approach of making things faster across the board is the right one.
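
A minimal sketch of what the bucketing could look like, assuming we can inject an artificial delay per request (bucket names, delay values and the hashing scheme below are placeholders, not a real design). The key point is that the "consistent" and "high variance" treatments should have the same mean added latency, so only the variance differs:

```python
# Hypothetical sketch: assign users to experiment buckets and decide how much
# artificial latency to add to a given pageload. Not tied to any real system.
import hashlib
import random

BUCKETS = ["control", "consistent-slow", "high-variance"]

def bucket_for(user_token: str) -> str:
    """Deterministically assign a user to one of the experiment buckets."""
    digest = int(hashlib.sha256(user_token.encode()).hexdigest(), 16)
    return BUCKETS[digest % len(BUCKETS)]

def injected_delay_ms(user_token: str) -> float:
    """Extra latency to add to this pageload, depending on the user's bucket."""
    bucket = bucket_for(user_token)
    if bucket == "consistent-slow":
        # Always mediocre: every pageload gets the same fixed penalty.
        return 300.0
    if bucket == "high-variance":
        # Usually fast, occasionally terrible: same mean (300ms) as above,
        # but the cost is concentrated in rare, very slow pageloads.
        return 3000.0 if random.random() < 0.1 else 0.0
    return 0.0  # control: no artificial delay
```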
In a controlled environment, it would be interesting to identify the latency threshold for the core mechanics we want to study (e.g. reading, editing). There is a limit to what humans can perceive as instantaneous, and it seems to depend on context (studies suggest that audio and haptic latency thresholds differ). It might also depend on age and background, with studies again suggesting that younger people have lower latency thresholds. This would tell us where the limit is, beyond which optimizing is pointless because people can't tell the difference. We could also use this study to measure how satisfied people are at different latencies once they start perceiving them. Taking a pessimistic example, if moving the needle from, say, 100ms to 30ms response time increases user satisfaction only from 80% to 82%, it might not justify some of the budget allocated to a project aiming to achieve such performance improvements.
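
One way the threshold could be estimated from such lab data is to fit a psychometric function to "did this feel instantaneous?" responses at several injected latencies and read off the 50% point. A rough sketch under that assumption (the sample numbers below are made up for illustration):

```python
# Hypothetical sketch: fit a logistic psychometric curve to controlled-lab
# responses and estimate the latency at which half of trials feel "not instant".
import numpy as np
from scipy.optimize import curve_fit

def psychometric(latency_ms, threshold_ms, slope):
    """Probability that a pageload at this latency is judged not instantaneous."""
    return 1.0 / (1.0 + np.exp(-slope * (latency_ms - threshold_ms)))

# Made-up per-condition results: injected latency vs. fraction of trials
# where participants reported noticing a delay.
latencies = np.array([0, 30, 60, 100, 150, 250, 400], dtype=float)
noticed = np.array([0.02, 0.05, 0.15, 0.40, 0.70, 0.92, 0.99])

(threshold_ms, slope), _ = curve_fit(psychometric, latencies, noticed, p0=[100.0, 0.05])
print(f"Estimated perception threshold: ~{threshold_ms:.0f} ms")
```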
We could run a continuous survey on a portion of pages where we capture RUM metrics, asking users after a pageload how fast/pleasant that pageload's performance was. Then, from that trove of metrics, through statistical analysis and/or machine learning, find a way to reproduce the average human-perceived performance score from the RUM metrics alone. I think this would be a great project, because it would be tailored to our content, it would hopefully derive meaning from very fine-grained data that is currently disconnected from the real user experience, and it would keep human feedback in the loop. Beyond studying correlation (maybe there isn't enough correlation with the set of RUM metrics we're dealing with), it would also give us a perceived performance thermometer over time, and very interesting data for researchers to work with.
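
In its simplest form this is a supervised regression problem: predict the survey rating from the RUM metrics captured on the same pageload. A rough sketch of that framing, where the feature names, the rating scale, the file name and the model choice are all assumptions for illustration:

```python
# Hypothetical sketch: learn a mapping from RUM metrics to survey ratings and
# check with cross-validation whether the metrics carry any signal at all.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# One row per surveyed pageload: RUM metrics plus the user's 1-5 rating.
df = pd.read_csv("surveyed_pageloads.csv")  # hypothetical export
features = ["ttfb", "first_paint", "dom_interactive", "load_event_end"]

model = GradientBoostingRegressor()
scores = cross_val_score(model, df[features], df["survey_rating"],
                         scoring="r2", cv=5)
print("Cross-validated R^2:", scores.mean())
# If R^2 stays near zero, the honest conclusion is that these RUM metrics alone
# don't capture perceived performance, which is itself a useful result.
```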