
How might we measure readability on Vector22 and Minerva?
Closed, Resolved (Public)

Description

In the context of T341631, the team has been trying to work out a quantitative methodology for assessing readability on Vector22 and Minerva. So far, we haven't been able to land on a methodology that would give us data that we are confident in.

@Readability137 mentioned in T357770:

I would like to commit myself to a respectful, end-user-focused dialogue on this change, informed by scientific data and actual user testing.
I'm also willing to carry out the user testing, provided we can reach a consensus on the methodology.
Additionally, does Wikipedia use A/B testing in any shape or form? If so, we could introduce the 1.5 line-height version to a segment of readers and analyze both the overall and detailed KPIs to see if there are any statistically significant effects.

Outcomes:

  • a quantitative methodology for assessing readability on Vector22
  • a quantitative methodology for assessing readability on Minerva
  • working agreements and assignees for next steps

Event Timeline

JScherer-WMF renamed this task from "How might we measure readability on Vector22 and Minerva" to "How might we measure readability on Vector22 and Minerva?". (Feb 26 2024, 2:56 PM)
JScherer-WMF claimed this task.
JScherer-WMF updated the task description.
JScherer-WMF added a subscriber: Readability137.

@Readability137 what are your thoughts on methodologies?

Some studies we've read tend to use a combined readability metric based on speed and comprehension. They measure the speed of a participant reading a passage, and then give them some comprehension questions. Studies with access to eye-tracking labs tend to measure fixations as a proxy for efficiency rather than just speed. We don't have access to an eye-tracking lab, and lab-based studies don't really approximate the many reading contexts we support, especially on mobile.
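
As a rough illustration of such a combined metric, the sketch below multiplies reading speed (words per minute) by the share of comprehension questions answered correctly. The function name, its parameters, and the choice to multiply the two components rather than weight them some other way are assumptions made for illustration; they are not taken from the studies mentioned above.

```python
def reading_efficiency(word_count: int, seconds: float, correct: int, total: int) -> float:
    """Words per minute scaled by comprehension accuracy (a hypothetical metric)."""
    if seconds <= 0 or total <= 0:
        raise ValueError("seconds and total must be positive")
    words_per_minute = word_count / (seconds / 60.0)
    accuracy = correct / total  # fraction of comprehension questions answered correctly
    return words_per_minute * accuracy
```

For example, a participant who reads a 300-word passage in 90 seconds and answers 3 of 4 questions correctly would score 200 words per minute × 0.75 = 150. A fast reader who understood little and a slow reader who understood everything can end up with similar scores, which is the trade-off a single combined number hides.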

Some of the challenges we've encountered so far:

  • There's no consistent end action for a reading experience. There's no "check out" like there is in an ecommerce flow, for example. As a result, we don't have a strong sense of when someone is finished reading on a particular topic. If they come to an article and leave it open when they switch to another task, for example, the data would be skewed.
  • Volunteer communities are very different from the casual, anonymous readers who make up the vast majority of our readership. Ideally we would index any data we collect on casual reading experiences somehow. This makes recruitment for a study more difficult.
  • We don't collect data on any readers as a rule, and so we don't have a way to measure reading patterns for a particular reader across time.
  • Given that ~1 billion people per day read the wikis in hundreds of languages, statistical significance in a data set would mean a massive scale for the study.
  • We want any participants to give informed consent to participate in the study.
  • Ideally, we would have a broad representation of languages and scripts to compare in the data set.
  • One of the sticking points for us has been that we need to support quick scanning, long-form, in-depth reading, and many other reading styles that are hybrids of the two. We have no way of knowing if someone has come to an article to scan for a quick piece of information or read an entire article end-to-end.

For scanning, we might want to measure how quickly someone can land on an article and get a specific fact that they're looking for. One idea we had was to have a game where participants see how fast they can find a specific link on a page and click/tap it. That might approximate a scanning use case.

For long-form reading, we could find a specific group of articles/texts, prompt a reader to participate in the study somehow (this part is technically complicated/not feasible), ask them to read a passage, measure how long it took somehow (also technically complicated), and then ask them a question about it after the fact.

We'd love to hear any other ideas you might have about how to measure readability in casual reading experiences. Thanks again for your help!

Hi @Jdlrobson

Thanks for highlighting the challenges! 👍

I'd like to delve a bit deeper into two areas before I respond:

  1. Can I ask what kind of user testing for readability you have done so far on the Wikipedia pages (I am thinking of the last 2 years or so)?

If you have any documentation, please feel free to share it :)

  2. Could you give us some context on the choice to avoid collecting numerical data from online readers? I'm aware of the usual concerns about tracking individuals, but I'm, perhaps somewhat naively, optimistic that a carefully crafted consent form might be welcomed by users who are logged in and are aligned with Wikipedia's mission.

  1. Can I ask you what kind of user testing for readability you have done so far on the Wikipedia pages (I am thinking the last 2 years or so)?

Sure!

  2. Could you give us some context on the choice to avoid collecting numerical data from online readers? I'm aware of the usual concerns about tracking individuals, but I'm, perhaps somewhat naively, optimistic that a carefully crafted consent form might be welcomed by users who are logged in and are aligned with Wikipedia's mission.

As a rule, WMF doesn't want to track reader data at all. I agree that an opt-in study is an option. We have a relatively small list of folks who told us that they have created accounts specifically for reading who we could ask. The main reason we haven't done this is the selection bias that group of potential study participants would have. People who create accounts are more likely to be "power users" of Wikipedia. We would want a study population that mirrors logged-out "casual readers" as much as possible. Check out the research report above to get more information on what those folks tend to look like. The study would need to be quantitative to measure readability, and the cohort available to us doesn't even get close to the size we would need to make claims about statistical significance.

To get a cohort that size, we would need a scaled-up study that interrupts the casual reading experiences of logged-out readers to ask them to participate.
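
To give a feel for what "that size" could mean, here is a minimal sketch of a standard normal-approximation sample-size estimate for comparing the mean of some readability score between two groups (for example, two line-height variants). The function name, the default significance level and power, and all numbers in the usage note are illustrative assumptions, not figures from any study we have run.

```python
from math import ceil
from statistics import NormalDist


def participants_per_group(min_difference: float, std_dev: float,
                           alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate participants needed per group for a two-sided comparison of two means.

    min_difference: smallest difference in the score we want to be able to detect
    std_dev: assumed standard deviation of the score within each group
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # critical value for the desired power
    n = 2 * ((z_alpha + z_beta) * std_dev / min_difference) ** 2
    return ceil(n)
```

With made-up values of a 5 words-per-minute difference and a standard deviation of 50, `participants_per_group(5, 50)` comes out to 1570 participants per group. The estimate is driven by how small a difference we want to detect relative to how much individual readers vary, so subtle typographic effects push the required cohort up quickly.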

Assuming we use time and comprehension as the components of our "readability score", it's unclear how we would time a reading session. When would we stop the clock?
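
One possible answer, sketched below under heavy assumptions, is to avoid a single stop event and instead count only "active" time: time during which the page is visible, with long gaps between interactions capped. The event kinds `"visible"` and `"hidden"` mirror the states a browser reports through the Page Visibility API; the `"activity"` kind (a scroll or tap, say), the event-log format, and the 30-second idle cap are hypothetical choices for illustration, not existing instrumentation.

```python
from typing import Iterable, Optional, Tuple

# (timestamp in seconds, kind), where kind is "visible", "hidden", or "activity"
Event = Tuple[float, str]


def active_reading_seconds(events: Iterable[Event], idle_cap: float = 30.0) -> float:
    """Sum the time the page was visible, counting at most idle_cap seconds per gap.

    Time after the last recorded event is not counted, so the clock
    effectively stops at the reader's final interaction.
    """
    total = 0.0
    visible = False
    last_ts: Optional[float] = None
    for ts, kind in sorted(events):
        # Credit the gap since the previous event only if the page was visible during it.
        if visible and last_ts is not None:
            total += min(ts - last_ts, idle_cap)
        if kind == "visible":
            visible = True
        elif kind == "hidden":
            visible = False
        last_ts = ts
    return total
```

For a session logged as `[(0, "visible"), (20, "activity"), (40, "hidden"), (400, "visible"), (410, "activity"), (2000, "activity")]`, this returns 80.0 seconds: the six minutes in a background tab are ignored, and the long pause before the final interaction is capped at 30 seconds. It still cannot tell reading from staring at the screen, so it is at best a proxy.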

Thanks again for your collaboration on this!

Thanks for sharing the above.

I see a lot of nice stuff here:

  • in-depth references to general academic studies as well as studies specifically about Wikipedia
  • gathering explicit feedback from hundreds of informed users
  • the accessibility form with the slider

Suggestion for additional user study:

What I would suggest adding here is real in situ user data gathered through user testing conducted by the team designing the Wikipedia pages, aka the “go see for yourself” principle.

A concrete example with regard to the line-height change would be to, without too much preparation, grab a phone with the proposed line-height on it, go to a cafe nearby and ask 10 strangers to do a simple reading comprehension assignment (“can you read this article about tea”), watch their impressions closely and then ask a few follow-up questions. Then repeat the process with the normal line-height.

Iterate on this each week, document it, and increase scientific rigour as you learn more about how to conduct these kinds of tests. It will help you bridge, to a higher degree, the gap between what users SAY they do and prefer and what they ACTUALLY do and prefer.

Since the impressions you gather are personal and therefore very strong, this approach carries some risk of self-bias. There will also be a challenge in drawing quantitative conclusions from a small sample and a skewed selection group (“white middle aged New England college professors”).

Gathering quantitative data:
The arguments against tracking power users make a lot of sense. I do believe however that even these users would display “casual” behaviour some of the time and this could perhaps be segmented out from the rest of the traffic (it could even be as easy as just doing a device segmentation).