Evaluating the user experience of AI systems
Type of activity
People who build AI systems often draw on human expertise or human behavior to develop their algorithms. For example, they may have Mechanical Turk workers manually classify data used to train a machine learning model, or use existing human annotations (e.g. Wikipedia Article Assessments) or behavioral traces (e.g. clickthroughs) as features in a recommender system.
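The crowd-labeling step described above can be sketched as follows. This is a minimal illustration, not an existing WMF tool: the `aggregate_labels` helper and the worker labels are hypothetical, and real pipelines usually weight workers by reliability rather than taking a plain majority vote.

```python
from collections import Counter

def aggregate_labels(worker_labels):
    """Aggregate noisy crowd labels (e.g. from Mechanical Turk workers)
    into one label per item by majority vote. `worker_labels` maps each
    item id to the list of labels its workers assigned."""
    return {item: Counter(labels).most_common(1)[0][0]
            for item, labels in worker_labels.items()}

# Hypothetical judgments from three workers per edit
raw = {
    "edit_1": ["damaging", "damaging", "good"],
    "edit_2": ["good", "good", "good"],
}
gold = aggregate_labels(raw)
# gold == {"edit_1": "damaging", "edit_2": "good"}
```

The aggregated labels would then serve as training data for a classifier.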
There is also a growing body of research on how to have people evaluate the output of these types of systems after deployment, and on how to use that feedback to improve the usefulness and relevance of the system output for end users.
1. The problem
The Wikimedia Foundation has begun to develop a variety of algorithmic systems and services to classify, rank, or recommend content and other types of contributions. So far, there has been no coordinated attempt to develop evaluation plans for our AI systems.
As WMF deploys more AI-based features, we will need to develop efficient and flexible methods for using human judgement to refine these algorithms over time and adapt them to new contexts.
2. Expected outcome
- Ideas for how to evaluate the user experience of different types of AI systems (e.g. comparing the outputs of different algorithms through MTurk/Crowdflower studies, A/B tests, user studies, interviews, microsurveys)
- Identification of the kinds of user experience issues that current and proposed systems may face.
- A discussion of the audience(s) and purpose of our current and proposed AI systems: who are we designing them for? What tasks/scenarios are we designing them to support?
- Technical requirements for incorporating user feedback into deployed AI systems.
- Best practices for the development of generalizable tools, workflows, and metrics for establishing benchmarks and testing system UX over time.
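One way to make the last two outcomes concrete is a lightweight feedback record that deployed systems log and aggregate into a benchmark tracked across releases. The sketch below is an assumption for discussion, not an existing WMF schema: the `Feedback` record and `acceptance_rate` metric are hypothetical names.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Feedback:
    """One end-user judgment of a system output (hypothetical schema)."""
    model_version: str
    item_id: str
    useful: bool  # e.g. thumbs up / thumbs down from a microsurvey

def acceptance_rate(events):
    """Fraction of outputs users marked useful, grouped by model version,
    so the metric can be compared across deployments over time."""
    totals, positives = defaultdict(int), defaultdict(int)
    for e in events:
        totals[e.model_version] += 1
        positives[e.model_version] += e.useful
    return {v: positives[v] / totals[v] for v in totals}

events = [
    Feedback("v1", "Q1", True), Feedback("v1", "Q2", False),
    Feedback("v2", "Q3", True), Feedback("v2", "Q4", True),
]
# acceptance_rate(events) == {"v1": 0.5, "v2": 1.0}
```

A shared schema like this is what would let one evaluation workflow serve many AI systems instead of one bespoke tool per system.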
3. Current status of the discussion
We have already developed a few systems for training and testing some of our AI systems (e.g. WikiLabels for ORES, and Discernatron for CirrusSearch). These could potentially be expanded and generalized. Having a separate tool (or method) to evaluate each AI system doesn't seem efficient or sustainable.
- The Reading team is currently discussing a UX evaluation of RelatedArticles: T142009: Related Pages recommendations user study design
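A side-by-side study like the one above typically asks raters which of two algorithms' outputs they prefer for the same input. Summarizing such judgments can be sketched as below; the `preference_summary` helper and the vote data are hypothetical, and a real study would also test whether the difference is statistically significant.

```python
def preference_summary(judgments):
    """Summarize side-by-side preference judgments ('A', 'B', or 'tie')
    collected when raters compare outputs of two algorithms."""
    n = len(judgments)
    return {choice: judgments.count(choice) / n for choice in ("A", "B", "tie")}

# e.g. 10 raters compared recommendations from two algorithms side by side
votes = ["A"] * 6 + ["B"] * 3 + ["tie"]
# preference_summary(votes) == {"A": 0.6, "B": 0.3, "tie": 0.1}
```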
Preferred group size
Any supplies that you would need to run the session
Post-its and markers