Evaluating the user experience of AI systems

T147708: Facilitate Wikidev'17 main topic "Artificial Intelligence to build and navigate content"

Unconference session


People who build AI systems often draw on human expertise or human behavior to develop their algorithms. Examples include having Mechanical Turk workers manually classify the data used to train a machine learning model, or using existing human annotations (e.g. Wikipedia Article Assessments) or behavior traces (e.g. clickthroughs) as features in a recommender system.

There is also a growing body of research (PDF) on how to get people to evaluate the output of these types of systems, post-deployment, and to use that feedback to improve the usefulness and relevance of the system output for end users.

1. The problem

The Wikimedia Foundation has begun to develop a variety of algorithmic systems and services to classify, rank, or recommend content and other types of contribution. So far, there has been no coordinated attempt to develop evaluation plans for our AI systems.

As WMF deploys more AI-based features, we will need to develop efficient and flexible methods for using human judgement to refine these algorithms over time and adapt them to new contexts.

2. Expected outcome

  • Ideas for how to evaluate the user experience of different types of AI systems (e.g. comparing outputs of different algorithms through MTurk/Crowdflower studies, A/B test user studies, interviews, microsurveys)
  • Identification of the kinds of user experience issues that current and proposed systems may face.
  • A discussion of the audience(s) and purpose of our current and proposed AI systems: who are we designing them for? What tasks/scenarios are we designing them to support?
  • Technical requirements for incorporating user feedback into deployed AI systems.
  • Best practices for the development of generalizable tools, workflows, and metrics for establishing benchmarks and testing system UX over time.

3. Current status of the discussion

We have already developed a few systems to train and test some of our AI systems (e.g. WikiLabels for ORES, and Discernatron for CirrusSearch). These could potentially be expanded and generalized. Having a separate tool (or method) to evaluate each AI system doesn't seem efficient or sustainable.

@Capt_Swing, @Halfak

Post-its and markers

This is a great topic. I'd like to discuss some social process-informed quantitative evaluations of AIs. I did some writing about counter-vandalism prediction models here:

Would be very interested in this topic. Evaluation of such systems is easy to get wrong, and increasingly important, so it seems ripe for discussion.

A large part of this seems to be about just having good feedback systems in general. New article patroller notices that a change marked as "probably vandalism" is perfectly fine and should not have been flagged -> ??? -> profit. Our current approach for ??? is "go find someone who knows what ORES is; then they can find someone who knows what Phabricator is; then that person can file a bug" which is not that great. OTOH uninformed bug reports can be a waste of both editor and developer time, and the unreliability of AI systems makes things harder (at what point should I report that ORES flags too many changes as vandalism, if the algorithm is expected to have a certain amount of false positives?)
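One rough way to approach the "at what point should I report" question is a one-sided binomial check: given the share of flagged edits that is expected to be fine, how surprising is the number of bad flags a patroller actually saw? The following is a minimal sketch under stated assumptions, not anything ORES provides. The function names, the 10% expected rate, and the 0.05 threshold are all hypothetical values chosen for illustration.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def worth_reporting(false_flags: int, reviewed: int,
                    expected_rate: float, alpha: float = 0.05) -> bool:
    """Decide whether the observed number of bad flags is surprising.

    false_flags:   flagged edits the reviewer judged to be fine
    reviewed:      total flagged edits the reviewer looked at
    expected_rate: share of flagged edits expected to be fine
                   (hypothetical; would have to come from the model's
                   published precision at the threshold in use)
    alpha:         significance threshold (an assumption, not a standard
                   anyone has agreed on)
    """
    return binomial_tail(false_flags, reviewed, expected_rate) < alpha


# Hypothetical numbers: 100 flagged edits reviewed, 10% expected to be fine.
print(worth_reporting(12, 100, 0.10))  # 12 bad flags
print(worth_reporting(30, 100, 0.10))  # 30 bad flags
```

A check like this only says whether the count is out of line with the stated expectation. It does not say why, so it would complement a feedback workflow rather than replace one.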

This is closely related to T147929: Algorithmic dangers and transparency -- Best practices - can we explain to the user why the algorithm gave them that result, and how confident it is in the result? To enable effective review of potential algorithm bias by the editor community, it would be necessary to expose some internals (and preferably allow browsing the results by those internals, e.g. "show me diffs which have been labeled as harmful based on the same criteria as this one"). If that information is available, editors have a fighting chance to spot systemic errors in the predictions.

We have already developed a few systems to train and test some of our AI systems (e.g. WikiLabels for ORES, and Discernatron for CirrusSearch).

In what sense is CirrusSearch an AI system?

From a computer science perspective, information retrieval is firmly within the space of AI.

You know IBM's Watson? That's kind of like a really fancy CirrusSearch with some voice2text2voice.

@Tgr better workflows for end users to provide feedback are part of the solution. You can also evaluate AI systems with user studies (see McNee et al. 2006 (PDF)), and through post-deployment A/B testing (comparing metrics such as session length, bounce rate, lift, etc.). I'd like to have a conversation that includes all of these options, plus any others I'm not familiar with. I agree it's related to T147929. Hopefully both proposals will be accepted, and the sessions won't be scheduled on top of one another.
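To make the A/B testing option concrete, here is a minimal sketch of comparing bounce rate between a control group and a treatment group with a two-proportion z-test. The function name and the session counts are hypothetical, and a real analysis would also need to consider sample size planning and repeated looks at the data.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x_a: int, n_a: int, x_b: int, n_b: int):
    """Compare two proportions (e.g. bounce rates) from an A/B test.

    x_a, n_a: bounced sessions and total sessions in group A (control)
    x_b, n_b: bounced sessions and total sessions in group B (treatment)
    Returns (z, two_sided_p_value). Uses the pooled standard error, which
    assumes both groups share one rate under the null hypothesis.
    """
    p_a = x_a / n_a
    p_b = x_b / n_b
    pooled = (x_a + x_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 400 of 1000 control sessions bounced,
# 340 of 1000 treatment sessions bounced.
z, p = two_proportion_z_test(400, 1000, 340, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A negative z here means the treatment group bounced less often than control. Whether a lower bounce rate reflects a better user experience is a separate question that user studies or microsurveys would have to answer.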

