
Evaluating the user experience of AI systems
Closed, Resolved · Public


Session title

Evaluating the user experience of AI systems

Main topic

T147708: Facilitate Wikidev'17 main topic "Artificial Intelligence to build and navigate content"

Type of activity

Unconference session


People who build AI systems often draw on human expertise or human behavior to develop their algorithms. For example, they may have Mechanical Turk workers manually classify data used to train a machine learning model, or use existing human annotations (e.g. Wikipedia Article Assessments) or behavioral traces (e.g. clickthroughs) as features in a recommender system.

There is also a growing body of research (PDF) on how to get people to evaluate the output of these types of systems, post-deployment, and to use that feedback to improve the usefulness and relevance of the system output for end users.

1. The problem

The Wikimedia Foundation has begun to develop a variety of algorithmic systems and services to classify, rank, or recommend content and other types of contributions. So far, there has been no coordinated attempt to develop evaluation plans for our AI systems.

As WMF deploys more AI-based features, we will need to develop efficient and flexible methods for using human judgement to refine these algorithms over time and adapt them to new contexts.

2. Expected outcome

  • Ideas for how to evaluate the user experience of different types of AI systems (e.g. comparing outputs of different algorithms through MTurk/Crowdflower studies, A/B test user studies, interviews, microsurveys)
  • Identification of the kinds of user experience issues that current and proposed systems may face.
  • A discussion of the audience(s) and purpose of our current and proposed AI systems: who are we designing them for? What tasks/scenarios are we designing them to support?
  • Technical requirements for incorporating user feedback into deployed AI systems.
  • Best practices for the development of generalizable tools, workflows, and metrics for establishing benchmarks and testing system UX over time.

3. Current status of the discussion

We have already developed a few systems to train and test some of our AI systems (e.g. WikiLabels for ORES, and Discernatron for CirrusSearch). These could potentially be expanded and generalized. Having a separate tool (or method) to evaluate each AI system doesn't seem efficient or sustainable.

4. Links

Proposed by

@Capt_Swing, @Halfak

Preferred group size


Any supplies that you would need to run the session

Post-its and markers

Interested attendees (sign up below)

Add your name here

Event Timeline

Capt_Swing renamed this task from Evaluating AI systems to Evaluating the user experience of AI systems. Oct 28 2016, 10:18 PM
Capt_Swing updated the task description. (Show Details)

This is a great topic. I'd like to discuss some social-process-informed quantitative evaluations of AIs. I did some writing about counter-vandalism prediction models here:

Would be very interested in this topic. Evaluation of such systems is easy to get wrong, and increasingly important, so it seems ripe for discussion.

Capt_Swing moved this task from Backlog to In Progress on the Design-Research board.
Quiddity updated the task description. (Show Details)

A large part of this seems to be about just having good feedback systems in general. A new-page patroller notices that a change marked as "probably vandalism" is perfectly fine and should not have been flagged -> ??? -> profit. Our current approach for ??? is "go find someone who knows what ORES is; then they can find someone who knows what Phabricator is; then that person can file a bug", which is not great. OTOH, uninformed bug reports can waste both editor and developer time, and the unreliability of AI systems makes things harder (at what point should I report that ORES flags too many changes as vandalism, if the algorithm is expected to have a certain rate of false positives?)
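One way to decide when an observed excess of false positives is worth reporting, rather than expected noise, is a simple binomial tail test. A minimal sketch; the 10% expected rate and the counts are hypothetical illustration, not ORES's actual figures:

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of seeing at least
    k false positives among n flagged edits if the true rate is p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical numbers: the model is documented to have a ~10%
# false-positive rate; a patroller observed 18 false positives
# out of 100 flagged edits.
expected_rate = 0.10
observed_fp, flagged = 18, 100

p_value = binom_tail(flagged, observed_fp, expected_rate)
# If p_value is small (say < 0.05), the excess is unlikely to be
# ordinary noise, and a bug report is probably warranted.
print(f"P(>= {observed_fp} false positives at rate {expected_rate}) = {p_value:.4f}")
```

A feedback tool could run a check like this in the background and only surface "the model seems miscalibrated here" once the evidence crosses a threshold, sparing editors from filing reports about expected misses.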

This is closely related to T147929: Algorithmic dangers and transparency -- Best practices - can we explain to the user why the algorithm gave them that result, and how confident it is in the result? To enable effective review of potential algorithm bias by the editor community, it would be necessary to expose some internals (and preferably allow browsing the results by those internals, e.g. "show me diffs which have been labeled as harmful based on the same criteria as this one"). If that information is available, editors have a fighting chance of spotting systemic errors in the predictions.

We have already developed a few systems to train and test some of our AI systems (e.g. WikiLabels for ORES, and Discernatron for CirrusSearch).

In what sense is CirrusSearch an AI system?

From a computer science perspective, information retrieval is firmly within the space of AI.

You know IBM's Watson? That's kind of like a really fancy CirrusSearch with some voice2text2voice.

@Tgr better workflows for end users to provide feedback is part of the solution. You can also evaluate AI systems with user studies (see McNee et al. 2006 (PDF)), and through post-deployment A/B testing (comparing metrics such as session length, bounce rate, lift, etc). I'd like to have a conversation that includes all of these options, plus any others I'm not familiar with. I agree it's related to T147929. Hopefully both proposals will be accepted, and the sessions won't be scheduled on top of one another.
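For the post-deployment A/B testing mentioned above, a common building block for binary metrics (clickthrough, bounce) is a two-proportion z-test. A minimal sketch with made-up counts, not real WMF data:

```python
from math import sqrt, erf

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two proportions,
    e.g. clickthrough rates of two ranking algorithms in an A/B test."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal tail.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical data: variant A got 520 clickthroughs in 4000 sessions,
# variant B got 440 in 4000.
z, p = two_proportion_z(520, 4000, 440, 4000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Metrics like session length would instead call for a test on means (e.g. a t-test), but the overall workflow of comparing algorithm variants on live traffic is the same.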

@Capt_Swing Hey! As the developer summit is less than four weeks away, we are working on a plan to incorporate the unconference sessions that have been proposed so far, as well as those generated on the spot. Could you confirm whether you plan to facilitate this session at the summit? If your answer is 'YES,' I would encourage you to update/arrange the task description fields to match the following format:

Session title
Main topic
Type of activity
Description: Move ‘The Problem,' ‘Expected Outcome,' ‘Current status of the discussion’ and ‘Links’ to this section
Proposed by: Your name, linked to your MediaWiki URL or a profile elsewhere on the internet
Preferred group size
Any supplies that you would need to run the session (e.g. post-its)
Interested attendees (sign up below)

  1. Add your name here

We will be reaching out to the summit participants next week asking them to express their interest in unconference sessions by signing up.

To maintain consistency, please consider referring to the following task description as a template:

@srishakatux Confirmed: I plan to facilitate this session, and I will update the description per the provided template. Thanks!

@srishakatux I've updated the session description per the provided template. Let me know if I need to do anything else to get this approved. Thanks!

To the facilitator of this session: We have updated the unconference page with more instructions and FAQs. Please review it in detail before the summit. If there are any questions or confusion, please ask! If your session gets a spot on the schedule, we would like you to read the session guidelines in detail. We would also expect you to recruit note-takers (2 minimum, 3 maximum), a remote moderator, and an advocate (optional) on the spot before the beginning of your session. Instructions for each role are outlined in the guidelines. Physical versions of the role cards will be available in all the session rooms! See you at the summit! :)