**Type of activity:** Pre-scheduled session
**Main topic:** {T147708}
Developers of AI systems sometimes draw on human expertise or human behavior to develop their algorithms: for example, having Mechanical Turk workers manually classify the data used to train a machine learning model, or using existing human annotations (e.g. Wikipedia Article Assessments) or behavioral traces (e.g. clickthroughs) as features in a recommender system.
There is also a growing body of research on how to get people to evaluate the output of such systems post-deployment, and how to use that feedback to improve the accuracy and relevance of system output for end users.
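To make the first pattern concrete, here is a minimal sketch (in Python, with hypothetical edit features, labels, and model choice; it does not describe any existing WMF pipeline) of training a classifier from crowdsourced human judgments:

```python
# Illustrative only: the features, labels, and model choice are assumptions,
# not part of any deployed Wikimedia system.
from sklearn.ensemble import GradientBoostingClassifier

# One row per edit: [characters added, editor is anonymous, references removed]
X = [[12, 0, 1], [340, 1, 0], [5, 1, 3], [87, 0, 0]]
# Human judgments collected through a labeling campaign (1 = damaging edit)
y = [0, 0, 1, 0]

model = GradientBoostingClassifier().fit(X, y)

# Score a new, unlabeled edit; the resulting probability could be surfaced
# to patrollers or used to rank a review queue.
print(model.predict_proba([[7, 1, 2]]))
```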
== The problem ==
The Wikimedia Foundation has begun to develop a variety of algorithmic systems and services to classify, rank, or recommend content and other types of contribution. So far, there has been no coordinated attempt to develop evaluation plans for our AI systems.
As WMF deploys more AI-based features, we will need to develop efficient and flexible methods for using human judgement to refine these algorithms over time and adapt them to new contexts.
== Expected outcome ==
- Ideas for how to evaluate the user experience of different types of AI systems (e.g. comparing the outputs of different algorithms through MTurk/Crowdflower studies, A/B tests, user interviews, and microsurveys); a minimal sketch of aggregating such judgments follows this list.
- Identification of the kinds of user experience issues that current and proposed systems may face.
- A discussion of the audience(s) and purpose of our current and proposed AI systems: who are we designing them for? What tasks/scenarios are we designing them to support?
- Technical requirements for incorporating user feedback into deployed AI systems.
- Best practices for developing generalizable tools, workflows, and metrics that establish benchmarks for testing system UX over time.
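For the first outcome above, the aggregation step can stay simple regardless of how the judgments are collected. Below is a minimal sketch (hypothetical data; "A" and "B" stand for two candidate ranking or recommendation algorithms, and the collection method, whether MTurk, a microsurvey, or an A/B test, is left open) of summarizing pairwise human preferences:

```python
import math

# Each judgment records which algorithm's output the rater preferred for one
# query or article. The data below is made up for illustration.
judgments = ["A", "A", "B", "A", "B", "A", "A", "B", "A", "A"]

n = len(judgments)
wins_a = judgments.count("A")
p = wins_a / n

# Normal-approximation 95% confidence interval for the preference rate.
half_width = 1.96 * math.sqrt(p * (1 - p) / n)
print(f"A preferred in {p:.0%} of comparisons (+/-{half_width:.0%}, n={n})")
```

With real data, the same structure extends naturally to per-query breakdowns or a significance test before declaring one algorithm better.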
== Current status of the discussion ==
We have already developed a few tools to train and test some of our AI systems (e.g. WikiLabels for ORES, Discernatron for CirrusSearch). These could potentially be expanded and generalized. Having a separate tool (or method) to evaluate each AI system doesn't seem efficient or sustainable.
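Whatever the labeling tool, the comparison it feeds is the same: human judgments on a sample of model outputs versus the model's own predictions. The sketch below (with made-up labels and predictions, not real ORES data) computes precision and recall for a binary classifier from such a sample:

```python
# Illustrative only: labels and predictions are made up; in practice they would
# come from a labeling campaign and from the deployed model, respectively.
human_labels = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = damaging, per human raters
model_preds  = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions for the same edits

tp = sum(1 for h, m in zip(human_labels, model_preds) if h == 1 and m == 1)
fp = sum(1 for h, m in zip(human_labels, model_preds) if h == 0 and m == 1)
fn = sum(1 for h, m in zip(human_labels, model_preds) if h == 1 and m == 0)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```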
The Reading team is currently discussing a UX evaluation of #RelatedArticles: {T142009}
== Links ==
* ...