**Type of activity:** Pre-scheduled session
**Main topic:** {T147708}
Developers of AI systems sometimes draw on human expertise or human behavior to develop their algorithms. For example, having Mechanical Turk workers manually classify data used to train a machine learning model, or using pre-existing human annotations (e.g. Wikipedia Article Assessments) or behavior traces (e.g. clickthroughs) as features in a recommender system.
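As a minimal sketch of the first approach, the snippet below trains a classifier on human-provided quality labels. The feature columns, label values, and data are hypothetical placeholders, not the actual ORES feature set:

```python
# Sketch: training a classifier on pre-existing human annotations
# (e.g. Article Assessment classes). Features and data are hypothetical.
from sklearn.ensemble import RandomForestClassifier

# One row per human-assessed revision; columns (hypothetical):
# [n_words, n_references, n_images, n_section_headings]
X = [
    [120,   0, 0,  1],
    [4500, 35, 4, 12],
    [900,   3, 1,  4],
    [60,    0, 0,  0],
    [2300, 18, 2,  9],
    [1500,  7, 1,  6],
]
# Quality labels drawn from human annotations.
y = ["Stub", "FA", "C", "Stub", "GA", "B"]

model = RandomForestClassifier(n_estimators=100).fit(X, y)
print(model.predict([[700, 2, 0, 3]]))  # predicted class for an unseen revision
```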
There is also a growing body of research on how to get people to evaluate the output of these types of systems post-deployment, and how to use that feedback to improve the accuracy and relevance of system output for end users.
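A minimal sketch of such a feedback loop, assuming a hypothetical thumbs-up/down widget on each model output (none of these names correspond to an existing WMF service):

```python
# Sketch: aggregating post-deployment human judgements to flag model
# outputs that should be re-labeled and folded back into the training set.
from collections import defaultdict

feedback = defaultdict(lambda: {"up": 0, "down": 0})

def record_judgement(output_id, thumbs_up):
    """Log one reader judgement of a single model output."""
    key = "up" if thumbs_up else "down"
    feedback[output_id][key] += 1

def outputs_needing_relabel(min_votes=20, max_approval=0.5):
    """Outputs whose approval rate suggests the model got them wrong;
    these would be queued for human re-labeling."""
    flagged = []
    for output_id, counts in feedback.items():
        votes = counts["up"] + counts["down"]
        if votes >= min_votes and counts["up"] / votes <= max_approval:
            flagged.append(output_id)
    return flagged
```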
== The problem ==
The Wikimedia Foundation has begun to develop a variety of algorithmic systems and services to classify, rank, or recommend content and other types of contributions.
So far, there has been no coordinated attempt to develop evaluation plans for our AI systems as a whole.
As WMF deploys more AI-based features, we will need to develop efficient and flexible methods for using human judgement to refine these algorithms over time and adapt them to new contexts.
== Expected outcome ==
- Ideas for how to evaluate the user experience of different types of AI systems (e.g. comparing the outputs of different algorithms through MTurk/Crowdflower studies, A/B tests, user studies, interviews, microsurveys); see the bucketing sketch after this list
- Identification of the kinds of user experience issues that current and proposed systems may face.
- A discussion of the audience(s) and purpose of our current and proposed AI systems: who are we designing them for? What tasks/scenarios are we designing them to support?
- Technical requirements for incorporating user feedback into deployed AI systems.
- Best practices for the development of generalizable tools, workflows, and metrics for testing system UX over time.
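As one illustration of the evaluation methods above, here is a minimal sketch of deterministic A/B bucketing for comparing two ranking algorithms, with a one-question microsurvey rating logged per impression. All names are hypothetical:

```python
# Sketch: hash-based A/B assignment plus microsurvey response logging.
import hashlib

VARIANTS = ["ranker_a", "ranker_b"]

def assign_variant(user_token):
    """Deterministically bucket a user so they always see the same variant."""
    digest = hashlib.sha256(user_token.encode()).hexdigest()
    return VARIANTS[int(digest, 16) % len(VARIANTS)]

survey_responses = []  # (variant, rating) pairs

def record_survey(user_token, rating):
    """Log a 1-5 'Was this recommendation useful?' microsurvey rating."""
    survey_responses.append((assign_variant(user_token), rating))

def mean_rating(variant):
    """Average microsurvey rating for one algorithm variant."""
    ratings = [r for v, r in survey_responses if v == variant]
    return sum(ratings) / len(ratings) if ratings else None
```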
== Current status of the discussion ==
We have already developed a few systems to train and test some of our AI systems (e.g. WikiLabels for ORES, Discernatron for CirrusSearch, Crowdflower for Detox). These could potentially be expanded to evaluate current outputs and feed re-labeled data back into the system, or generalized. Researchers elsewhere have used surveys and user study methodologies to achieve the same effect. Having a separate tool (or method) to evaluate each AI system doesn't seem efficient or sustainable.
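A minimal sketch of what "feeding re-labeled data back into the system" could look like, assuming hypothetical data shapes (this is not WikiLabels' actual output format): newer human judgements override the original labels, and the model is re-fit on the merged set.

```python
# Sketch: merging re-labeled data back into the training set and re-fitting.
from sklearn.linear_model import LogisticRegression

original = {  # item_id -> (features, label); hypothetical data
    "rev1": ([1.0, 0.2], 0),
    "rev2": ([0.3, 0.9], 1),
    "rev3": ([0.8, 0.1], 0),
}
relabeled = {"rev3": 1}  # corrections from a human re-labeling campaign

def retrain(original, relabeled):
    """Re-fit the model, preferring the newer human judgement per item."""
    X, y = [], []
    for item_id, (features, label) in original.items():
        X.append(features)
        y.append(relabeled.get(item_id, label))
    return LogisticRegression().fit(X, y)

model = retrain(original, relabeled)
```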
The Reading team is currently discussing a UX evaluation of #RelatedArticles: {T142009}
== Links ==
* ...