**Open Questions**
- [ ] How can we grade all 3000 options by the end of the experiment? How many people will we need, and over what length of time, to accomplish this?
- [ ] Do we want people to tell us why they rejected or skipped? Which options?
- [ ] How many recommendations should we show each user?
- [ ] How could we measure user “accuracy”? Similarly, how could we measure algorithm quality? For example, in the user tests, Marshall and I added a column denoting “Correct decision” (as judged by us) and compared it to the participant responses to see the breakdown of accurate vs. inaccurate responses. How could we do that here?
- [ ] Do we want to get answers for a lot of recommendations, or have a lot of recommendations exposed? How many users do we need to show each one to? Which will help the algorithm more?
- [ ] What is the daily volume that comes from article descriptions? This question is intended as a precursor to extrapolating whether 9000 is a high number and when Android users would complete 9000 total judgements.
- [x] Do we want to show it to everyone? Are there language requirements?
- [ ] How will we know whether people like the task?
- [ ] Can we tell which suggestions are easy versus hard?
- [ ] Can we see how people like it and, if so, break that down by user type?
- [x] What will be our checkpoints for deciding whether to tweak the feature?
- [x] Do we want to test how well users understand the task, and their ability to write good captions for accepted recommendations?
- [ ] How do we learn what supplementary information users need to make good decisions?
- [ ] Which info (image categories, source of suggestion, image resolution, etc.) do people refer to the most in making their decisions?
- [ ] Which information leads to more accurate ratings?
- [x] Do we also want to enable users to filter suggestions by ORES interest topics in this test tool?
- [ ] Practical usage by (especially experienced) editors: do we want to monitor whether people start using the tool in unintended ways, such as using it to find unillustrated articles and add their own images?
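As a rough sketch of the accuracy measurement described in the questions above (comparing participant responses against an expert “Correct decision” label), something like the following could work; all function names, field names, and data here are hypothetical illustrations, not an existing implementation:

```python
# Hypothetical sketch: compare user accept/reject responses against
# expert "Correct decision" labels, as in the earlier user tests.
# All identifiers and data below are illustrative assumptions.

def accuracy_breakdown(responses, correct_decisions):
    """Return counts of accurate vs. inaccurate responses.

    responses: dict mapping suggestion id -> user's decision ("accept"/"reject")
    correct_decisions: dict mapping suggestion id -> expert decision
    """
    # A response is "accurate" when it matches the expert label.
    accurate = sum(
        1 for sid, decision in responses.items()
        if correct_decisions.get(sid) == decision
    )
    # Only count responses for which an expert label exists.
    graded = sum(1 for sid in responses if sid in correct_decisions)
    return {"accurate": accurate, "inaccurate": graded - accurate}

user = {"s1": "accept", "s2": "reject", "s3": "accept"}
expert = {"s1": "accept", "s2": "accept", "s3": "accept"}
print(accuracy_breakdown(user, expert))  # {'accurate': 2, 'inaccurate': 1}
```

The same per-response comparison could be aggregated per user (to study user accuracy) or per suggestion (to study algorithm quality).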
**Product Decisions**
- We will have one suggested image per article instead of multiple images
- This iteration of the MVP will not include Image Captions
- There are no language constraints for this task: as long as an article is available in a language, we will surface it. We want to be deliberate about ensuring this task is completed in a variety of languages. For this MVP to be considered a success, we want the task completed in at least five different languages, including English, an Indic language, and a Latin-script language.
- We will have a checkpoint two weeks after the launch of the feature to check whether it is working properly and whether modifications are needed to ensure we are getting answers to our core questions. The checkpoint is not intended to introduce scope creep.
- We aren't able to filter by categories in this iteration of the MVP, but it could become possible in the future through the CPT API.