The goal of this task is to create a tool for manually testing the image recommendations POC API once T260832 is complete. A follow-up task will cover the manual testing itself.
T273527 is the task to publish the API spec. Work on this tool can start once that is complete, though it can't be finished until T260832 is complete.
- The tool will evaluate results on Arabic, Cebuano, English, Vietnamese, Bengali and Czech wikis
- The tool will allow the user to choose which wiki/language they want to evaluate
- The tool will evaluate 500 unillustrated articles from each wiki
- The tool will call the API to get 500 random unillustrated articles from each wiki, along with all of their image recommendations
- The tool will ensure that the 500 unillustrated articles provide a (close to) equal number of results from the Image Recommendations Algorithm and from MediaSearch
- The tool will display the output for evaluation (both a preview of the article text and the recommended image), similar to https://media-search-signal-test.toolforge.org/
- The tool will allow testers to manually rate each result for each unillustrated article as a good, okay, or bad match. The tool will also allow testers to say that they are unsure whether the match is good.
- The tool will allow testers to manually decide whether the recommended image is explicit/NSFW (okay/explicit/unsure)
- The tool will output the results into a spreadsheet, showing how many good, okay, and bad matches were produced for each article, whether the tester was "unsure", and what the source of each of those matches was
- Spreadsheet columns will be: wiki; article name; image; match strength (good/okay/bad/unsure); source (Wikidata, interlinks, Commons category, or MediaSearch); explicit (okay/explicit/unsure)
- The tool will log the API response time to evaluate performance.
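
A minimal sketch of the spreadsheet output and per-article tallies described above, to make the expected row shape concrete. It assumes CSV as the spreadsheet format. The `Rating` record, the function names, and the exact header spellings are hypothetical and are not taken from the API spec; only the column list and the allowed values come from the requirements in this task. It also assumes every source other than MediaSearch counts as the Image Recommendations Algorithm when checking the balance between the two.

```python
import csv
import io
from collections import Counter
from dataclasses import dataclass

# Columns from the requirements above; the exact header spelling is an assumption.
COLUMNS = ["wiki", "article name", "image", "match strength", "source", "explicit"]

MATCH_VALUES = {"good", "okay", "bad", "unsure"}
SOURCE_VALUES = {"Wikidata", "interlinks", "Commons category", "MediaSearch"}
EXPLICIT_VALUES = {"okay", "explicit", "unsure"}


@dataclass(frozen=True)
class Rating:
    """One tester judgement for one recommended image (hypothetical record shape)."""
    wiki: str
    article: str
    image: str
    match: str
    source: str
    explicit: str

    def __post_init__(self):
        # Reject values outside the sets listed in the requirements.
        if self.match not in MATCH_VALUES:
            raise ValueError(f"unknown match strength: {self.match!r}")
        if self.source not in SOURCE_VALUES:
            raise ValueError(f"unknown source: {self.source!r}")
        if self.explicit not in EXPLICIT_VALUES:
            raise ValueError(f"unknown explicit value: {self.explicit!r}")


def write_spreadsheet(ratings, out):
    """Write a header row plus one row per rating to a text file object."""
    writer = csv.writer(out)
    writer.writerow(COLUMNS)
    for r in ratings:
        writer.writerow([r.wiki, r.article, r.image, r.match, r.source, r.explicit])


def tally_by_article(ratings):
    """Count good/okay/bad/unsure ratings per (wiki, article) pair."""
    counts = {}
    for r in ratings:
        counts.setdefault((r.wiki, r.article), Counter())[r.match] += 1
    return counts


def source_balance(ratings):
    """Return (algorithm_count, mediasearch_count) for a collection of ratings.

    Assumption: any source other than MediaSearch is treated as coming
    from the Image Recommendations Algorithm.
    """
    ratings = list(ratings)
    mediasearch = sum(1 for r in ratings if r.source == "MediaSearch")
    return len(ratings) - mediasearch, mediasearch


# Usage with made-up example data (not real API output).
example = [
    Rating("cswiki", "Praha", "Prague.jpg", "good", "Wikidata", "okay"),
    Rating("cswiki", "Praha", "Castle.jpg", "bad", "MediaSearch", "okay"),
]
buffer = io.StringIO()
write_spreadsheet(example, buffer)
print(buffer.getvalue())
print(tally_by_article(example))
print(source_balance(example))
```

Keeping one row per rated image, rather than one row per article, means the per-article totals and the split between the two sources can both be derived from the same file.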