Based on the first round of section-level image suggestions evaluation results, we decided to do more work to remove image suggestions for sections with tables and lists (T330841, T330848), remove image suggestions for short sections (T329282), and remove image suggestions for sections that already have an image (T330516).
We also decided to do more work to refine P18, P373, and lead image based suggestions (T330773).
Once those tickets are done, this ticket covers another round of internal and ambassador manual evaluation using https://section-image-suggestions-test.toolforge.org/. This time we would like to include the number and % of clicks on "this section should not have an image".
Acceptance Criteria
- Update the test data with the results of T330516, T329282, T330841, T330848, and T330773
- Include the option "This image is offensive" in addition to "Good", "Bad", and "This section shouldn't have an image"
- Run another round of evaluation using https://section-image-suggestions-test.toolforge.org/ with updated data
- The outputs should include:
wiki | % good intersection | % good alignment | % good P18/P373/lead image | % sections that should not have an image | % offensive images | total rated suggestions |
- "% good" should mean the % of the total rated suggestions that were rated good, *not* the % of those remaining after excluding "this section should not have an image" or "this image is offensive" -- therefore, ratings for "this section should not have an image" and "this image is offensive" should be counted in the total rated suggestions
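The "% good" rule above can be sketched as follows. This is a minimal illustration, not the evaluation tool's actual code: the rating label strings and the `summarize` function are hypothetical names, and how "unsure" ratings are handled is not specified in this ticket, so the sketch assumes the caller has already decided whether to include them.

```python
from collections import Counter

# Hypothetical label strings; the tool's stored values may differ.
GOOD = "good"
BAD = "bad"
NO_IMAGE = "section should not have an image"
OFFENSIVE = "offensive"


def summarize(ratings):
    """Return percentages computed against ALL rated suggestions.

    "No image" and "offensive" ratings stay in the denominator, so they
    lower "% good" rather than being excluded from it.
    """
    counts = Counter(ratings)
    total = sum(counts.values())
    if total == 0:
        return None  # nothing rated, so percentages are undefined

    def pct(label):
        return round(100 * counts[label] / total, 1)

    return {
        "pct_good": pct(GOOD),
        "pct_no_image": pct(NO_IMAGE),
        "pct_offensive": pct(OFFENSIVE),
        "total_rated": total,
    }


example = summarize([GOOD, GOOD, BAD, NO_IMAGE])
```

In this example two of four ratings are good, so "% good" is 50 rather than the 67 that would result from dropping the "no image" rating from the denominator.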
Instructions for ambassadors:
- Evaluate 500 random section-level image suggestions from 500 different random articles, per wiki.
- Ambassadors will need to count and keep track of how many suggestions they have evaluated in their language -- the tool will not capture that.
- For each result for each unillustrated article, manually decide whether the match is good or bad; alternatively note if the image is offensive or if the section should not have an image. You can also choose "unsure" if you are not confident in your selection.
- General comments or questions during evaluation can be posted as comments in this ticket.
- The estimated time for manual evaluation is 3 hours for the 500 images. If 3 hours pass before you finish, please leave a comment and we can decide whether to continue with further evaluation.
Results
wiki | % good alignment | % good intersection | % good P18/P373/lead image | % sections that should not have an image | % offensive images | total rated suggestions |
arwiki | 71 | 91 | 54 | 6 | 0.4 | 511 |
bnwiki | 28 | 86 | 26 | 24 | 0 | 204 |
cswiki | 41 | 77 | 23 | 13 | 0 | 128 |
enwiki | 76 | 96 | 75 | 3 | 0 | 75 |
eswiki | 60 | 67 | 48 | 27 | 0.2 | 549 |
frwiki | N.A. | N.A. | 100 | N.A. | N.A. | 3 |
idwiki | 66 | 81 | 37 | 37 | 0 | 315 |
ptwiki | 92 | 100 | 84 | 0 | 0 | 85 |
ruwiki | 73 | 89 | 69 | 4 | 0 | 250 |
overall | 64 | 86 | 57 | 14 | 0.07 | 2,120 |
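As a quick consistency check on the table above, the per-wiki "total rated suggestions" values can be summed and compared with the overall row. The dictionary below simply copies the totals from the table; the variable names are illustrative only.

```python
# Per-wiki "total rated suggestions", copied from the results table.
totals = {
    "arwiki": 511,
    "bnwiki": 204,
    "cswiki": 128,
    "enwiki": 75,
    "eswiki": 549,
    "frwiki": 3,
    "idwiki": 315,
    "ptwiki": 85,
    "ruwiki": 250,
}

overall_total = sum(totals.values())
print(overall_total)  # matches the 2,120 reported in the overall row
```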