[M] Another round of manual evaluation for SLIS
Closed, Resolved, Public

Description

Based on the results of the first round of section-level image suggestions (SLIS) evaluation, we decided to do more work to remove image suggestions for sections with tables and lists (T330841, T330848), for short sections (T329282), and for sections that already have an image (T330516).

We also decided to do more work to refine P18, P373, and lead image based suggestions (T330773).

After those tickets are done, this ticket is to do another round of internal and ambassador manual evaluation using https://section-image-suggestions-test.toolforge.org/ -- but this time we would like to include the number and % of clicks on "this section should not have an image".

Acceptance Criteria

| wiki | % good intersection | % good alignment | % good P18/P373/lead image | % sections that should not have an image | % offensive images | total rated suggestions |
  • "% good" should mean the % of the total rated suggestions that were rated good, *not* the % of those left after excluding "this section should not have an image" or "this image is offensive". Ratings for "this section should not have an image" and "this image is offensive" should therefore be counted in the total rated suggestions.
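
The "% good" definition above can be sketched as follows. This is a minimal illustration, not the evaluation tool's actual code: the rating labels (`good`, `bad`, `no_image`, `offensive`, `unsure`) and the function name `percent_good` are assumptions made for the example. "Unsure" ratings are left out of the denominator, in line with the note under Results.

```python
from collections import Counter

# Hypothetical rating labels; the evaluation tool may name them differently.
RATED = {"good", "bad", "no_image", "offensive"}


def percent_good(ratings):
    """Return the % of rated suggestions that were marked good.

    The denominator counts every rated suggestion, including
    "no_image" and "offensive"; "unsure" is not counted as rated.
    """
    counts = Counter(r for r in ratings if r in RATED)
    total_rated = sum(counts.values())
    if total_rated == 0:
        return None  # nothing rated, e.g. a cell reported as N.A.
    return 100 * counts["good"] / total_rated


sample = ["good"] * 6 + ["bad"] * 2 + ["no_image", "offensive", "unsure"]
print(percent_good(sample))  # 6 good out of 10 rated -> 60.0
```

Dividing only by the good and bad ratings would give 75% for the same sample, which is the reading the acceptance criterion rules out.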

Instructions for ambassadors:

  • Evaluate 500 random section-level image suggestions across 500 random different articles, per wiki.
  • Ambassadors will need to count and keep track of how many suggestions they have evaluated in their language -- the tool will not capture that.
  • For each result for each unillustrated article, manually decide whether the match is good or bad; alternatively note if the image is offensive or if the section should not have an image. You can also choose "unsure" if you are not confident in your selection.
  • General comments or questions during evaluation can be posted as comments in this ticket.
  • The estimated time for manual evaluation is 3 hours for the 500 images. If 3 hours pass before you finish, please leave a comment and we can decide whether to continue with further evaluation.

Results

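| wiki | % good alignment | % good intersection | % good P18/P373/lead image | % sections that should not have an image | % offensive images | total rated suggestions |
|---|---|---|---|---|---|---|
| arwiki | 71 | 91 | 54 | 6 | 0.4 | 511 |
| bnwiki | 28 | 86 | 26 | 24 | 0 | 204 |
| cswiki | 41 | 77 | 23 | 13 | 0 | 128 |
| enwiki | 76 | 96 | 75 | 3 | 0 | 75 |
| eswiki | 60 | 67 | 48 | 27 | 0.2 | 549 |
| frwiki | N.A. | N.A. | 100 | N.A. | N.A. | 3 |
| idwiki | 66 | 81 | 37 | 37 | 0 | 315 |
| ptwiki | 92 | 100 | 84 | 0 | 0 | 85 |
| ruwiki | 73 | 89 | 69 | 4 | 0 | 250 |
| overall | 64 | 86 | 57 | 14 | 0.07 | 2,120 |
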
NOTE: total rated suggestions exclude those marked as unsure.
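
As a quick sanity check, the overall total should equal the sum of the per-wiki totals. The sketch below copies the "total rated suggestions" column from the Results table; the variable names are made up for the example.

```python
# Per-wiki "total rated suggestions", copied from the Results table.
totals = {
    "arwiki": 511, "bnwiki": 204, "cswiki": 128, "enwiki": 75,
    "eswiki": 549, "frwiki": 3, "idwiki": 315, "ptwiki": 85,
    "ruwiki": 250,
}

overall = sum(totals.values())
print(overall)  # 2120, matching the "overall" row
```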

Event Timeline

CBogen renamed this task from Another round of manual evaluation to calculate how many sections that shouldn't have images still generate suggestions to [M] Another round of manual evaluation to calculate how many sections that shouldn't have images still generate suggestions. Mar 22 2023, 4:24 PM
CBogen renamed this task from [M] Another round of manual evaluation to calculate how many sections that shouldn't have images still generate suggestions to [M] Another round of manual evaluation for SLIS. Mar 23 2023, 2:07 PM
CBogen updated the task description.
CBogen updated the task description.
mfossati changed the task status from Open to In Progress. Apr 17 2023, 10:01 AM
mfossati claimed this task.
mfossati moved this task from Blocked to Doing on the Structured-Data-Backlog (Current Work) board.

Unblocking this to tackle the tool change (see second AC) first.

mfossati added a subscriber: Etonkovidova.

Data and tool ready, evaluation elicited. Moving to QA for the evaluation period (@Etonkovidova, you can safely skip this ticket).

When testing the section image suggestions in Spanish, some suggestions are repeated even after I have already completed them (rated, not skipped). For example, this is one of the repeated articles/sections, and the suggested image is always the same.

@mfossati thanks for posting the results! Is there a spreadsheet of raw data I can look at too?
Am I correct in interpreting that "intersection" suggestions are both a topic and alignment fit?

> @mfossati thanks for posting the results! Is there a spreadsheet of raw data I can look at too?

@KStoller-WMF, here's a dump of the raw data: https://docs.google.com/spreadsheets/d/1wwLMbgnfXSnmrVw9PuzQqMs-Mr9vUk_kcddz5VBRzDo/edit?usp=sharing
Feel free to ping me for more detailed information about all the fields.

> Am I correct in interpreting that "intersection" suggestions are both a topic and alignment fit?

Yep!