
[SPIKE] Image classifier prototype
Closed, Resolved · Public

Description

Train an image classifier to identify classes of images that are top candidates for deletion.
According to T340546: [XL] Analysis of deletion requests on Commons (see the final Viable reasons frequency section), 13.4% of all deletions fall into four classes:

  1. logos
  2. books
  3. screenshots
  4. album covers

The initial direction is to:

  • take a pre-trained EfficientNet V2 model
  • gather a dataset from Wikipedias (fair-use images) and/or Commons (free ones) with roughly 10k samples per class
  • fine-tune the model on our 4 classes with a train/validation dataset split
  • evaluate against a separate dataset of available images (class to be extracted from Commons categories)
  • evaluate against a separate dataset of deleted images (class to be extracted from their reason for deletion)
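The fine-tuning step above can be sketched with tf.keras. The specific EfficientNetV2 variant, input size, optimizer, and dropout are assumptions, not the task's actual configuration; `weights=None` keeps the sketch runnable offline, whereas the pre-trained setup described above would pass `weights="imagenet"`.

```python
import tensorflow as tf

NUM_CLASSES = 4  # album, book, logo, screenshot

# Backbone: EfficientNetV2B0 is chosen here only to keep the sketch small;
# any EfficientNetV2 variant follows the same pattern.
# weights=None avoids a checkpoint download; use weights="imagenet" in practice.
base = tf.keras.applications.EfficientNetV2B0(
    include_top=False, weights=None, pooling="avg", input_shape=(224, 224, 3)
)
base.trainable = False  # freeze the backbone for the first fine-tuning phase

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",  # one-hot labels assumed
    metrics=[
        "accuracy",
        tf.keras.metrics.AUC(curve="PR", name="auc_pr"),
        tf.keras.metrics.AUC(curve="ROC", name="auc_roc"),
    ],
)

# Training would then be e.g.:
# model.fit(train_ds, validation_data=val_ds, epochs=25)
```

The compiled metrics mirror the ones reported in the Results section (accuracy, AUC precision/recall, AUC ROC, loss).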

Results

Evaluation metrics
Legend
  • all scores are percentages
  • best performances in bold
  • numbers in round brackets are the training epochs that obtained the best scores. Stars denote the same epoch for all metrics.
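For reference on how the AUC ROC column reads: it equals the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. A tiny pairwise (Mann-Whitney) sketch of that definition, fine for small sample counts (real evaluations use library implementations):

```python
def auc_roc(scores, labels):
    """AUC ROC via the pairwise definition: the fraction of positive/negative
    pairs where the positive sample scores higher (ties count as 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A classifier that ranks every positive above every negative gets AUC 1.0:
assert auc_roc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]) == 1.0
```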
Dataset: available images

Source: Commons images with class categories

| class | accuracy | AUC precision/recall | AUC ROC | loss | # samples |
|---|---|---|---|---|---|
| album | 91.6 (4) | 97.7 (19) | 97.8 (19) | 21.7 (4) | 29,951 |
| book | 80.5 (25) | 88.2 (12) | 88.5 (22) | 46.8 (3) | 10,995 |
| logo | 96.9 (8*) | 98.8 | 99 | 10.2 | 47,976 |
| screenshot | 90.5 (17*) | 96 | 96.4 | 24.3 | 53,172 |
Dataset: deleted images

Source: T350020: Access request to deleted image files in the production Swift cluster

| class | accuracy | AUC precision/recall | AUC ROC | loss | # samples |
|---|---|---|---|---|---|
| album | 73.2 (16*) | 79.2 | 80.2 | 65.1 | 1,292 |
| book | 64.5 (7) | 68.2 (4) | 69 (4) | 79.7 (4) | 4,882 |
| logo | 87.5 (5) | 90 (11) | 91.6 (11) | 47.9 (13) | 21,020 |
| screenshot | 62.7 (6) | 67.2 (7) | 68.9 (7) | 77.2 (1) | 4,740 |

Observations

  • the logo classifier is clearly the best one
  • all performances decrease on the deleted images dataset. Based on manual checks, that dataset looks noisier than the available images one. This is possibly caused by:
    • the image extraction method, i.e., reason-for-deletion text vs. Commons categories
    • the randomness of non-class samples, which seems to have penalized the classifiers

Code

Event Timeline

mfossati moved this task from Incoming to Doing on the Structured-Data-Backlog (Current Work) board.

Very first code snippet that classifies an image according to ImageNet classes (not yet our use case).

mfossati changed the task status from Open to In Progress. Dec 15 2023, 12:18 PM
mfossati claimed this task.

Script that gathers an image dataset given a PetScan ID.
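The gathering script itself is not shown here; below is a hedged sketch of the two URL-building pieces such a script would likely need. PetScan's `psid`/`format=json` parameters and the Commons `Special:FilePath` redirect are real features, but the example width and the exact response-parsing details are left out as assumptions.

```python
from urllib.parse import quote

PETSCAN_API = "https://petscan.wmflabs.org/"

def petscan_json_url(psid: int) -> str:
    """URL that re-runs a saved PetScan query (by its PSID) and returns JSON."""
    return f"{PETSCAN_API}?psid={psid}&format=json&doit="

def commons_file_url(file_title: str, width: int = 512) -> str:
    """Direct download URL for a Commons file via the Special:FilePath redirect."""
    name = file_title.removeprefix("File:")
    return (
        "https://commons.wikimedia.org/wiki/Special:FilePath/"
        f"{quote(name)}?width={width}"
    )

# Example: a downsized thumbnail URL for a (hypothetical) file title.
url = commons_file_url("File:Example logo.png")
```

A fetch loop over the titles returned by the PetScan query would then download each file into a per-class directory.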

First attempt to fine-tune a pre-trained EfficientNet model.

Binary image classifiers seem to work better than a single multiclass one.
The main hypothesis is that injecting out-of-domain data worsens the overall performance.
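Switching from one multiclass model to per-class binary classifiers is mostly a change of head and loss. A hedged sketch under the same assumptions as before (EfficientNetV2B0 and an untrained backbone are only for self-containedness):

```python
import tensorflow as tf

def make_binary_classifier() -> tf.keras.Model:
    """One-vs-rest classifier for a single class (e.g. logo / not-logo)."""
    base = tf.keras.applications.EfficientNetV2B0(
        include_top=False, weights=None, pooling="avg", input_shape=(224, 224, 3)
    )
    base.trainable = False
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Dense(1, activation="sigmoid"),  # one score in [0, 1]
    ])
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()],
    )
    return model

# One independent model per class, each trained on class vs. non-class samples:
# classifiers = {c: make_binary_classifier()
#                for c in ("album", "book", "logo", "screenshot")}
```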

We report below the best performances of each binary classifier.

| class | max precision | max recall | # test samples |
|---|---|---|---|
| album | 99 | 85 | 17,129 |
| book | 97 | 94 | 7,429 |
| logo | 99.9 | 97 | 23,866 |
| screenshot | 99 | 88 | 28,782 |
NOTE: while waiting for T350020, we gathered test datasets from available Commons and English Wikipedia images. Test samples don't occur in the training sets.
NOTE: all test datasets include the same set of 1,992 randomly sampled out-of-domain images. As a result, all datasets are imbalanced.
NOTE: the prediction score threshold is 0.5.
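As a concrete reading of the threshold note: a prediction counts as positive when its score reaches 0.5 (whether the comparison is strict or not was not stated; ≥ is assumed here), and precision/recall follow from the resulting confusion counts. A pure-Python sketch:

```python
def precision_recall(scores, labels, threshold=0.5):
    """Precision and recall of binary predictions at a fixed score threshold."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum(not p and l for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example: 3 samples predicted positive, 2 of them correct,
# and 1 true positive missed below the threshold.
p, r = precision_recall([0.9, 0.8, 0.6, 0.4, 0.2], [1, 0, 1, 1, 0])
# p == r == 2/3
```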
mfossati updated the task description.