Train an image classifier to identify classes of images that are top candidates for deletion.
According to T340546: [XL] Analysis of deletion requests on Commons (see the final "Viable reasons frequency" section), 13.4% of all deletions fall into one of four classes:
- logos
- books
- screenshots
- album covers
The initial direction is to:
- take a pre-trained EfficientNet V2 model
- gather a dataset from Wikipedias (fair-use images) and/or Commons (freely licensed ones), with roughly 10k samples per class
- fine-tune the model on our 4 classes with a train/validation split (a sketch is given after this list)
- evaluate against a separate dataset of available images (class labels extracted from Commons categories)
- evaluate against a separate dataset of deleted images (class labels extracted from their deletion reasons)
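A minimal sketch of the fine-tuning step, assuming Keras/TensorFlow, a class-per-folder dataset layout, and illustrative hyperparameters (none of these reflect the actual training configuration):

```python
import tensorflow as tf

CLASSES = ["album", "book", "logo", "screenshot"]
IMG_SIZE = (384, 384)  # assumed input size; EfficientNetV2-S defaults to 384x384

# Train/validation split read from a hypothetical images/<class>/ layout.
train_ds, val_ds = tf.keras.utils.image_dataset_from_directory(
    "images/",
    label_mode="categorical",  # matches the categorical cross-entropy loss
    class_names=CLASSES,
    validation_split=0.2,
    subset="both",
    seed=42,
    image_size=IMG_SIZE,
    batch_size=32,
)

# Pre-trained backbone, ImageNet head removed; input rescaling is built in.
backbone = tf.keras.applications.EfficientNetV2S(
    include_top=False, weights="imagenet", pooling="avg"
)
backbone.trainable = False  # first phase: train only the new head

model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(len(CLASSES), activation="softmax"),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
model.fit(train_ds, validation_data=val_ds, epochs=25)
```

A common second phase unfreezes the backbone at a much lower learning rate; it is omitted here for brevity.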
Results
Evaluation metrics
- accuracy
- area under the curve (AUC), computed separately for each class (one-vs-rest) and then macro-averaged across classes:
  - AUC precision/recall
  - AUC ROC
- model's loss function, i.e., categorical cross-entropy
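These metrics map directly onto Keras built-ins; a minimal sketch, assuming the setup above (metric names and num_labels=4 are illustrative):

```python
import tensorflow as tf

# With multi_label=True, Keras computes one AUC per class (one-vs-rest)
# and macro-averages them, matching the per-class-then-averaged description.
metrics = [
    tf.keras.metrics.CategoricalAccuracy(name="accuracy"),
    tf.keras.metrics.AUC(curve="PR", multi_label=True, num_labels=4, name="auc_pr"),
    tf.keras.metrics.AUC(curve="ROC", multi_label=True, num_labels=4, name="auc_roc"),
]
loss = tf.keras.losses.CategoricalCrossentropy()

# model.compile(optimizer="adam", loss=loss, metrics=metrics)
```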
Legend
- all scores are percentages
- best performances are in bold
- numbers in round brackets are the training epochs that achieved the best scores; a starred epoch was the best one for all metrics in its row, so it is shown only once
Dataset: available images
Source: Commons images with class categories
| class | accuracy | AUC precision/recall | AUC ROC | loss | # samples |
|------------|---------------|----------------------|--------------|--------------|--------|
| album | 91.6 (4) | 97.7 (19) | 97.8 (19) | 21.7 (4) | 29,951 |
| book | 80.5 (25) | 88.2 (12) | 88.5 (22) | 46.8 (3) | 10,995 |
| logo | **96.9** (8*) | **98.8** | **99.0** | **10.2** | 47,976 |
| screenshot | 90.5 (17*) | 96.0 | 96.4 | 24.3 | 53,172 |
Dataset: deleted images
Source: T350020: Access request to deleted image files in the production Swift cluster
| class | accuracy | AUC precision/recall | AUC ROC | loss | # samples |
|------------|---------------|----------------------|---------------|---------------|--------|
| album | 73.2 (16*) | 79.2 | 80.2 | 65.1 | 1,292 |
| book | 64.5 (7) | 68.2 (4) | 69.0 (4) | 79.7 (4) | 4,882 |
| logo | **87.5** (5) | **90.0** (11) | **91.6** (11) | **47.9** (13) | 21,020 |
| screenshot | 62.7 (6) | 67.2 (7) | 68.9 (7) | 77.2 (1) | 4,740 |
Observations
- the logo classifier is clearly the best one
- all performances drop on the deleted images dataset. Based on manual checks, that dataset looks noisier than the available images one, possibly because of:
  - the image extraction method, i.e., free-text deletion reasons vs. Commons categories
  - the randomness of non-class samples, which seems to have penalized the classifiers