Image Classification Research and Development
Open, Needs Triage, Public

Assigned To

Authored By

	Miriam
	Feb 6 2019, 1:44 PM

Description

This is the master task gathering our efforts towards developing in-house image classification models to be used across the organization.
It includes tasks on estimating data size and access, resource availability, model development and product applications.

Image data: @Miriam + @Gilles to work on estimating the size of Commons image corpus at different resolutions. T215250
GPUs: @elukey + @EBernhardson to work on connecting the GPU to stat1005 when time allows; Miriam will test GPU models afterwards. It was suggested that GPUs would also be useful for others in Research (e.g. @diego and @Isaac) and in Search who work on text analysis. See the GPU task: T148843
Evaluating existing classifiers: This is a short-term effort towards developing our own classification models. The Research team will work on a protocol for evaluating generalisability and biases of existing image classifiers that SDC (@Ramsey-WMF @Abit @dr0ptp4kt @Cparle) or others (@MusikAnimal) might want to use, based on diverse image sets from Wikidata/Commons. The Research team will also help with the integration between Wikidata items and the labels from existing image classifiers.
Longer-term: Training our own image classifiers: The longer-term plan, once data and processing units are available, is to train our own image classifiers for various purposes: object detection, adult image filtering, image quality, image authenticity, etc. T331134

Related Objects

Status	Assigned	Task
Open	Miriam	T215413 Image Classification Research and Development
Resolved	elukey	T148843 Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models
Declined	• Cmjohnson	T151080 check stat1004 (or another identical R430) for PCIe expansion space
Resolved	RobH	T159838 EQIAD: stat1002 replacement
		Unknown Object (Task)
Resolved	Ottomata	T165368 rack/setup/install replacement to stat1005 (stat1002 replacement)
Declined	None	T151904 User limits for stat machines. Limit space on /home dir and possibly /tmp
Resolved	elukey	T216226 GPU upgrade for stat1005
Resolved	• Cmjohnson	T216528 confirm gpu form factor in stat1005
		Unknown Object (Task)
Resolved	• Cmjohnson	T219522 install new GPU in stat1005
Resolved	elukey	T220784 Investigate if a Prometheus exporter for the AMD GPU(s) can be easily created
Resolved	Miriam	T221761 Test GPUs with an end-to-end training task (Photo vs Graphics image classifier)
Declined	• Gilles	T220811 Test Thumbor OpenCL smart cropping on stat1005
Resolved	jijiki	T221562 Build Thumbor packages for buster
Resolved	elukey	T224723 Import AMD rocm packages in wikimedia-buster
Resolved	• Gilles	T215250 Estimate size of Commons image corpus at given resolution
Resolved	Miriam	T221934 Visualize Wiki Commons Images
Invalid	Miriam	T228441 Design a pipeline for image classification
Resolved	Miriam	T242229 Test the feasibility of a classifier trained on Commons categories
Resolved	Miriam	T242969 A list of meaningful Commons Categories whose images can be used to train image classifiers
Resolved	Miriam	T242970 A set of prototypes of image classifiers trained on images from Commons Categories
Resolved	Miriam	T242971 A report on accuracy and performance of the classification models
Declined	Miriam	T248692 Train image classifiers based on Commons Categories from scratch.
Resolved	Miriam	T250150 Improve prototypes of image classifiers trained on images from Commons Categories
Resolved	AikoChou	T276407 An End-to-End Image Classification Pipeline
Open	tizianopiccardi	T331134 A Generic Topic Classifier for Images on Commons
Open	Miriam	T341878 Explore the feasability of an alt text model using existing captioning methods

Event Timeline

Miriam created this task. Feb 6 2019, 1:44 PM

elukey added a project: Analytics. Feb 6 2019, 1:47 PM

dcausse added a project: Discovery-Search. Feb 6 2019, 2:08 PM

dcausse moved this task from needs triage to watching / waiting on the Discovery-Search board.

Krenair subscribed. Feb 6 2019, 2:09 PM

fgiunchedi awarded a token. Feb 6 2019, 4:49 PM

• Ramsey-WMF added projects: Multimedia, SDC General. Feb 6 2019, 8:17 PM

Restricted Application added a project: Wikidata. Feb 6 2019, 8:17 PM

• Ramsey-WMF moved this task from Untriaged to Tracking on the Multimedia board. Feb 6 2019, 8:17 PM

• Ramsey-WMF moved this task from Backlog to External tools on the SDC General board.

• PDrouin-WMF subscribed. Feb 6 2019, 8:19 PM

leila awarded a token. Feb 6 2019, 9:23 PM

One image classifier worth building, and one that probably wouldn't run into the politically challenging bias issues, is a classifier that determines whether an image is a photograph or not. We have a long-standing limitation: we only visually optimise thumbnails (slight sharpening) for JPGs, because that is the only file type that is mostly photographs. This leaves thumbnails of photographs uploaded as PNG or TIFF visually flat. See T192744 for some context.

An image classifier that can tell photographs apart from diagrams, maps, schematics, etc. would be quite useful for the visual quality of the thumbnails we render. It could either be inserted directly into our thumbnailing pipeline at the time thumbnails are rendered, or it could tag images with structured data that would inform the thumbnailing process (which would also allow humans to override the decision made by the classifier).
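To make the second option concrete, here is a minimal sketch of how a thumbnailing step might decide whether to sharpen. It is not the actual Thumbor or MediaWiki logic: the `ImageInfo` fields, the `should_sharpen` function and the 0.5 threshold are all hypothetical names and values chosen for illustration. The idea is only that a human-set structured-data value wins over the classifier, and the current JPG-only rule remains as the fallback.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed cut-off for the (hypothetical) classifier score; not a tuned value.
PHOTO_THRESHOLD = 0.5


@dataclass
class ImageInfo:
    """Hypothetical per-image metadata available to the thumbnailer."""
    file_type: str                                   # e.g. "jpg", "png", "tiff"
    classifier_photo_score: Optional[float] = None   # assumed score in [0, 1]
    human_is_photo: Optional[bool] = None            # assumed structured-data override


def should_sharpen(info: ImageInfo) -> bool:
    """Decide whether to apply the slight sharpening used for photographs."""
    # 1. A human decision stored as structured data overrides everything else.
    if info.human_is_photo is not None:
        return info.human_is_photo
    # 2. Otherwise trust the classifier, if it has scored this image.
    if info.classifier_photo_score is not None:
        return info.classifier_photo_score >= PHOTO_THRESHOLD
    # 3. Fall back to the existing heuristic: only JPGs are assumed to be photos.
    return info.file_type.lower() in {"jpg", "jpeg"}
```

For example, `should_sharpen(ImageInfo("png", classifier_photo_score=0.9))` would sharpen a PNG photograph, while adding `human_is_photo=False` to the same image would turn sharpening off again.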

• Mholloway subscribed. Feb 7 2019, 4:56 PM

If we go down the path of trying to identify which images are photographs, we should look into work by a former colleague of mine on detecting visualizations on Commons (in some ways, the inverse task): http://brenthecht.com/publications/www18_vizbywiki.pdf

He (Allen Lin) might have some insight into some easy wins or pitfalls in building a model like that.

• fdans moved this task from Incoming to Radar on the Analytics board. Feb 7 2019, 6:10 PM

• iamjessklein awarded a token. Feb 7 2019, 6:25 PM

dr0ptp4kt added a project: Reading-Admin. Feb 8 2019, 3:24 PM

dr0ptp4kt moved this task from Backlog to Adam Radar on the Reading-Admin board.

@Gilles thanks for this! Photographs and graphics have very different underlying image statistics, so it is fairly easy for a classifier to tell them apart. It should be feasible.

If we can collect some training data, by finding one or more categories in Commons with a substantial number of diverse graphics images, I can try to quickly build a graphics vs. photo classifier by fine-tuning an existing image classifier (it won't be perfect, but no GPU needed ;) ). @Isaac, maybe your colleague can help with this by sharing which categories and keywords he used to create his training data?

As a side note, such a classifier can also help improve the accuracy of other image classifiers (e.g. object detectors or image quality classifiers), which are typically trained on photographic material and therefore fail completely when classifying non-photographic images.
We did studies in the past to quantitatively explain the importance and the nature of the difference between graphics and photographs: https://www.dropbox.com/s/y97h8kjx84hbrzk/p242-redi.pdf?dl=0
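As a toy illustration of what "different image statistics" can mean, the sketch below compares one very simple statistic, the ratio of distinct colours to pixels, on two synthetic pixel lists. This is not the method from the paper linked above, and the synthetic data (a small flat palette versus independent random RGB values) is only an assumed stand-in for real graphics and photographs. The function names and the 0.05 threshold are hypothetical.

```python
import random


def distinct_color_ratio(pixels):
    """Number of distinct RGB values divided by the number of pixels."""
    return len(set(pixels)) / len(pixels)


def synthetic_graphic(n_pixels, palette_size=4, seed=0):
    """Stand-in for a diagram or map: every pixel is drawn from a tiny palette."""
    rng = random.Random(seed)
    palette = [
        (rng.randrange(256), rng.randrange(256), rng.randrange(256))
        for _ in range(palette_size)
    ]
    return [rng.choice(palette) for _ in range(n_pixels)]


def synthetic_photo(n_pixels, seed=0):
    """Crude stand-in for a photograph: each pixel gets its own random colour."""
    rng = random.Random(seed)
    return [
        (rng.randrange(256), rng.randrange(256), rng.randrange(256))
        for _ in range(n_pixels)
    ]


def looks_like_graphic(pixels, threshold=0.05):
    """Assumed rule of thumb: few distinct colours suggests a graphic."""
    return distinct_color_ratio(pixels) < threshold


graphic = synthetic_graphic(10_000)
photo = synthetic_photo(10_000)
print("graphic ratio:", distinct_color_ratio(graphic))
print("photo ratio:  ", distinct_color_ratio(photo))
```

Real photographs are of course not random noise, and real graphics can contain gradients and anti-aliasing, so a single hand-picked statistic like this would only be a starting feature; a fine-tuned classifier as proposed above would learn much richer cues.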

Fuzheado subscribed. Feb 10 2019, 8:51 PM

• SandraF_WMF subscribed. Feb 11 2019, 9:10 AM

• SandraF_WMF awarded a token. Feb 11 2019, 10:34 AM

akosiaris subscribed. Feb 11 2019, 5:48 PM

CDanis subscribed. Feb 11 2019, 5:48 PM

• Mholloway unsubscribed. Feb 11 2019, 7:35 PM

MoritzMuehlenhoff subscribed. Feb 12 2019, 2:42 PM

Miriam added a subtask: T148843: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models. Feb 12 2019, 4:41 PM

Miriam added a subtask: T215250: Estimate size of Commons image corpus at given resolution.

FYI, some developments in the area of using image classification in the Wikiverse:

We now have a Wikidata Distributed Game, "Depicts", that uses image classification ML to generate candidates. It came out of a project I did with The Met Museum and Microsoft.

https://outreach.wikimedia.org/wiki/GLAM/Newsletter/January_2019/Contents/USA_report

elukey changed the status of subtask T148843: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models from Open to Stalled. Mar 28 2019, 9:23 AM

elukey changed the status of subtask T148843: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models from Stalled to Open. Apr 2 2019, 5:17 PM

Miriam mentioned this in T221761: Test GPUs with an end-to-end training task (Photo vs Graphics image classifier). Apr 24 2019, 11:16 AM

Cirdan subscribed. Apr 24 2019, 3:03 PM

Jheald subscribed. May 3 2019, 10:46 PM

Not sure if this is relevant, but this seemed the best place to note it.

I just came across:
https://github.com/yahoo/TensorFlowOnSpark/wiki/GetStarted_YARN
It seems relatively easy to package up (e.g. on a notebook host), ship to HDFS, and then include in a Spark job.

• Mholloway subscribed. Jun 15 2019, 2:50 PM

Miriam moved this task from Backlog to In Progress on the Research board. Jul 11 2019, 3:33 PM

Miriam added a subtask: T228441: Design a pipeline for image classification. Jul 18 2019, 3:39 PM

• Gilles closed subtask T215250: Estimate size of Commons image corpus at given resolution as Resolved. Sep 25 2019, 1:50 PM

• Nuria closed subtask T148843: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models as Resolved. Oct 10 2019, 3:24 PM

• Nuria updated the task description. Oct 21 2019, 2:36 AM

Miriam closed subtask T221934: Visualize Wiki Commons Images as Resolved. Jan 2 2020, 3:26 PM

Capankajsmilyo subscribed. Jan 2 2020, 5:58 PM

Harej unsubscribed. Feb 11 2020, 1:08 AM

Aklapper edited projects, added Analytics-Radar; removed Analytics. Jun 10 2020, 6:33 AM

Miriam added a subtask: T276407: An End-to-End Image Classification Pipeline. Mar 4 2021, 9:44 AM

Miriam updated the task description. Mar 4 2021, 9:49 AM