An End-to-End Image Classification Pipeline
Closed, ResolvedPublic

Assigned To

Authored By

	AikoChou
	Mar 4 2021, 1:08 AM

Description

Synopsis

The Wikimedia Foundation Research team has been working on projects related to computer vision tools, such as prototypes of image classifiers trained on Commons categories. The research project aimed to develop prototypes to evaluate the feasibility of developing in-house computer vision tools to support future platform evolution.

We would now like to expand the work on "classifying images based on Commons categories" into our pipeline, so that we can easily experiment with classified images in different ways. This project would be similar to what Aiko did during her Outreachy internship (T233707), but at a much larger scale.

Project objective

Build an end-to-end image classification pipeline based on Commons categories

Possible Workflow

Start from a list of image names with labels or Commons categories from which we will infer the labels (e.g. ‘Quality_Images’)
- define on schema
On hadoop: retrieve image bytes by joining image / category list with image bytes table
- [for later] potentially add more metadata from images
Import image bytes on the local machine
- Which format
- Convert directly to TFRecords
Train a model using Tensorflow / Keras
- Finetune an existing model, e.g. Inception V4
- [for later] Train a model from scratch
On hadoop: retrieve the set of images to classify
Run inference on images
- Locally on a stat machine
- [for later] Distributed on hadoop - issue is how to have Keras loaded by all nodes
Save results back to Hive / parquet
- From a file imported from Local
- Directly from the distributed job
Generate a pipeline for all the above using Kubeflow
Think of how to make the data public
- Discussions with community members

We start with a micro pipeline and then iterate to improve/scale up the pipeline:

Select a small set of images from 2 subcategories in the category
Load the data on the local machine
Finetune a model on Keras to classify images
Run model inference on a sample of random images
Put the results back on hdfs
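
As a minimal sketch of the first step (inferring labels from Commons categories for a two-class micro pipeline), the snippet below maps category names to integer labels. The helper names and the image file names are hypothetical, chosen only for illustration; the two categories are the ones mentioned later in this task.

```python
from typing import Dict, Iterable, List, Tuple


def build_label_index(categories: Iterable[str]) -> Dict[str, int]:
    """Map each distinct category name to a stable integer label."""
    return {name: idx for idx, name in enumerate(sorted(set(categories)))}


def label_images(
    rows: Iterable[Tuple[str, str]], label_index: Dict[str, int]
) -> List[Tuple[str, int]]:
    """Attach an integer label to every (image_name, category) row."""
    return [(image, label_index[category]) for image, category in rows]


# Hypothetical rows; the file names are made up for illustration.
rows = [
    ("Example_sculpture_01.jpg", "sculptures"),
    ("Example_maiolica_01.jpg", "maiolica"),
    ("Example_sculpture_02.jpg", "sculptures"),
]
label_index = build_label_index(category for _, category in rows)
labeled = label_images(rows, label_index)
```

Sorting the category names before numbering them keeps the label assignment the same across runs, which matters when the labels are later written to TFRecords and read back for inference.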

Related Objects

Status	Assigned	Task
Resolved	CBogen	T254768 [EPIC] Image recommendations proof-of-concept phase
Resolved	Miriam	T256081 Image matching algorithm
Open	Miriam	T215413 Image Classification Research and Development
Resolved	AikoChou	T276407 An End-to-End Image Classification Pipeline

Event Timeline

AikoChou created this task.Mar 4 2021, 1:08 AM

Restricted Application added a subscriber: Aklapper. · Mar 4 2021, 1:08 AM

Summary of the work done so far:

Imported the image data on local and saved to TFRecords files
Finetuned an Xception model to classify images between 'sculptures' and 'maiolica'
Ran inference on test data on local

On-going:

Trying to run distributed model inference using Keras, based on this doc

AikoChou updated the task description. Mar 4 2021, 1:35 AM

Hi @AikoChou, thanks for looking into this! Wondering which project tag this task should have, so that others could also spot this task (and be aware of it, as it's sometimes easy in Wikimedia to work on the same things without knowing) when looking at project workboards... any idea? :) Is this SDC General? Does this already touch machine learning? Is T76886: Investigate computer vision image classification and description tools for shadow tags and search descriptions somehow related? Sorry for my cluelessness!

Miriam added a parent task: T215413: Image Classification Research and Development.Mar 4 2021, 9:44 AM

Miriam triaged this task as Medium priority.Mar 4 2021, 9:47 AM

Miriam added projects: Research, MachineVision.

Miriam added subscribers: fkaelin, elukey.

Restricted Application added a project: Structured-Data-Backlog. · Mar 4 2021, 9:47 AM

Hi @Aklapper, we added a few tags. This is mainly at a research stage for now, so I included it in the Research board and linked it to the parent epic task for image classification studies. Thanks for the heads up!

CBogen moved this task from Triage to Tracking on the Structured-Data-Backlog board.Mar 8 2021, 5:34 PM

Miriam mentioned this in T276791: Configure the Hadoop cluster to use the GPUs available on some workers.Mar 11 2021, 3:44 PM

Miriam added a parent task: T256081: Image matching algorithm.Mar 19 2021, 6:33 PM

Miriam edited projects, added Research (FY2020-21-Research-January-March); removed Research.

Weekly updates:

After converting a toy dataset of images into TFRecords, @AikoChou trained a small model on stat1008 for object classification. We were then able to successfully run model inference on Hadoop.
We worked on improving the modeling part: @AikoChou has been training and evaluating different models for image quality classification using Keras.

Hi @elukey,

I want to use tf-yarn to train a simple model on the cluster, but I found that some environment variables need to be set up, as described in this doc:

JAVA_HOME: /usr/bin/java
HADOOP_HDFS_HOME: /usr/bin/hdfs

I'm not sure what to set for:

LD_LIBRARY_PATH (include the path to libjvm.so and, optionally, the path to libhdfs.so): I can't find libjvm.so
CLASSPATH (Hadoop jars must be added to the class path)
KRB5CCNAME (Path of Kerberos ticket cache if the Hadoop cluster is in secure mode)

Could you help me out with this? Thanks!

Hi!

So these should be the values that you need:

/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
/usr/lib/hadoop/bin/hadoop classpath --glob

For KRB5CCNAME, in theory /tmp/krb5cc_$(id -u) should be sufficient.

Let me know how it goes!
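
For reference, the values above could be assembled in Python roughly as follows. This is a sketch under assumptions, not a tested tf-yarn setup: the helper names are hypothetical, LD_LIBRARY_PATH is assumed to want the directory that contains libjvm.so rather than the file itself, and CLASSPATH is assumed to be the output of the `hadoop classpath --glob` command.

```python
import os
import subprocess

# Paths as given in the reply above.
LIBJVM_PATH = "/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so"
HADOOP_BIN = "/usr/lib/hadoop/bin/hadoop"


def hadoop_classpath(hadoop_bin: str = HADOOP_BIN) -> str:
    """Run `hadoop classpath --glob` and return the expanded jar list."""
    result = subprocess.run(
        [hadoop_bin, "classpath", "--glob"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def build_env(classpath: str, uid: int, libjvm_path: str = LIBJVM_PATH) -> dict:
    """Build the three variables discussed in this thread (hypothetical helper)."""
    return {
        # Assumption: the loader needs the directory holding libjvm.so.
        "LD_LIBRARY_PATH": os.path.dirname(libjvm_path),
        "CLASSPATH": classpath,
        # Default Kerberos ticket cache location for the given user id.
        "KRB5CCNAME": f"/tmp/krb5cc_{uid}",
    }


# On a cluster client one might then call (not executed here):
#   os.environ.update(build_env(hadoop_classpath(), os.getuid()))
```

Keeping the classpath lookup separate from `build_env` means the dictionary can be checked on a machine without Hadoop installed.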

leila moved this task from FY2020-21-Research-January-March to FY2020-21-Research-April-June on the Research board.Apr 29 2021, 1:01 AM

leila edited projects, added Research (FY2020-21-Research-April-June); removed Research (FY2020-21-Research-January-March).

Weekly update:

We solved the system errors that occurred when training models on GPU-labeled nodes in the cluster. Currently, we're running parameter experiments to verify the functionality and compare the results on CPU and GPU!

elukey awarded a token.May 3 2021, 10:29 AM

Weekly update:
We were able to get similar results across CPU and GPU computation. One major issue is that the worker dispatching weights to the GPU worker (the Parameter Server) is overloaded and saturates the network. We are investigating ways to reduce or redistribute this load.

Weekly updates:
We solved the problem of the Parameter Server being the bottleneck for computation, for now, by increasing the batch size used for training. In the meantime, we are running several experiments to understand the difference between the Keras Model API and the TensorFlow Estimator, as prediction accuracy seems to be much lower with the Estimator, i.e. the API we need to use for large-scale distributed training.
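
A rough way to see why a larger batch size can relieve the Parameter Server: assuming weights are exchanged once per training step (an assumption about the setup, not something measured in this task), fewer steps per epoch means fewer exchanges over the network. The dataset size and batch sizes below are hypothetical.

```python
import math


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of training steps needed to see every example once."""
    return math.ceil(num_examples / batch_size)


# Hypothetical dataset of 10,000 images.
NUM_EXAMPLES = 10_000
small_batch_steps = steps_per_epoch(NUM_EXAMPLES, 32)
large_batch_steps = steps_per_epoch(NUM_EXAMPLES, 256)

print(small_batch_steps)  # 313 weight exchanges per epoch under the assumption
print(large_batch_steps)  # 40 weight exchanges per epoch under the assumption
```

Under that assumption, going from a batch size of 32 to 256 cuts the number of exchanges per epoch by roughly a factor of eight, although each step then does more computation on the worker.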

Ladsgroup awarded a token.May 18 2021, 5:39 AM

Weekly updates:
We wrote documentation of the distributed image inference workflow in the GitHub repo and provided three tasks as examples: image quality inference, face detection, and ResNet feature extraction. With regard to distributed training using tf-yarn, we are looking for an alternative to wrap a Keras model in Estimator to solve the accuracy issue.

Weekly updates:
We are still trying to understand why performance drops so much after moving from Keras to TF.Estimator. The investigation is ongoing and we are getting to the bottom of it.
There are different variables we are looking at:
(1) How the input data is formatted
(2) Whether the model is pre-trained or not
(3) The function used to transform the Keras model to an Estimator
Getting there!

Weekly updates:
We confirmed that (1) how the input data is formatted and (3) the function used to transform the Keras model to an Estimator are not the cause of the Estimator's poor performance: a CNN model that we trained from scratch reaches the same performance in both Keras and Estimator.

Now we are looking into whether using a pre-trained model is what causes the poor performance, and how to solve it. We also want to make sure the model learns similar things when trained with Keras and with Estimator.

Miriam mentioned this in T215413: Image Classification Research and Development.Jun 18 2021, 11:11 AM

Weekly updates:

We found out that the low performance of TF.Estimator was because it didn't load the pretrained weights properly. We solved the problem by using a snapshot of the untrained model with pretrained weights as a warm start. We were then able to get similar results with Keras and TF.Estimator. We are wrapping things up and writing documentation 😃

Weekly updates:

While wrapping up documentation, we are doing a couple of final tests:

- measuring performance in terms of time/accuracy of the GPU on the cluster vs. the GPU on a stat machine
- concurrent training on GPUs on the cluster

The first one gives very promising results. The second one will probably be listed as a limitation of the current system: due to the YARN labeling schema, having more than one job running on the same GPU node seems hard.

Weekly update:

We refactored and parameterized the code so that it is easier for others to use the pipeline to train their own image classifiers. The code has been uploaded to a GitHub repository, and a usage guide is provided. More documentation about our previous experiments will be uploaded soon.

leila moved this task from FY2020-21-Research-April-June to FY2021-22-Research-Oct-Dec on the Research board.Oct 20 2021, 10:04 PM

leila edited projects, added Research (FY2021-22-Research-Oct-Dec); removed Research (FY2020-21-Research-April-June).

leila moved this task from FY2021-22-Research-Oct-Dec to FY2021-22-Research-April-June on the Research board.Apr 8 2022, 2:34 AM

leila edited projects, added Research (FY2021-22-Research-April-June); removed Research (FY2021-22-Research-Oct-Dec).

An implementation of this could be useful inside the MachineVision extension, to supplement API-based image tagging with a homebrewed model specifically for Commons :)

leila moved this task from FY2021-22-Research-April-June to In Progress on the Research board.Aug 26 2022, 7:31 PM

leila edited projects, added Research; removed Research (FY2021-22-Research-April-June).

Miriam closed this task as Resolved.Jun 29 2023, 11:10 AM
