An End-to-End Image Classification Pipeline
Open, Medium, Public



The Wikimedia Foundation Research team has been working on projects related to computer vision tools, such as prototypes of image classifiers trained on Commons categories. The research project aimed to develop prototypes to evaluate the feasibility of developing in-house computer vision tools to support future platform evolution.

We would now like to expand the work on "classifying images based on Commons categories" into a full pipeline, so that we can easily experiment with classified images in different ways. This project would be similar to what Aiko did during the Outreachy internship (T233707), but at a much larger scale.

Project objective

  • Build an end-to-end image classification pipeline based on Commons categories

Possible Workflow

  • Start from a list of image names with labels or Commons categories from which we will infer the labels (e.g. ‘Quality_Images’)
    • define a schema
  • On hadoop: retrieve image bytes by joining image / category list with image bytes table
    • [for later] potentially add more metadata from images
  • Import image bytes on the local machine
  • Train a model using Tensorflow / Keras
    • Finetune an existing model, e.g. Inception V4
    • [for later] Train a model from scratch
  • On hadoop: retrieve the set of images to classify
  • Run inference on images
    • Locally on a stat machine
    • [for later] Distributed on hadoop - issue is how to have Keras loaded by all nodes
  • Save results back to Hive / parquet
    • From a file imported from Local
    • Directly from the distributed job
  • Generate a pipeline for all the above using Kubeflow
  • Think of how to make the data public
    • Discussions with community members

We start with a micro pipeline and then iterate to improve/scale up the pipeline:

  1. Select a small set of images from 2 subcategories in the category
  2. Load the data on the local machine
  3. Finetune a model on Keras to classify images
  4. Run model inference on a sample of random images
  5. Put the results back on hdfs
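
Steps 1 and 2 of the micro pipeline can be sketched in plain Python. This is only an illustration, assuming the input is a list of (image name, category) pairs. The function name `select_balanced_subset`, the file names, and the per-category count are hypothetical and are not taken from the project code.

```python
import random
from collections import defaultdict


def select_balanced_subset(records, per_category, seed=0):
    """Pick up to `per_category` images from each category and assign integer labels.

    `records` is an iterable of (image_name, category) pairs (hypothetical format).
    Returns (samples, label_map), where samples is a list of (image_name, label).
    """
    by_category = defaultdict(list)
    for image_name, category in records:
        by_category[category].append(image_name)

    # Sort category names so the label assignment is deterministic.
    label_map = {category: idx for idx, category in enumerate(sorted(by_category))}

    rng = random.Random(seed)
    samples = []
    for category, names in sorted(by_category.items()):
        # Never ask for more images than the category actually contains.
        chosen = rng.sample(names, min(per_category, len(names)))
        samples.extend((name, label_map[category]) for name in chosen)
    rng.shuffle(samples)
    return samples, label_map


if __name__ == "__main__":
    # Made-up file names for two subcategories.
    records = [(f"Sculpture_{i}.jpg", "sculptures") for i in range(10)]
    records += [(f"Maiolica_{i}.jpg", "maiolica") for i in range(4)]
    samples, label_map = select_balanced_subset(records, per_category=3)
    print(label_map)
    print(samples)
```

Fixing the seed keeps the selection reproducible between runs, which makes it easier to compare models trained on the same small subset.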

Event Timeline

Summary of the work done so far:

  • Imported the image data on local and saved to TFRecords files
  • Finetuned an Xception model to classify images between 'sculptures' and 'maiolica'
  • Ran inference on test data on local


  • Trying to run distributed model inference using Keras, based on this doc

Hi @AikoChou, thanks for looking into this! Wondering which project tag this task should have, so that others could also spot this task (and be aware of it, as it's sometimes easy in Wikimedia to work on the same things without knowing) when looking at project workboards... any idea? :) Is this SDC General? Does this already touch machine learning? Is T76886: Investigate computer vision image classification and description tools for shadow tags and search descriptions somehow related? Sorry for my cluelessness!

Miriam triaged this task as Medium priority. Mar 4 2021, 9:47 AM
Miriam added projects: Research, MachineVision.
Miriam added subscribers: fkaelin, elukey.

Hi @Aklapper, we added a few tags. This is mainly at a Research stage for now, so I included it in the Research board and linked it to the parent epic task for image classification studies. Thanks for the heads up!

Weekly updates:

  • After converting a toy dataset of images into TFRecords, @AikoChou has trained a small model on stat1008 for object classification. We were then able to successfully run model inference on hadoop.
  • We worked on improving the modeling part: @AikoChou has worked on training and evaluating different models for image quality classification using Keras.

Hi @elukey,

I want to use tf-yarn to train a simple model on the cluster, but I found that some environment variables need to be set up, as described in this doc:

  • JAVA_HOME: /usr/bin/java
  • HADOOP_HDFS_HOME: /usr/bin/hdfs

I'm not sure what to set for:

  • LD_LIBRARY_PATH (Include the path to […] and, optionally, the path to […]). I can't find the […]
  • CLASSPATH (Hadoop jars must be added to the class path)
  • KRB5CCNAME (Path of Kerberos ticket cache if the Hadoop cluster is in secure mode)

Could you help me out with this? Thanks!


So these should be the values that you need:

/usr/lib/hadoop/bin/hadoop classpath --glob

For KRB5CCNAME, /tmp/krb5cc_$(id -u) should in theory be sufficient.

Let me know how it goes!
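
For anyone launching the job from Python, the two values above could be collected roughly as follows. This is a minimal sketch, assuming that the output of `hadoop classpath --glob` is what goes into CLASSPATH (the reply does not state this explicitly) and that the script runs on a Unix host. The helper names `hadoop_classpath` and `build_hadoop_env` are hypothetical.

```python
import os
import subprocess


def hadoop_classpath(hadoop_bin="/usr/lib/hadoop/bin/hadoop"):
    """Return the output of `hadoop classpath --glob` as a single string."""
    result = subprocess.run(
        [hadoop_bin, "classpath", "--glob"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def build_hadoop_env(classpath, uid=None, base_env=None):
    """Return a copy of the environment with CLASSPATH and KRB5CCNAME set."""
    env = dict(os.environ if base_env is None else base_env)
    uid = os.getuid() if uid is None else uid
    # Assumption: CLASSPATH takes the `hadoop classpath --glob` output verbatim.
    env["CLASSPATH"] = classpath
    # Kerberos ticket cache location suggested in the reply above.
    env["KRB5CCNAME"] = f"/tmp/krb5cc_{uid}"
    return env


if __name__ == "__main__":
    env = build_hadoop_env(hadoop_classpath())
    print(env["KRB5CCNAME"])
```

The resulting dictionary can be passed as `env=` to `subprocess.run`, or merged into `os.environ` before the training script starts.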

Weekly update:

We solved the system errors that occurred when training models on GPU-labeled nodes in the cluster. Currently, we're running parameter experiments to verify the functionality and compare the results on CPU and GPU!

Weekly update:
We were able to get similar results across CPU and GPU computation. One major issue is that the worker dispatching weights to the GPU worker (the Parameter Server) is overloaded and saturates the network. We are investigating ways to reduce or redistribute this load.

Weekly updates:
For now, we solved the problem of the Parameter Server being the bottleneck for computation by increasing the batch size used for training. In the meantime, we are running several experiments to understand the difference between the Keras Model API and the TensorFlow Estimator, as prediction accuracy seems to be much lower with the Estimator, which is the API we need to use for large-scale distributed training.
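
A back-of-envelope estimate shows why a larger batch size relieves the Parameter Server: the number of weight exchanges per epoch is the number of training steps, ceil(N / batch size), so fewer and larger steps mean fewer round trips. The sketch below uses a deliberately simplified model and hypothetical numbers. It assumes a single worker that pulls the full weights and pushes full gradients on every step, which may not match the actual setup.

```python
import math


def ps_traffic_per_epoch(num_examples, batch_size, num_params, bytes_per_param=4):
    """Estimate bytes exchanged with the parameter server in one epoch.

    Simplified model (an assumption): on every training step the worker pulls
    the full set of weights and pushes a full set of gradients.
    """
    steps = math.ceil(num_examples / batch_size)
    bytes_per_step = 2 * num_params * bytes_per_param  # pull weights + push gradients
    return steps * bytes_per_step


if __name__ == "__main__":
    # Hypothetical numbers: 100,000 images, 20 million float32 parameters.
    for batch_size in (32, 128, 512):
        gib = ps_traffic_per_epoch(100_000, batch_size, 20_000_000) / 2**30
        print(f"batch_size={batch_size:4d}: ~{gib:,.0f} GiB per epoch")
```

Under this model the traffic per epoch falls roughly in proportion to the batch size, because the amount of data exchanged per step depends only on the model size.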

Weekly updates:
We documented the distributed image inference workflow in the GitHub repo and provided three tasks as examples: image quality inference, face detection, and ResNet feature extraction. With regard to distributed training using tf-yarn, we are looking for an alternative way to wrap a Keras model in an Estimator to solve the accuracy issue.

Weekly updates:
We are still trying to understand why performance drops so much after moving from Keras to TF.Estimator. The investigation is ongoing and we are getting to the bottom of it.
There are different variables we are looking at:
(1) How the input data is formatted
(2) Whether the model is pre-trained or not
(3) The function used to transform the Keras model to an Estimator
Getting there!

Weekly updates:
We confirmed that (1) how the input data is formatted and (3) the function used to transform the Keras model to an Estimator are not the cause of the Estimator's poor performance: a CNN model that we trained from scratch reaches the same performance in both Keras and Estimator.

Now we are looking into whether using a pre-trained model is what causes the poor performance, and how to solve it. We also want to make sure the model learns similar things when trained with Keras and with the Estimator.

Weekly updates:

We found out that the low performance of TF.Estimator was because it didn't load the pretrained weights properly. We solved the problem by using a snapshot of the untrained model with pretrained weights as a warm start. We were able to get similar results with Keras and TF.Estimator. We are wrapping things up and writing documentation 😃

Weekly updates:

While wrapping up documentation, we are doing a couple of final tests:

  • measuring performance in terms of time/accuracy of GPU on cluster vs GPU on stat machine
  • concurrent training on GPUs on cluster

The first one gives very promising results. The second one will probably be listed as a limitation of the current system: due to the YARN labeling scheme, having more than one job running on the same GPU node seems hard.

Weekly update:

We refactored and parameterized the code so that it is easier for others to use the pipeline to train their own image classifiers. The code has been uploaded to a GitHub repository, and a usage guide is provided. More documentation about our previous experiments will be uploaded soon.

An implementation of this could be useful inside the MachineVision extension, to supplement API-based image tagging with a homebrewed model specifically for Commons :)