
An End-to-End Image Classification Pipeline
Open, Medium, Public



The Wikimedia Foundation Research team has been working on projects related to computer vision tools, such as prototypes of image classifiers trained on Commons categories. The goal was to evaluate the feasibility of building in-house computer vision tools to support future platform evolution.

We would now like to expand the work on classifying images based on Commons categories into a proper pipeline, so that we can easily experiment with classified images in different ways. This project would be similar to what Aiko did during her Outreachy internship (T233707), but at a much larger scale.

Project objective

  • Build an end-to-end image classification pipeline based on Commons categories

Possible Workflow

  • Start from a list of image names with labels or Commons categories from which we will infer the labels (e.g. ‘Quality_Images’)
    • define a schema
  • On hadoop: retrieve image bytes by joining image / category list with image bytes table
    • [for later] potentially add more metadata from images
  • Import image bytes on the local machine
  • Train a model using Tensorflow / Keras
    • Finetune an existing model, e.g. Inception V4
    • [for later] Train a model from scratch
  • On hadoop: retrieve the set of images to classify
  • Run inference on images
    • Locally on a stat machine
    • [for later] Distributed on hadoop; the open issue is how to get Keras loaded on all nodes
  • Save results back to Hive / parquet
    • From a file imported from Local
    • Directly from the distributed job
  • Generate a pipeline for all the above using Kubeflow
  • Think of how to make the data public
    • Discussions with community members
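The "retrieve image bytes" step above amounts to an inner join between the labeled image list and the image-bytes table. Below is a minimal local illustration with pandas; on the cluster this would be a Spark/Hive join, and every name and byte string here is made up:

```python
import pandas as pd

# Hypothetical image/label list inferred from Commons categories.
labeled = pd.DataFrame({
    "image_name": ["Example_statue.jpg", "Example_plate.jpg"],
    "label": ["sculpture", "maiolica"],
})

# Stand-in for the table holding raw image bytes keyed by image name.
image_bytes = pd.DataFrame({
    "image_name": ["Example_statue.jpg", "Example_plate.jpg", "Other.jpg"],
    "image_bytes": [b"\xff\xd8...", b"\xff\xd8...", b"\xff\xd8..."],
})

# Inner join: only labeled images keep their bytes (2 rows here).
dataset = labeled.merge(image_bytes, on="image_name", how="inner")
```

The same `join(..., how="inner")` shape carries over directly to a Spark DataFrame on hadoop.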

We start with a micro pipeline and then iterate to improve/scale up the pipeline:

  1. Select a small set of images from two subcategories in the category
  2. Load the data on the local machine
  3. Finetune a model on Keras to classify images
  4. Run model inference on a sample of random images
  5. Put the results back on hdfs
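Step 3 of the micro pipeline can be sketched with Keras. This is a hedged, minimal version: in practice pretrained ImageNet weights would be loaded (`weights="imagenet"`), while `weights=None` here just avoids a download, and the classification head and hyperparameters are illustrative:

```python
import tensorflow as tf

# Pretrained backbone; weights=None in this sketch to avoid a download.
base = tf.keras.applications.Xception(
    include_top=False, weights=None, pooling="avg", input_shape=(299, 299, 3)
)
base.trainable = False  # freeze the backbone, train only the new head

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(1, activation="sigmoid"),  # e.g. two-class labels
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)  # datasets of 299x299 RGB tensors
```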

Event Timeline

Summary of the work done so far:

  • Imported the image data locally and saved it to TFRecords files
  • Finetuned an Xception model to classify images between 'sculptures' and 'maiolica'
  • Ran inference on test data locally


  • Trying to run distributed model inference using Keras, based on this doc
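One common answer to the "how to have Keras loaded by all nodes" question is to load the model lazily inside each worker process, so the (unpicklable) model object never travels through the Spark closure. A minimal sketch of that pattern follows; the path is hypothetical, and the placeholder dict stands in for a real `tf.keras.models.load_model` call:

```python
# Lazy per-process model cache: each worker loads the model once on first
# use instead of shipping a Keras model object through the Spark closure.
_model = None

def get_model(path="/path/to/saved_model"):  # hypothetical path
    """Load the model once per worker process and cache it."""
    global _model
    if _model is None:
        # Real pipeline: _model = tf.keras.models.load_model(path)
        _model = {"loaded_from": path}  # placeholder standing in for the model
    return _model

def predict_partition(rows):
    """Shape of a mapPartitions-style scoring function."""
    model = get_model()  # first call loads; later calls reuse the cache
    for row in rows:
        # Real pipeline: yield model.predict(preprocess(row))
        yield (row, model["loaded_from"])
```

On the cluster, `predict_partition` would be passed to something like `rdd.mapPartitions`, so the load cost is paid once per executor rather than once per row.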

Hi @AikoChou, thanks for looking into this! Wondering which project tag this task should have, so that others could also spot this task (and be aware of it, as it's sometimes easy in Wikimedia to work on the same things without knowing) when looking at project workboards... any idea? :) Is this SDC General? Does this already touch machine learning? Is T76886: Investigate computer vision image classification and description tools for shadow tags and search descriptions somehow related? Sorry for my cluelessness!

Miriam triaged this task as Medium priority. Mar 4 2021, 9:47 AM
Miriam added projects: Research, MachineVision.
Miriam added subscribers: fkaelin, elukey.

Hi @Aklapper, we added a few tags; this is mainly at a research stage for now, so I included it in the Research board and linked it to the parent epic task for image classification studies. Thanks for the heads up!

Weekly updates:

  • After converting a toy dataset of images into TFRecords, @AikoChou has trained a small model on stat1008 for object classification. We were then able to successfully run model inference on hadoop.
  • We worked on improving the modeling part: @AikoChou has worked on training and evaluating different models for image quality classification using Keras.
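The images-to-TFRecords conversion mentioned above can be sketched as follows. The feature names, toy byte strings, and binary label scheme are illustrative, not the actual schema used:

```python
import tensorflow as tf

def to_example(image_bytes: bytes, label: int) -> tf.train.Example:
    """Wrap one (image, label) pair as a tf.train.Example proto."""
    return tf.train.Example(features=tf.train.Features(feature={
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))

# Serialize a toy two-image dataset to a single TFRecord file.
with tf.io.TFRecordWriter("/tmp/toy.tfrecord") as writer:
    for img, lbl in [(b"\xff\xd8fake-jpeg-1", 0), (b"\xff\xd8fake-jpeg-2", 1)]:
        writer.write(to_example(img, lbl).SerializeToString())
```

At training time the file is read back with `tf.data.TFRecordDataset` and parsed with a matching `tf.io.parse_single_example` feature spec.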

Hi @elukey,

I want to use tf-yarn to train a simple model on the cluster, but I found that some environment variables need to be set up, as described in this doc:

  • JAVA_HOME: /usr/bin/java
  • HADOOP_HDFS_HOME: /usr/bin/hdfs

I'm not sure what to set for:

  • LD_LIBRARY_PATH (include the path to libjvm.so and, optionally, the path to libhdfs.so). I can't find these paths.
  • CLASSPATH (Hadoop jars must be added to the class path)
  • KRB5CCNAME (Path of Kerberos ticket cache if the Hadoop cluster is in secure mode)

Could you help me out with this? Thanks!


So these should be the values that you need. For the CLASSPATH, use the output of:

/usr/lib/hadoop/bin/hadoop classpath --glob

For KRB5CCNAME, in theory /tmp/krb5cc_$(id -u) should be sufficient.
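Putting the pieces together, the environment could be prepared in Python before importing tf-yarn. This is a hedged sketch using the values from this thread; the hadoop binary location is a guess, and the CLASSPATH step is skipped when the binary is absent:

```python
import os
import shutil
import subprocess

# Values quoted in the thread above.
os.environ["JAVA_HOME"] = "/usr/bin/java"
os.environ["HADOOP_HDFS_HOME"] = "/usr/bin/hdfs"

# CLASSPATH: the output of `hadoop classpath --glob` (skipped if hadoop
# is not installed on this machine; the fallback path is an assumption).
hadoop = shutil.which("hadoop") or "/usr/lib/hadoop/bin/hadoop"
if os.path.exists(hadoop):
    os.environ["CLASSPATH"] = subprocess.check_output(
        [hadoop, "classpath", "--glob"], text=True
    ).strip()

# Kerberos ticket cache: the standard per-uid default path.
os.environ["KRB5CCNAME"] = f"/tmp/krb5cc_{os.getuid()}"
```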

Let me know how it goes!

Weekly update:

We solved the system errors that occurred when training models on GPU-labeled nodes in the cluster. Currently, we're running parameter experiments to verify the functionality and compare the results on CPU and GPU.

Weekly update:
We were able to get similar results across CPU and GPU computation. One major issue is that the worker dispatching weights to the GPU worker (the Parameter Server) is overloaded and saturates the network. We are investigating ways to reduce or redistribute this load.