Synopsis
The Wikimedia Foundation Research team has been working on computer vision tools, including prototype image classifiers trained on Commons categories. That research aimed to evaluate the feasibility of building such tools in-house to support future platform evolution.
We would now like to expand the work on classifying images based on Commons categories into a full pipeline, so that we can easily use the classified images in different ways. This project would be similar to what Aiko did during her Outreachy internship (T233707), but at a much larger scale.
Project objective
- Build an end-to-end image classification pipeline based on Commons categories
Possible Workflow
- Start from a list of image names with labels, or Commons categories from which we will infer the labels (e.g. ‘Quality_Images’)
  - Define a schema for this list
- On Hadoop: retrieve image bytes by joining the image / category list with the image bytes table (see the join sketch after this list)
  - [for later] Potentially add more metadata from the images
- Import the image bytes to the local machine
  - Decide which format to use
  - Convert directly to TFRecords (see the TFRecord sketch after this list)
- Train a model using TensorFlow / Keras
  - Finetune an existing model, e.g. Inception V4 (see the finetuning sketch after this list)
  - [for later] Train a model from scratch
- On Hadoop: retrieve the set of images to classify
- Run inference on the images
  - Locally on a stat machine (see the inference sketch after this list)
  - [for later] Distributed on Hadoop (open issue: how to get Keras loaded on all the nodes)
- Save the results back to Hive / Parquet
  - From a file imported from the local machine
  - Directly from the distributed job
- Generate a pipeline for all of the above using Kubeflow (see the Kubeflow sketch after this list)
- Think about how to make the data public
  - Discussions with community members
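The sketches below illustrate some of these steps; they are rough starting points under stated assumptions, not a finalized design. First, the join step as a minimal PySpark sketch: the table, column, and path names (wmf.commons_image_bytes, image_name, image_bytes, the I/O paths) are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("retrieve-image-bytes").getOrCreate()

# Image names plus labels, prepared from the category list
# (assumed schema: image_name, label).
image_list = spark.read.csv("image_list.csv", header=True)

# Hypothetical Hive table holding raw image bytes keyed by file name.
image_bytes = spark.table("wmf.commons_image_bytes")

# Keep only the listed images, attaching their bytes to the labels.
labeled = image_list.join(image_bytes, on="image_name", how="inner")

# Placeholder output path; these parquet files would then be pulled to a
# local machine for the TFRecord conversion step.
labeled.write.mode("overwrite").parquet("/tmp/labeled_image_bytes")
```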
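Next, the TFRecord conversion: one possible way to serialize (image bytes, label) pairs locally. The label-to-integer-id mapping and the file name are assumptions.

```python
import tensorflow as tf

def make_example(image_bytes: bytes, label_id: int) -> tf.train.Example:
    # Pack one image and its integer label id into a TF Example proto.
    return tf.train.Example(features=tf.train.Features(feature={
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label_id])),
    }))

def write_tfrecords(rows, path="train.tfrecord"):
    # rows: iterable of (image_bytes, label_id) pairs, e.g. read from the
    # parquet output of the join step.
    with tf.io.TFRecordWriter(path) as writer:
        for image_bytes, label_id in rows:
            writer.write(make_example(image_bytes, label_id).SerializeToString())
```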
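For the finetuning step, a tf.keras sketch. Note that Keras does not ship Inception V4, so InceptionV3 stands in here; the class count, JPEG input, input size, and dataset pipeline are all assumptions.

```python
import tensorflow as tf

NUM_CLASSES = 2  # assumption, e.g. the two subcategories of the micro pipeline

def parse(record):
    feats = tf.io.parse_single_example(record, {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    })
    img = tf.io.decode_jpeg(feats["image"], channels=3)  # assumes JPEG input
    img = tf.image.resize(img, [299, 299])
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, feats["label"]

train_ds = (tf.data.TFRecordDataset("train.tfrecord")
            .map(parse)
            .shuffle(1000)
            .batch(32))

# Freeze the pretrained backbone and train only a fresh classification head.
base = tf.keras.applications.InceptionV3(include_top=False,
                                         weights="imagenet",
                                         pooling="avg")
base.trainable = False
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=3)
model.save("finetuned_inception")
```

A usual second stage would unfreeze some top layers of the backbone and continue training with a lower learning rate.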
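For local inference on a stat machine and exporting the results, a sketch that scores raw bytes one image at a time and writes a parquet file, which could then be put on HDFS and registered in Hive; the model path and output schema are assumptions.

```python
import pandas as pd
import tensorflow as tf

model = tf.keras.models.load_model("finetuned_inception")

def classify(names_and_bytes):
    # names_and_bytes: iterable of (image_name, raw_jpeg_bytes) pairs.
    rows = []
    for name, raw in names_and_bytes:
        img = tf.io.decode_jpeg(raw, channels=3)
        img = tf.image.resize(img, [299, 299])
        img = tf.keras.applications.inception_v3.preprocess_input(img)
        probs = model.predict(img[tf.newaxis, ...], verbose=0)[0]
        rows.append({"image_name": name,
                     "predicted_label": int(probs.argmax()),
                     "confidence": float(probs.max())})
    return pd.DataFrame(rows)

# Usage: write locally, then e.g. `hdfs dfs -put predictions.parquet <path>`,
# or load the file with Spark and save it as a Hive table.
# classify(rows).to_parquet("predictions.parquet")
```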
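Finally, a rough Kubeflow Pipelines sketch (v1 SDK) chaining the steps together; the container images are hypothetical placeholders, each assumed to wrap one of the jobs above.

```python
from kfp import compiler, dsl

@dsl.pipeline(name="commons-image-classification",
              description="Retrieve bytes, build TFRecords, finetune, classify, export.")
def commons_pipeline():
    # Each ContainerOp wraps one pipeline stage; the images are placeholders.
    retrieve = dsl.ContainerOp(name="retrieve-bytes",
                               image="example/retrieve-bytes:latest")
    tfrecords = dsl.ContainerOp(name="build-tfrecords",
                                image="example/build-tfrecords:latest").after(retrieve)
    train = dsl.ContainerOp(name="finetune-model",
                            image="example/finetune:latest").after(tfrecords)
    infer = dsl.ContainerOp(name="run-inference",
                            image="example/infer:latest").after(train)
    dsl.ContainerOp(name="export-to-hive",
                    image="example/export:latest").after(infer)

if __name__ == "__main__":
    compiler.Compiler().compile(commons_pipeline, "pipeline.yaml")
```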
We will start with a micro pipeline and then iterate to improve and scale it up:
- Select a small set of images from two subcategories of the chosen category (see the sampling sketch after this list)
- Load the data onto the local machine
- Finetune a model in Keras to classify the images
- Run model inference on a sample of random images
- Put the results back on HDFS
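As a sketch of the first micro-pipeline step, the sample could come from a Spark SQL query over the sqooped MediaWiki tables. The table and partition names below follow the MediaWiki schema but should be treated as assumptions, and the two subcategories are hypothetical examples.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("micro-pipeline-sample").getOrCreate()

# Hypothetical pair of subcategories to use as the two labels.
SUBCATS = ("Quality_images_of_birds", "Quality_images_of_insects")

sample = spark.sql(f"""
    SELECT p.page_title AS image_name, cl.cl_to AS label
    FROM wmf_raw.mediawiki_categorylinks cl
    JOIN wmf_raw.mediawiki_page p
      ON p.page_id = cl.cl_from
     AND p.snapshot = cl.snapshot
     AND p.wiki_db = cl.wiki_db
    WHERE cl.cl_to IN {SUBCATS}
      AND p.page_namespace = 6          -- File: pages
      AND cl.snapshot = '2020-12'       -- placeholder snapshot
      AND cl.wiki_db = 'commonswiki'
""").limit(200)                         # keep the micro pipeline small

sample.write.mode("overwrite").csv("/tmp/micro_sample", header=True)
```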