
Release image data for training
Closed, Resolved (Public)

Authored By: Miriam
Mar 23 2021, 9:39 AM

Description

As part of the Wikipedia Image Captioning competition, we want to release image files for training (and, later on, for testing). To do so, we will need to go through the following steps:

  • Identify the subset of images we want to release from WIT:
    • exclude images with large faces
    • exclude images that are candidates for deletion
    • exclude potentially harmful images
  • Pull data from Swift to HDFS:
    • which resolution?
    • which metadata?
  • Get clearance for data release:
    • can we release embeddings as well?
  • Release it for public usage:
    • where? analytics dumps?

Event Timeline

Weekly updates:

  • Generated, on HDFS, the full list of images to download from the list of captioned images in the WIT dataset (a sketch of one way to derive such a list follows below). From this list, we will remove the images that we shouldn't share according to the security review. Also did some geographic/topical analysis of the training data.
  • Started the security review on Asana.
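
A minimal PySpark sketch of deriving that download list, assuming the WIT TSV dumps are already on HDFS; the input path, the output path, and the column names (image_url, caption_reference_description) are assumptions based on the published WIT schema, not the exact code used for this task.

from pyspark.sql import functions as F

# Hypothetical input path; WIT is distributed as gzipped TSV with a header row.
wit = spark.read.csv('wit/*.tsv.gz', sep='\t', header=True)

# Keep only rows that carry a reference caption, then deduplicate by image URL
# to obtain the list of files to fetch from Swift.
to_download = (wit
    .where(F.col('caption_reference_description').isNotNull())
    .select('image_url')
    .distinct())

# Hypothetical output location.
to_download.write.csv('images/competition/all/to_download', sep='\t')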

Update on the competition images dataset:

  • downloaded 300px thumbnails from Swift
  • 250 GB total of Avro files on HDFS, under /user/fab/images/competition/all/pixels/
  • 6,711,755 images
  • 32,200 images couldn't be downloaded (0.48%), listed under /user/fab/images/competition/all/swift_errors/
from pyspark.sql import functions as F  # used for F.mean and F.udf below

# Load the downloaded thumbnails (Avro) and the Swift download errors (TSV)
i = spark.read.format('avro').load('images/competition/all/pixels/*').cache()
e = spark.read.csv('images/competition/all/swift_errors/*', sep='\t')

# %%
i.printSchema()
# root
#  |-- i: integer (nullable = true)
#  |-- image_url: string (nullable = true)
#  |-- project: string (nullable = true)
#  |-- image_file_name: string (nullable = true)
#  |-- thumbnail_size: string (nullable = true)
#  |-- image: struct (nullable = true)
#  |    |-- image_bytes_b64: string (nullable = true)
#  |    |-- format: string (nullable = true)
#  |    |-- width: integer (nullable = true)
#  |    |-- height: integer (nullable = true)
#  |    |-- image_bytes_sha1: string (nullable = true)
#  |    |-- error: string (nullable = true)

i.count()
# 6711755

i.agg(F.mean('image.height')).show()
# +-----------------+
# |avg(image.height)|
# +-----------------+
# |275.5353631265522|
# +-----------------+

e.count()
# 32200

e.count()/i.count()
# 0.004797552950010839

# Group errors by the first three characters of the message, i.e. the HTTP
# status code (404 = file not found, 429 = rate-limited).
group_error = F.udf(lambda ex: ex[:3], 'string')
e.groupBy(group_error('_c5')).count().orderBy('count', ascending=False).show()
# +-------------+-----+
# |<lambda>(_c5)|count|
# +-------------+-----+
# |          404|32135|
# |          429|   64|
# |          HTT|    1|
# +-------------+-----+

Weekly updates:

  • Risk assessment with Security team completed
  • Tested ResNet feature extraction at scale, so that we can release image embeddings together with pixel data.

Weekly updates:

  • Computed face detection at scale using the RetinaFace detector
  • Computed features at scale from the second-to-last layer of a ResNet-50 network trained on ImageNet (see the sketch below)
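
For reference, a minimal sketch of how this kind of embedding extraction can be applied to the Spark dataframe of pixels above; the UDF, its name, and the use of torchvision's ImageNet-pretrained ResNet-50 are illustrative assumptions rather than the exact pipeline code.

import base64, io
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf('array<float>')
def resnet_embedding(image_bytes_b64: pd.Series) -> pd.Series:
    # Heavy imports live inside the UDF so each executor loads them itself.
    import torch
    from torchvision import models, transforms
    from PIL import Image

    # ImageNet-pretrained ResNet-50 with the final classification layer
    # removed: the output is the 2048-d global-average-pooled feature,
    # i.e. the second-to-last layer.
    model = torch.nn.Sequential(
        *list(models.resnet50(pretrained=True).children())[:-1]).eval()
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    def embed(b64):
        img = Image.open(io.BytesIO(base64.b64decode(b64))).convert('RGB')
        with torch.no_grad():
            return model(preprocess(img).unsqueeze(0)).flatten().tolist()

    return image_bytes_b64.map(embed)

# Hypothetical usage against the dataframe loaded earlier:
# emb = i.withColumn('embedding', resnet_embedding('image.image_bytes_b64'))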

Weekly updates:

  • Data release details are mostly sorted; we are figuring out the last details on the type of features we want to release.
  • For baselines, we are fine-tuning a model on the WIT dataset, and we will probably release it as a baseline, together with the corresponding embeddings.
  • Working on putting together the workshop proposal for NeurIPS 2021.

Weekly updates:

  • @tizianopiccardi computed the size of the image dataset as a function of face size. The idea is to remove all images where a face is the primary subject. It looks like, even with a conservative approach (removing all images where the largest face covers more than 5% of the total image area), we can retain about 90% of the original image dataset. A filtering sketch follows after this list.

image.png (317×558 px, 21 KB)

image (1).png (317×558 px, 19 KB)

  • Also, only about 4k of the 7M images in the WIT dataset are candidates for deletion on Commons. We will remove those as well.
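
For illustration, a minimal sketch of the 5% face-area rule, assuming per-image face bounding boxes (e.g. from the RetinaFace run above) are already available; the function and the box format are assumptions for the example.

def max_face_ratio(face_boxes, img_width, img_height):
    # face_boxes: list of (x1, y1, x2, y2) pixel bounding boxes for one image,
    # e.g. as produced by a RetinaFace detector (assumed input format).
    if not face_boxes:
        return 0.0
    areas = [(x2 - x1) * (y2 - y1) for (x1, y1, x2, y2) in face_boxes]
    return max(areas) / float(img_width * img_height)

# Keep an image only if its largest detected face covers at most 5% of it.
FACE_AREA_THRESHOLD = 0.05

boxes = [(10, 20, 60, 90)]  # one 50x70 face (example values)
keep = max_face_ratio(boxes, 558, 317) <= FACE_AREA_THRESHOLD  # ~2% -> keep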

Weekly updates:
As part of our conversations with Kaggle and the rest of the organizing team, we have figured out the schema for the data deliverable. We should be able to release our part next week.

Screenshot from 2021-05-21 18-09-21.png (393×831 px, 59 KB)

Weekly updates: discussions on the nature and structure of the data are still ongoing.

Weekly updates: the dataset release to-dos are listed in this doc. Fabian and Tiziano will work on releasing the image pixels and the embeddings, together with the image metadata and license URL, by the end of next week.