
Release image data for training
Open, Needs Triage, Public

Description

As part of the Wikipedia Image Captioning competition, we want to release image files for training (and, later on, for testing). To do so, we will need to go through the following steps:

  • Identify the subset of images we want to release from WIT (see the sketch after this list):
    • exclude images with large faces
    • exclude images that are candidates for deletion
    • exclude potentially harmful images
  • Pull the data from Swift to HDFS:
    • which resolution?
    • which metadata?
  • Get clearance for the data release:
    • can we release embeddings as well?
  • Release it for public usage:
    • where? the analytics dumps?
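
As a rough illustration of the first step, here is a minimal PySpark sketch of the filtering; the table paths and column names (face_stats, face_area_fraction, deletion_candidates) are assumptions for illustration, not the actual pipeline.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical inputs: the paths and schemas below are placeholders.
wit = spark.read.parquet('images/competition/wit_captioned_images')      # image_file_name, project, ...
faces = spark.read.parquet('images/competition/face_stats')              # image_file_name, face_area_fraction
deletion = spark.read.parquet('images/competition/deletion_candidates')  # image_file_name

release = (
    wit
    # Drop images that are candidates for deletion on Commons.
    .join(deletion, on='image_file_name', how='left_anti')
    # Drop images where the largest face covers more than 5% of the image area.
    .join(faces, on='image_file_name', how='left')
    .filter(F.col('face_area_fraction').isNull() | (F.col('face_area_fraction') <= 0.05))
)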

Event Timeline

Weekly updates:

  • Generated on HDFS the full list of images to download, starting from the list of captioned images in the WIT dataset. From this list we will remove the images that we should not share according to the security review. Also did some geographic/topical analysis of the training data.
  • Started the security review on Asana.

Update on the competition images dataset:

  • downloaded the 300px thumbnails from Swift
  • 250 GB of Avro files in total, stored on HDFS at /user/fab/images/competition/all/pixels/
  • 6,711,755 images
  • 32,200 images could not be downloaded (0.48%); the errors are logged at /user/fab/images/competition/all/swift_errors/
from pyspark.sql import functions as F

i = spark.read.format('avro').load('images/competition/all/pixels/*').cache()
e = spark.read.csv('images/competition/all/swift_errors/*', sep='\t')

# %%
i.printSchema()
# root
#  |-- i: integer (nullable = true)
#  |-- image_url: string (nullable = true)
#  |-- project: string (nullable = true)
#  |-- image_file_name: string (nullable = true)
#  |-- thumbnail_size: string (nullable = true)
#  |-- image: struct (nullable = true)
#  |    |-- image_bytes_b64: string (nullable = true)
#  |    |-- format: string (nullable = true)
#  |    |-- width: integer (nullable = true)
#  |    |-- height: integer (nullable = true)
#  |    |-- image_bytes_sha1: string (nullable = true)
#  |    |-- error: string (nullable = true)

i.count()
# 6711755

i.agg(F.mean('image.height')).show()
# +-----------------+
# |avg(image.height)|
# +-----------------+
# |275.5353631265522|
# +-----------------+

e.count()
# 32200

e.count()/i.count()
# 0.004797552950010839

# Bucket the download errors by the first three characters of the error message
# (column _c5), which is the HTTP status code for most failures.
group_error = F.udf(lambda ex: ex[:3], 'string')
e.groupBy(group_error('_c5')).count().orderBy('count', ascending=False).show()
# +-------------+-----+
# |<lambda>(_c5)|count|
# +-------------+-----+
# |          404|32135|
# |          429|   64|
# |          HTT|    1|
# +-------------+-----+

Weekly updates:

  • Risk assessment with the Security team completed.
  • Tested ResNet feature extraction at scale, so that we can release image embeddings together with the pixel data (a single-image sketch follows below).
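
A minimal single-image sketch of that feature extraction (the at-scale version runs over the pixel data on the cluster); it assumes torchvision's ImageNet-pretrained ResNet-50 and reads from a local file path, both of which are illustrative simplifications.

import torch
from torchvision import models, transforms
from PIL import Image

# Standard ImageNet preprocessing for the thumbnails.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# ImageNet-pretrained ResNet-50 with the final classification layer replaced by
# the identity, so the forward pass returns the 2048-d penultimate-layer features.
model = models.resnet50(pretrained=True)
model.fc = torch.nn.Identity()
model.eval()

def embed(path):
    img = Image.open(path).convert('RGB')
    with torch.no_grad():
        return model(preprocess(img).unsqueeze(0)).squeeze(0)  # shape: (2048,)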

Weekly updates:

  • Ran face detection at scale using the RetinaFace detector (a single-image sketch of the face-size statistic follows below).
  • Computed features at scale from the second-to-last layer of a ResNet-50 network trained on ImageNet.
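
A single-image sketch of the face-size statistic behind the filtering; it assumes the retina-face pip package (RetinaFace.detect_faces returning per-face facial_area boxes), and the helper name is illustrative.

from PIL import Image
from retinaface import RetinaFace  # assumed: pip package `retina-face`

def face_area_fraction(path):
    """Area of the largest detected face as a fraction of the total image area."""
    width, height = Image.open(path).size
    detections = RetinaFace.detect_faces(path)
    if not isinstance(detections, dict) or not detections:
        return 0.0  # no face detected
    largest = 0.0
    for face in detections.values():
        x1, y1, x2, y2 = face['facial_area']
        largest = max(largest, (x2 - x1) * (y2 - y1))
    return largest / (width * height)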

Weekly updates:

  • Data release details are mostly sorted out; we are figuring out the final details on the type of features we want to release.
  • For the baselines, we are fine-tuning a model on the WIT dataset, and we will probably release it as a baseline together with the corresponding embeddings.
  • Working on putting together the workshop proposal for NeurIPS 2021.

Weekly updates:

  • @tizianopiccardi computed the size of the image dataset as a function of face size. The idea is to remove all images where a face is the primary subject. It looks like, even with a conservative approach of removing all images where the largest face covers more than 5% of the total image area, we can retain about 90% of the original image dataset (see the sketch at the end of this update).

  • Also, only about 4k images out of the ~7M in the WIT dataset are candidates for deletion on Commons. We will remove those as well.
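
A sketch of how that retention figure can be checked, assuming a face_stats table with one row per image and the face_area_fraction column from the earlier sketches (both names are assumptions).

from pyspark.sql import functions as F

faces = spark.read.parquet('images/competition/face_stats')  # assumed path and schema

total = faces.count()
retained = faces.filter(
    F.col('face_area_fraction').isNull() | (F.col('face_area_fraction') <= 0.05)
).count()
print(retained / total)  # reported to be roughly 0.9 at the 5% face-area threshold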