
Release image data for training
Open, Needs Triage, Public

Description

As part of the Wikipedia Image Captioning competition, we want to release image files for training (and, later on, for testing). To do so, we will need to go through the following steps:

  • Identify the subset of images we want to release from WIT (see the sketch after this list):
    • exclude images with large faces
    • exclude images that are candidates for deletion
    • exclude potentially harmful images
  • Pull the data from Swift to HDFS:
    • which resolution?
    • which metadata?
  • Get clearance for the data release:
    • can we release embeddings as well?
  • Release it for public usage:
    • where? the analytics dumps?
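
As a rough illustration of the first step, here is a minimal PySpark sketch of the filtering; the table paths and column names (face_stats, face_area_fraction, deletion_candidates) are assumptions for illustration, not the actual pipeline.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical inputs: the paths and schemas below are placeholders.
wit = spark.read.parquet('images/competition/wit_captioned_images')      # image_file_name, project, ...
faces = spark.read.parquet('images/competition/face_stats')              # image_file_name, face_area_fraction
deletion = spark.read.parquet('images/competition/deletion_candidates')  # image_file_name

release = (
    wit
    # Drop images that are candidates for deletion on Commons.
    .join(deletion, on='image_file_name', how='left_anti')
    # Drop images where the largest face covers more than 5% of the image area.
    .join(faces, on='image_file_name', how='left')
    .filter(F.col('face_area_fraction').isNull() | (F.col('face_area_fraction') <= 0.05))
)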

Event Timeline

Weekly updates:

  • Generated on HDFS the full list of images to download, starting from the list of captioned images in the WIT dataset. From this list we will remove the images that we should not share according to the security review. Also did some geographic/topical analysis of the training data.
  • Started the security review on Asana.

Update on the competition images dataset:

  • downloaded the 300px thumbnails from Swift
  • 250 GB of Avro files in total, stored on HDFS at /user/fab/images/competition/all/pixels/
  • 6,711,755 images
  • 32,200 images could not be downloaded (0.48%); the errors are logged at /user/fab/images/competition/all/swift_errors/
from pyspark.sql import functions as F

i = spark.read.format('avro').load('images/competition/all/pixels/*').cache()
e = spark.read.csv('images/competition/all/swift_errors/*', sep='\t')

# %%
i.printSchema()
# root
#  |-- i: integer (nullable = true)
#  |-- image_url: string (nullable = true)
#  |-- project: string (nullable = true)
#  |-- image_file_name: string (nullable = true)
#  |-- thumbnail_size: string (nullable = true)
#  |-- image: struct (nullable = true)
#  |    |-- image_bytes_b64: string (nullable = true)
#  |    |-- format: string (nullable = true)
#  |    |-- width: integer (nullable = true)
#  |    |-- height: integer (nullable = true)
#  |    |-- image_bytes_sha1: string (nullable = true)
#  |    |-- error: string (nullable = true)

i.count()
# 6711755

i.agg(F.mean('image.height')).show()
# +-----------------+
# |avg(image.height)|
# +-----------------+
# |275.5353631265522|
# +-----------------+

e.count()
# 32200

e.count()/i.count()
# 0.004797552950010839

# Bucket the download errors by the first three characters of the error message
# (column _c5), which is the HTTP status code for most failures.
group_error = F.udf(lambda ex: ex[:3], 'string')
e.groupBy(group_error('_c5')).count().orderBy('count', ascending=False).show()
# +-------------+-----+
# |<lambda>(_c5)|count|
# +-------------+-----+
# |          404|32135|
# |          429|   64|
# |          HTT|    1|
# +-------------+-----+

Weekly updates:

  • Risk assessment with the Security team completed.
  • Tested ResNet feature extraction at scale, so that we can release image embeddings together with the pixel data (a single-image sketch follows below).
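
A minimal single-image sketch of that feature extraction (the at-scale version runs over the pixel data on the cluster); it assumes torchvision's ImageNet-pretrained ResNet-50 and reads from a local file path, both of which are illustrative simplifications.

import torch
from torchvision import models, transforms
from PIL import Image

# Standard ImageNet preprocessing for the thumbnails.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# ImageNet-pretrained ResNet-50 with the final classification layer replaced by
# the identity, so the forward pass returns the 2048-d penultimate-layer features.
model = models.resnet50(pretrained=True)
model.fc = torch.nn.Identity()
model.eval()

def embed(path):
    img = Image.open(path).convert('RGB')
    with torch.no_grad():
        return model(preprocess(img).unsqueeze(0)).squeeze(0)  # shape: (2048,)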

Weekly updates:

  • Ran face detection at scale using the RetinaFace detector (a single-image sketch of the face-size statistic follows below).
  • Computed features at scale from the second-to-last layer of a ResNet-50 network trained on ImageNet.
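
A single-image sketch of the face-size statistic behind the filtering; it assumes the retina-face pip package (RetinaFace.detect_faces returning per-face facial_area boxes), and the helper name is illustrative.

from PIL import Image
from retinaface import RetinaFace  # assumed: pip package `retina-face`

def face_area_fraction(path):
    """Area of the largest detected face as a fraction of the total image area."""
    width, height = Image.open(path).size
    detections = RetinaFace.detect_faces(path)
    if not isinstance(detections, dict) or not detections:
        return 0.0  # no face detected
    largest = 0.0
    for face in detections.values():
        x1, y1, x2, y2 = face['facial_area']
        largest = max(largest, (x2 - x1) * (y2 - y1))
    return largest / (width * height)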

Weekly updates:

  • Data release details are mostly sorted out; we are figuring out the final details on the type of features we want to release.
  • For the baselines, we are fine-tuning a model on the WIT dataset, and we will probably release it as a baseline together with the corresponding embeddings.
  • Working on putting together the workshop proposal for NeurIPS 2021.

Weekly updates:

  • @tizianopiccardi computed the size of the image dataset as a function of face size. The idea is to remove all images where a face is the primary subject. It looks like, even with a conservative approach of removing all images where the largest face covers more than 5% of the total image area, we can retain about 90% of the original image dataset (see the sketch at the end of this update).

  • Also, only about 4k images out of the ~7M in the WIT dataset are candidates for deletion on Commons. We will remove those as well.
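
A sketch of how that retention figure can be checked, assuming a face_stats table with one row per image and the face_area_fraction column from the earlier sketches (both names are assumptions).

from pyspark.sql import functions as F

faces = spark.read.parquet('images/competition/face_stats')  # assumed path and schema

total = faces.count()
retained = faces.filter(
    F.col('face_area_fraction').isNull() | (F.col('face_area_fraction') <= 0.05)
).count()
print(retained / total)  # reported to be roughly 0.9 at the 5% face-area threshold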