Onboarding Project:
[x] Create a Meta Page
[x] Define the exploratory image domain: monuments
[x] Create dataset of items and image candidates
[x] Build a Relevance Model
[x] Build a Quality Model
Q3 goal:
For defined categories (monuments, people):
1) Retrieve lists of wikidata items T184734
2) Retrieve candidate images for wikidata items T184737
3) Filter candidate images by relevance T184738
4) Build a model for image quality T184739
5) Filter candidate images by quality T184740
6) Evaluate

- **A - get class identifiers**
[] A1 - [SPARQL] SPARQL query to get subclasses of the category of interest (e.g. humans, monuments); see the sketch below
[] A2 - find special identifiers for a given category (e.g. 'designated heritage' for monuments)
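
A minimal sketch for A1 against the public Wikidata Query Service; the root class Q4989906 ("monument") is an example to swap for the category of interest:

```python
import requests

WDQS = "https://query.wikidata.org/sparql"

# P279* walks the subclass-of hierarchy transitively from the root class.
QUERY = """
SELECT DISTINCT ?cls ?clsLabel WHERE {
  ?cls wdt:P279* wd:Q4989906 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

def get_subclasses():
    resp = requests.get(
        WDQS,
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "image-candidate-bot/0.1 (onboarding)"},
    )
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        # entity URIs look like http://www.wikidata.org/entity/Q4989906
        yield row["cls"]["value"].rsplit("/", 1)[-1], row["clsLabel"]["value"]

for qid, label in get_subclasses():
    print(qid, label)
```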
- **B - retrieve wikidata items that are instances of classes identified in (A)**
[] B1 - [PYTHON] from the dumps, retain IDs and properties of wikidata items that are instances of classes in (A) (see the dump-parsing sketch below):
--- B11 - id
--- B12 - description
--- B13 - labels
--- B14 - location
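
A sketch of the B1 dump pass, assuming the Wikidata JSON dump (`latest-all.json.bz2`, one entity per line) and the class set from (A); the filter uses P31 (instance of), and P625 (coordinate location) stands in for B14. English-only labels/descriptions are an assumption:

```python
import bz2
import json

CLASS_QIDS = {"Q4989906"}  # placeholder: the subclasses found in (A)

def iter_items(dump_path):
    with bz2.open(dump_path, mode="rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):  # the dump is one big JSON array
                continue
            yield json.loads(line)

def instance_of(item):
    # P31 = instance of
    for claim in item.get("claims", {}).get("P31", []):
        yield claim["mainsnak"].get("datavalue", {}).get("value", {}).get("id")

def retained(item):
    coords = item.get("claims", {}).get("P625", [])  # P625 = coordinate location
    return {
        "id": item["id"],                                                         # B11
        "description": item.get("descriptions", {}).get("en", {}).get("value"),   # B12
        "label": item.get("labels", {}).get("en", {}).get("value"),               # B13
        "location": coords[0]["mainsnak"].get("datavalue", {}).get("value") if coords else None,  # B14
    }

for item in iter_items("latest-all.json.bz2"):
    if any(qid in CLASS_QIDS for qid in instance_of(item)):
        print(json.dumps(retained(item)))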
- **C - retrieve images from pages linked to wikidata items in (B)**
[] C1 - [SQL] retrieve pages linked to wikidata items in (B); see the sketch below
[] C2 - [SQL] retrieve images in pages identified in (C1)
[] C3 - [SQL] retrieve page images in pages identified in (C1)
[] C4 - retain properties:
--- C41 - image name
--- C42 - image ID
--- C43 - image description
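
A possible shape for C1-C3 against the wiki replicas; the replica host, `pymysql` setup, and table details (`page_props` with `wikibase_item`, `imagelinks`, and the PageImages properties) are assumptions to verify against the current replica schema:

```python
import os
import pymysql

conn = pymysql.connect(
    host="enwiki.analytics.db.svc.wikimedia.cloud",  # assumed replica host
    database="enwiki_p",
    read_default_file=os.path.expanduser("~/replica.my.cnf"),
)

# C1: pages linked to a wikidata item, via the page_props table.
PAGES_FOR_ITEM = """
SELECT page_id, page_title FROM page
JOIN page_props ON pp_page = page_id
WHERE pp_propname = 'wikibase_item' AND pp_value = %s
"""

# C2: images used on a page, from the imagelinks table.
IMAGES_IN_PAGE = "SELECT il_to FROM imagelinks WHERE il_from = %s"

# C3: the lead "page image" selected by the PageImages extension.
PAGE_IMAGE = """
SELECT pp_value FROM page_props
WHERE pp_page = %s AND pp_propname IN ('page_image', 'page_image_free')
"""

with conn.cursor() as cur:
    cur.execute(PAGES_FOR_ITEM, ("Q243",))  # Q243 = Eiffel Tower, as an example
    for page_id, page_title in cur.fetchall():
        cur.execute(IMAGES_IN_PAGE, (page_id,))
        images = [r[0] for r in cur.fetchall()]      # C2
        cur.execute(PAGE_IMAGE, (page_id,))
        page_images = [r[0] for r in cur.fetchall()]  # C3
        print(page_title, images, page_images)
```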
- **D - retrieve images from Commons returned by querying with labels of (B)**
[] D1 - [PYTHON] use the Commons API to retrieve images from query = label (B13) + location (B14); see the sketch below
[] D2 - [SQL] retain properties:
--- D21 - image name
--- D22 - image ID
--- D23 - image description
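
A sketch of the D1 query via the Commons search API; building the search string as label + location is an assumption about the query shape:

```python
import requests

COMMONS_API = "https://commons.wikimedia.org/w/api.php"

def search_commons(label, location=""):
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f"{label} {location}".strip(),
        "srnamespace": 6,  # File: namespace
        "srlimit": 50,
        "format": "json",
    }
    resp = requests.get(COMMONS_API, params=params,
                        headers={"User-Agent": "image-candidate-bot/0.1"})
    resp.raise_for_status()
    for hit in resp.json()["query"]["search"]:
        # hit["title"] is the image name (D21); pageid can serve as image ID (D22)
        yield hit["pageid"], hit["title"]

for pageid, title in search_commons("Eiffel Tower", "Paris"):
    print(pageid, title)
```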
- **E - (retrieve Flickr images returned by querying with labels of (B)); see the tentative sketch below**
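
Step (E) is itself tentative; if pursued, a sketch using the public flickr.photos.search method (an API key and the free-text query shape are assumptions):

```python
import requests

FLICKR_REST = "https://api.flickr.com/services/rest/"

def search_flickr(api_key, label):
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "text": label,  # the wikidata label (B13)
        "format": "json",
        "nojsoncallback": 1,
    }
    resp = requests.get(FLICKR_REST, params=params)
    resp.raise_for_status()
    return resp.json()["photos"]["photo"]
```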
- **F - build a quality model**
[] F1 - [PYTHON/TensorFlow] build a deep learning model trained on quality images (see the sketch below)
[] F2 - [PYTHON] extend the model with features including:
--- F21 - size
--- F22 - compression quality
--- F23 - text features (richness, readability)
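
One way F1 could start: transfer learning from an ImageNet-pretrained backbone with a binary "quality" head (recent TF 2.x APIs); the `images/` folder layout is hypothetical:

```python
import tensorflow as tf

def build_quality_model(input_shape=(224, 224, 3)):
    backbone = tf.keras.applications.MobileNetV2(
        input_shape=input_shape, include_top=False, weights="imagenet")
    backbone.trainable = False  # train only the new head at first
    return tf.keras.Sequential([
        tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNetV2 expects [-1, 1]
        backbone,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # P(image is a quality image)
    ])

# Hypothetical layout: images/quality/ and images/other/ (one folder per class).
train = tf.keras.utils.image_dataset_from_directory(
    "images", image_size=(224, 224), label_mode="binary")

model = build_quality_model()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train, epochs=5)
model.save("quality.keras")
```

The F2 features (size, compression quality, text richness/readability) could then be concatenated with the CNN embedding in a second-stage model.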
- **G - select candidates in C-E by relevance**
[] G1 - [PYTHON] retain all images (C2) and page images (C3)
[] G2 - [PYTHON] retain only those images retrieved from Commons (D) whose image name (D21) soft-matches the wikidata label (B13); see the soft-match sketch below
[] G3 - [PYTHON] filter out images according to additional computer vision tools
--- G31 - face detector for instances of humans
--- G32 - scene/object detectors for other instances
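
A sketch of G2's soft match using difflib from the standard library; the 0.6 threshold is a placeholder to tune:

```python
import re
from difflib import SequenceMatcher

def normalize(name):
    # Strip the File: prefix, the extension, and punctuation; lowercase.
    name = re.sub(r"^File:", "", name)
    name = re.sub(r"\.\w+$", "", name)
    return re.sub(r"[\W_]+", " ", name).lower().strip()

def soft_match(image_name, label, threshold=0.6):
    return SequenceMatcher(None, normalize(image_name),
                           normalize(label)).ratio() >= threshold

print(soft_match("File:Tour Eiffel 2009.jpg", "Tour Eiffel"))  # True
```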
- **H - sort images in (G) by quality:**
[] H1 - [BASH] download images filtered in (G)
[] H2 - [PYTHON] assign quality score from (F) and rank (see the sketch below)
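
A combined sketch of H1/H2, assuming the (F) model was saved as `quality.keras` and that candidate URLs come from (G); note the pixel-access bottleneck listed below:

```python
import requests
import tensorflow as tf

model = tf.keras.models.load_model("quality.keras")  # the model trained in (F)

def quality_score(url):
    # H1: fetch the image bytes (pixel access may have to happen off-cluster).
    raw = requests.get(url, headers={"User-Agent": "image-candidate-bot/0.1"}).content
    img = tf.image.decode_image(raw, channels=3, expand_animations=False)
    img = tf.image.resize(img, (224, 224))
    # H2: the model outputs P(quality image); use it directly as the score.
    return float(model.predict(tf.expand_dims(img, 0), verbose=0)[0][0])

candidates = []  # filled with image URLs surviving the relevance filter (G)
ranked = sorted(candidates, key=quality_score, reverse=True)
```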
- **I - evaluate:**
[] I1 - WikiShootMe
[] I2 - Wikidata game
[] I3 - Others?
Observations - bottlenecks:
- A1 - needs domain expertise (e.g. for monuments)
- B1 - distribute the process for reading the dumps
- F1/G31 - there is no GPU?
- H1 - speed-up with GPU; also, pixels are not available internally