Onboarding Project:
[x] Create a Meta Page
[x] Define the exploratory image domain: monuments
[x] Create dataset of items and image candidates
[x] Build a Relevance Model
[x] Build a Quality model
Q3 goal:
For defined categories (monuments, people):
[] A - get class identifiers from Wikidata
- A1 - [SPARQL] query to get the subclasses of the category of interest (e.g. humans, monuments)
- A2 - find special identifiers for a given category (e.g. 'designated heritage' for monuments)
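Sketch for A1, kept in Python so the query can be generated per category. Assumptions: the query is run against the Wikidata Query Service, where the wd:, wdt:, wikibase: and bd: prefixes are predefined; P279 is the "subclass of" property; the function name subclass_query is hypothetical. The QID of the root class (Q5 for human; the QID for monuments has to be looked up) is supplied by the caller.

```python
from typing import Optional


def subclass_query(root_qid: str, limit: Optional[int] = None) -> str:
    """Build a SPARQL query listing the transitive subclasses of root_qid."""
    # Reject anything that is not a plain QID, since it is pasted into the query.
    if not (root_qid.startswith("Q") and root_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata QID: {root_qid!r}")
    query = (
        "SELECT DISTINCT ?cls ?clsLabel WHERE {\n"
        # P279* follows "subclass of" zero or more times, so the root is included.
        f"  ?cls wdt:P279* wd:{root_qid} .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}"
    )
    if limit is not None:
        query += f"\nLIMIT {int(limit)}"
    return query


if __name__ == "__main__":
    print(subclass_query("Q5", limit=100))
```

The printed query can be pasted into the query service UI as-is; sending it programmatically is left out here on purpose.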
[] B - retrieve the Wikidata items that are instances of the classes identified in (A)
- B1 - [PYTHON] from the dumps, retain the IDs and properties of Wikidata items that are instances of the classes in (A):
-- B11 - id
-- B12 - description
-- B13 - labels
-- B14 - location
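Sketch for B1, assuming the JSON dump layout with one entity per line inside a top-level array, "instance of" stored under P31 and coordinates under P625. The helper names (claim_values, extract_item, filter_dump_lines) are hypothetical, and taking the location from P625 is an assumption about what (B14) means.

```python
import json


def claim_values(entity, prop):
    """Return the raw datavalue values of all claims for one property."""
    values = []
    for claim in entity.get("claims", {}).get(prop, []):
        datavalue = claim.get("mainsnak", {}).get("datavalue")
        if datavalue is not None:  # "novalue"/"somevalue" snaks carry no datavalue
            values.append(datavalue["value"])
    return values


def extract_item(entity, wanted_classes, lang="en"):
    """Keep id, description, labels and location if the item is in a wanted class."""
    classes = {
        v["id"] for v in claim_values(entity, "P31")
        if isinstance(v, dict) and "id" in v
    }
    if not classes & wanted_classes:
        return None
    coords = claim_values(entity, "P625")
    location = (coords[0]["latitude"], coords[0]["longitude"]) if coords else None
    return {
        "id": entity["id"],                                                   # B11
        "description": entity.get("descriptions", {}).get(lang, {}).get("value"),  # B12
        "labels": {c: l["value"] for c, l in entity.get("labels", {}).items()},    # B13
        "location": location,                                                 # B14
    }


def filter_dump_lines(lines, wanted_classes):
    """Yield retained items from an iterable of dump lines."""
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("[", "]", ""):  # array brackets and blank lines
            continue
        item = extract_item(json.loads(line), wanted_classes)
        if item is not None:
            yield item
```

Because the function only needs an iterable of lines, it can be fed from a gzip/bz2 file handle or from one shard of the dump, which leaves room for distributing the process.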
[] C - retrieve images from pages linked to wikidata items in (B)
C1 - [SQL] retrieve pages linked to wikidata items in (B)
C2 - [SQL] retrieve images in pages identified in (C1)
C3 - [SQL] retrieve page images in pages identified in (C1)
C4 - retain properties:
C41 - image name
C42 - image ID
C43 - image description
[] D - retrieve images from Commons by querying with the labels from (B)
D1 - [PYTHON] use the Commons API to retrieve images with query = label (B13) + location (B14)
D2 - [SQL] retain properties:
D21 - image name
D22 - image ID
D23 - image description
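Sketch for D1, assuming the standard MediaWiki Action API search on Commons (action=query, list=search, restricted to the File namespace, 6). The function names are hypothetical, and the location (B14) is assumed here to be a place-name string appended to the label. Only the request URL and the response parsing are shown; the HTTP call itself is left to whichever client is used. The search result carries the file title and page ID (D21, D22); the image description (D23) would need a separate lookup.

```python
from urllib.parse import urlencode

COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def build_search_url(label, location=None, limit=50):
    """Build a Commons full-text search URL for label (+ optional location)."""
    terms = label if not location else f"{label} {location}"
    params = {
        "action": "query",
        "list": "search",
        "srsearch": terms,
        "srnamespace": 6,   # File: namespace only
        "srlimit": limit,
        "format": "json",
    }
    return f"{COMMONS_API}?{urlencode(params)}"


def parse_search_response(payload):
    """Extract image name and image ID from a decoded search response."""
    hits = payload.get("query", {}).get("search", [])
    return [{"image_name": h["title"], "image_id": h["pageid"]} for h in hits]
```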
[] E - (retrieve Flickr images by querying with the labels from (B))
[] F - build a quality model
F1 - [PYTHON/Tensorflow] build a deep learning model based on quality images
F2 - [PYTHON] extend model with features including
F21 - size
F22 - compression quality
F23 - text features (richness, readability)
[] G - select candidates in C-E by relevance
G1 - [PYTHON] retain all images (C2) and page images (C3)
G2 - [PYTHON] retain only those images retrieved from commons (D) whose image name (D21) soft-matches the wikidata label (B13)
G3 - [PYTHON] filter out images according to additional computer vision tools
G31 - face detector for instances of humans
G32 - scene/object detectors for other instances
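Sketch for the soft match in G2. The plan does not say how "soft match" is defined; this version assumes a match when either every word of the label appears in the image name or the character-level similarity from difflib is at least a threshold. The 0.8 threshold and the function names are assumptions to be tuned on real data.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse punctuation/underscores into single spaces."""
    return re.sub(r"[\W_]+", " ", text).lower().strip()


def clean_image_name(image_name):
    """Drop the 'File:' prefix and the file extension from an image name."""
    name = re.sub(r"^File:", "", image_name, flags=re.IGNORECASE)
    name = re.sub(r"\.[A-Za-z0-9]{2,5}$", "", name)
    return normalize(name)


def soft_match(image_name, label, threshold=0.8):
    """True if the image name loosely matches the Wikidata label."""
    name, target = clean_image_name(image_name), normalize(label)
    if not name or not target:
        return False
    # Case 1: all label words occur in the image name (extra words allowed).
    if set(target.split()) <= set(name.split()):
        return True
    # Case 2: near-identical strings, e.g. small spelling differences.
    return SequenceMatcher(None, name, target).ratio() >= threshold


if __name__ == "__main__":
    print(soft_match("File:Eiffel_Tower_at_night.jpg", "Eiffel Tower"))
```

Labels exist in several languages (B13), so in practice the check would probably be run against every available label and accepted if any of them matches.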
[] H - sort images in (G) by quality
H1 - [BASH] download images filtered in (G)
H2 - [PYTHON] assign quality score from (F) and rank
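Sketch for H2. The quality model from (F) is not specified yet, so score_fn below is a hypothetical stand-in for it: any callable that maps an image record to a number, higher meaning better quality.

```python
def rank_by_quality(images, score_fn):
    """Return (score, image) pairs sorted from highest to lowest score."""
    scored = [(score_fn(image), image) for image in images]
    # Sort on the score only; ties keep their input order (sort is stable).
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


if __name__ == "__main__":
    # Placeholder scores, for illustration only (not model output).
    candidates = [
        {"image_name": "a.jpg", "score": 0.2},
        {"image_name": "b.jpg", "score": 0.9},
    ]
    for score, image in rank_by_quality(candidates, lambda img: img["score"]):
        print(score, image["image_name"])
```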
[] I - evaluate:
I1 - WikiShootMe
I2 - Wikidata game
I3 - Others?
Observations - bottlenecks:
A1 domain expertise for e.g. monuments
B1 distribute process
F1/G31 there is no GPU?
H1 pixels are not available internally