Onboarding Project:
[x] Create a Meta Page
[x] Define the exploratory image domain: monuments
[x] Create dataset of items and image candidates
[x] Build a Relevance Model
[x] Build a Quality Model
Q3 goal:
For defined categories (monuments, people):
[] A - get class identifiers from Wikidata
-A1 - [SPARQL] query to get all subclasses of the category of interest (e.g. humans, monuments)
-A2 - find special identifiers for a given category (e.g. 'designated heritage' for monuments)
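A minimal sketch of A1, assuming the query is built in Python and sent to the Wikidata Query Service, where the `wd:` and `wdt:` prefixes are predefined. The helper name `subclass_query` is hypothetical. P279 is Wikidata's "subclass of" property and Q5 is the class "human"; the root class for monuments would need to be looked up.

```python
import re
from typing import Optional


def subclass_query(root_qid: str, limit: Optional[int] = None) -> str:
    """Build a SPARQL query listing root_qid and all its transitive subclasses.

    Hypothetical helper: it only builds the query string and does not send it.
    """
    # Accept only well-formed item IDs so nothing else ends up in the query.
    if not re.fullmatch(r"Q[1-9]\d*", root_qid):
        raise ValueError(f"not a Wikidata item ID: {root_qid!r}")
    # wdt:P279* follows "subclass of" zero or more times, so the root
    # class itself is part of the result.
    query = (
        "SELECT DISTINCT ?cls WHERE {\n"
        f"  ?cls wdt:P279* wd:{root_qid} .\n"
        "}"
    )
    if limit is not None:
        query += f"\nLIMIT {int(limit)}"
    return query


if __name__ == "__main__":
    print(subclass_query("Q5", limit=10))
```

The class IDs returned by such a query would be the input to (B).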
[] B - retrieve Wikidata items that are instances of the classes identified in (A)
-B1 - [PYTHON] from the dumps, retain the following properties of Wikidata items that are instances of the classes in (A):
--B11 - id
--B12 - description
--B13 - labels
--B14 - location
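A rough sketch of B1, assuming the standard Wikidata JSON dump layout (one entity per line inside a JSON array, claims keyed by property ID). P31 is "instance of" and P625 is "coordinate location"; using P625 for B14 is an assumption, as are the function names.

```python
import json


def parse_dump_line(line):
    """Parse one line of the JSON dump; return None for the array brackets."""
    line = line.strip().rstrip(",")
    if line in ("", "[", "]"):
        return None
    return json.loads(line)


def claim_values(entity, prop):
    """Return the main values of all claims of `prop` that carry a value."""
    values = []
    for claim in entity.get("claims", {}).get(prop, []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") == "value":
            values.append(snak["datavalue"]["value"])
    return values


def extract_item(entity, class_ids):
    """Keep id, description, labels, location if the entity is an instance
    (P31) of one of class_ids; otherwise return None."""
    classes = {v["id"] for v in claim_values(entity, "P31")
               if isinstance(v, dict) and "id" in v}
    if not classes & set(class_ids):
        return None
    coords = claim_values(entity, "P625")
    location = (coords[0]["latitude"], coords[0]["longitude"]) if coords else None
    return {
        "id": entity["id"],
        "description": {k: v["value"] for k, v in entity.get("descriptions", {}).items()},
        "labels": {k: v["value"] for k, v in entity.get("labels", {}).items()},
        "location": location,
    }
```

Because each dump line is handled independently, the same functions could be mapped over chunks of the dump in parallel.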
[] C - retrieve images from pages linked to Wikidata items in (B)
-C1 - [SQL] retrieve pages linked to wikidata items in (B)
-C2 - [SQL] retrieve images in pages identified in (C1)
-C3 - [SQL] retrieve page images in pages identified in (C1)
-C4 - retain properties:
--C41 - image name
--C42 - image ID
--C43 - image description
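A possible shape for C1-C3, run here against an in-memory SQLite database so the sketch is self-contained. The two tables are simplified stand-ins modeled on MediaWiki's `page_props` and `imagelinks`. The column names and the `wikibase_item` / `page_image_free` property names are assumptions to check against the replica schema actually in use, and all rows are made up.

```python
import sqlite3


def make_demo_db():
    """Create a toy database with made-up rows (not real wiki data)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE page_props (pp_page INTEGER, pp_propname TEXT, pp_value TEXT);
        CREATE TABLE imagelinks (il_from INTEGER, il_to TEXT);
    """)
    con.executemany("INSERT INTO page_props VALUES (?, ?, ?)", [
        (1, "wikibase_item", "Q100"),
        (1, "page_image_free", "Example_front.jpg"),
        (2, "wikibase_item", "Q200"),
    ])
    con.executemany("INSERT INTO imagelinks VALUES (?, ?)", [
        (1, "Example_front.jpg"), (1, "Example_map.png"), (2, "Other.jpg"),
    ])
    return con


def images_for_items(con, qids):
    """Map each item ID to the images (C2) and page images (C3) of its pages."""
    qids = list(qids)
    if not qids:
        return {}
    marks = ",".join("?" * len(qids))
    # C1: pages linked to the given Wikidata items.
    pages = con.execute(
        "SELECT pp_page, pp_value FROM page_props "
        f"WHERE pp_propname = 'wikibase_item' AND pp_value IN ({marks})",
        qids,
    ).fetchall()
    result = {}
    for page_id, qid in pages:
        entry = result.setdefault(qid, {"images": [], "page_images": []})
        # C2: every image used on the page.
        entry["images"] += [r[0] for r in con.execute(
            "SELECT il_to FROM imagelinks WHERE il_from = ? ORDER BY il_to",
            (page_id,))]
        # C3: the page image(s) of the page.
        entry["page_images"] += [r[0] for r in con.execute(
            "SELECT pp_value FROM page_props WHERE pp_page = ? "
            "AND pp_propname IN ('page_image', 'page_image_free') "
            "ORDER BY pp_value", (page_id,))]
    return result
```

Image ID and description (C42, C43) are left out here; they would need further joins that depend on the schema.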
[] D - retrieve images from Commons by querying with the labels from (B)
-D1 - [PYTHON] use the Commons API to retrieve images with query = label (B13) + location (B14)
-D2 - [SQL] retain properties:
--D21 - image name
--D22 - image ID
--D23 - image description
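A sketch of D1, assuming the MediaWiki search API (`action=query&list=search`) on Commons restricted to the File namespace (6). Treating the location as a place-name string is an assumption, the function names are hypothetical, and the search `snippet` is only a rough stand-in for the image description (D23).

```python
import json
import urllib.parse
import urllib.request

COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def build_search_params(label, location="", limit=20):
    """Build API parameters for a full-text search in the File namespace."""
    query = " ".join(part for part in (label.strip(), location.strip()) if part)
    return {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srnamespace": "6",   # namespace 6 = File
        "srlimit": str(limit),
        "format": "json",
    }


def parse_search_results(response):
    """Keep name, ID, and snippet for each hit in a decoded API response."""
    hits = response.get("query", {}).get("search", [])
    return [
        {
            "image_name": hit["title"],
            "image_id": hit["pageid"],
            # Search excerpt only, not the full description.
            "image_description": hit.get("snippet", ""),
        }
        for hit in hits
    ]


def search_commons(label, location="", limit=20):
    """Send the request (needs network access) and parse the result."""
    url = COMMONS_API + "?" + urllib.parse.urlencode(
        build_search_params(label, location, limit))
    # Placeholder User-Agent; replace with real contact details.
    request = urllib.request.Request(
        url, headers={"User-Agent": "image-candidates-sketch/0.1 (example)"})
    with urllib.request.urlopen(request, timeout=30) as resp:
        return parse_search_results(json.load(resp))
```

Keeping parameter building and parsing separate from the network call lets both be checked offline.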
[] E - (retrieve Flickr images by querying with the labels from (B))
[] F - build a quality model
-F1 - [PYTHON/Tensorflow] build a deep learning model based on quality images
-F2 - [PYTHON] extend model with features including
--F21 - size
--F22 - compression quality
--F23 - text features (richness, readability)
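One way the extra features in F2 might be computed, as a sketch only. The concrete proxies are assumptions, not part of the plan: megapixels for size, bytes per pixel for compression quality, type-token ratio for richness, and words per sentence as a crude readability measure.

```python
import re


def text_features(description):
    """Crude richness/readability proxies for an image description."""
    # Runs of letters count as words; digits and punctuation are ignored.
    words = re.findall(r"[^\W\d_]+", description.lower())
    sentences = [s for s in re.split(r"[.!?]+", description) if s.strip()]
    n_words = len(words)
    return {
        "n_words": n_words,
        # Share of distinct words: higher means richer vocabulary.
        "type_token_ratio": len(set(words)) / n_words if n_words else 0.0,
        # Longer sentences are assumed to be harder to read.
        "words_per_sentence": n_words / len(sentences) if sentences else 0.0,
    }


def extra_features(width, height, file_bytes, description):
    """Combine size, compression, and text proxies into one feature dict."""
    pixels = width * height
    features = {
        "megapixels": pixels / 1e6,
        # Fewer bytes per pixel suggests heavier compression.
        "bytes_per_pixel": file_bytes / pixels if pixels else 0.0,
    }
    features.update(text_features(description))
    return features
```

These values could be appended to the deep model's output from F1 before a final scoring step.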
[] G - select candidates in C-E by relevance
-G1 - [PYTHON] retain all images (C2) and page images (C3)
-G2 - [PYTHON] retain only those images retrieved from Commons (D) whose image name (D21) soft-matches the Wikidata label (B13)
-G3 - [PYTHON] filter out images according to additional computer vision tools
--G31 - face detector for instances of humans
--G32 - scene/object detectors for other instances
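The soft match in G2 is not specified further. One possible reading, sketched below, accepts a file name if it contains every word of the label or if the normalized strings are similar enough by `difflib.SequenceMatcher`. The 0.8 threshold and the function names are assumptions.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase; drop a 'File:' prefix, a file extension, and punctuation."""
    text = re.sub(r"^file:", "", text.strip(), flags=re.IGNORECASE)
    text = re.sub(r"\.[A-Za-z0-9]{2,5}$", "", text)
    text = re.sub(r"[\W_]+", " ", text.lower())
    return text.strip()


def soft_match(image_name, label, threshold=0.8):
    """Return True if the image name loosely matches the label."""
    name, wanted = normalize(image_name), normalize(label)
    if not name or not wanted:
        return False
    # Case 1: every word of the label occurs in the file name.
    if set(wanted.split()) <= set(name.split()):
        return True
    # Case 2: the strings are close overall (tolerates small typos).
    return SequenceMatcher(None, name, wanted).ratio() >= threshold


if __name__ == "__main__":
    print(soft_match("File:Eiffel_Tower_at_night.jpg", "Eiffel Tower"))
```

The threshold trades recall for precision and would need tuning on labeled examples.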
[] H - sort images in (G) by quality
-H1 - [BASH] download images filtered in (G)
-H2 - [PYTHON] assign quality score from (F) and rank
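H2 can be sketched as a plain sort once a scoring function exists. Here `score_fn` is a hypothetical placeholder for the quality model from (F), and the candidate records are whatever (G) produces.

```python
def rank_by_quality(candidates, score_fn):
    """Score each candidate and return (score, candidate) pairs, best first.

    score_fn stands in for the quality model from (F); a higher score
    is assumed to mean better quality.
    """
    scored = [(score_fn(candidate), candidate) for candidate in candidates]
    # Sort on the score only, so candidates never need to be comparable;
    # the sort is stable, so ties keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```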
[] I - evaluate:
-I1 - WikiShootMe
-I2 - Wikidata game
-I3 - Others?
Observations - bottlenecks:
-A1 requires domain expertise (e.g. for monuments)
-B1 processing needs to be distributed
-F1/G31 there is no GPU?
-H1 pixels are not available internally