[SPIKE] Determine general classes of available Commons images
Open, Needs Triage, Public

Description

Analyze all currently available Commons images and compile a list of general classes, such as place.
The goal is to populate the "what does this work contain?" question in Upload Wizard's Describe step; see the prototype and T358614: UX design for improving describe step in upload wizard.

Tasks

  • look for already available statistics
  • use the wmf_raw.mediawiki_page data lake table for file pages and eventually wmf_raw.mediawiki_categorylinks for categories.
    • map classes/categories to Wikidata items, as a nice to have
  • let a machine-learned model classify all available images

Event Timeline

mfossati changed the task status from Open to In Progress. Mar 8 2024, 2:17 PM
mfossati claimed this task.

Available resources found so far don't seem to help:

Top categories by file count through data lake tables:

from wmfdata.spark import create_session

spark = create_session(app_name='commons-analyzer', type='yarn-large', ship_python_env=True)
snapshot = '2024-02'
# Commons file pages (namespace 6), excluding redirects
q = f"SELECT page_id, page_title FROM wmf_raw.mediawiki_page WHERE snapshot='{snapshot}' AND wiki_db='commonswiki' AND page_namespace=6 AND page_is_redirect=0"
commons = spark.sql(q)
# Category links that originate from file pages
q = f"SELECT cl_from AS page_id, cl_to AS cat_title FROM wmf_raw.mediawiki_categorylinks WHERE snapshot='{snapshot}' AND wiki_db='commonswiki' AND cl_type='file'"
cats = spark.sql(q)
# Attach categories to files, then count files per category, largest first
ddf = commons.join(cats, on=['page_id'])
ddf = ddf.groupBy('cat_title').count().orderBy('count', ascending=False)
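
To sanity-check the join-and-count logic without a Spark session, the same aggregation can be sketched in plain Python on a toy sample. The page IDs and category titles below are made up, and `top_categories` is a hypothetical helper, not part of wmfdata.

```python
from collections import Counter


def top_categories(file_page_ids, category_links, n=10):
    """Count files per category, mirroring the join + groupBy + orderBy above."""
    file_ids = set(file_page_ids)
    # Inner-join semantics: links from pages outside the file set are dropped
    counts = Counter(cat for page_id, cat in category_links if page_id in file_ids)
    return [{'cat_title': cat, 'count': count} for cat, count in counts.most_common(n)]


# Toy data (made up): three file pages, plus one link from an unknown page
pages = [1, 2, 3]
links = [(1, 'Cat_A'), (2, 'Cat_A'), (2, 'Cat_B'), (3, 'Cat_A'), (99, 'Cat_B')]
result = top_categories(pages, links, n=2)
print(result)
```

The returned dictionaries use the same `cat_title` / `count` keys as the category listings in this comment.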

Totals

From https://commons.wikimedia.org/wiki/Special:MediaStatistics:

  • all files: 104 M (103,972,445)
  • all images: 95 M (95,179,760) - 91.5 %
    • bitmap: 92.7 M (92,681,901) - 89.1 %
    • SVG: 2.5 M (2,497,859) - 2.4 %

Curated categories

GLAM
NOTE: the following categories should hopefully not overlap.
{'cat_title': 'Media_contributed_by_the_Digital_Public_Library_of_America', 'count': 3798743} --> misc library stuff like old postcards, handwritings, ...
{'cat_title': 'Media_contributed_by_the_National_Archives_and_Records_Administration', 'count': 1523908}, --> scans of old b/w pictures, illustrations, several from wars
{'cat_title': 'Files_from_Gallica', 'count': 1445458}, --> book scans
{'cat_title': 'Images_from_the_Rijksdienst_voor_het_Cultureel_Erfgoed', 'count': 486294},
{'cat_title': 'Images_from_the_National_Archives_and_Records_Administration', 'count': 468305},
{'cat_title': 'Media_contributed_by_the_North_Carolina_Digital_Heritage_Center', 'count': 461836},
{'cat_title': 'Media_contributed_by_National_Archives_at_Washington,_DC_-_Textual_Reference', 'count': 439863},
{'cat_title': 'Rijksmonumenten_with_known_IDs', 'count': 432815},
{'cat_title': 'Images_from_Nationaal_Archief', 'count': 417834},
{'cat_title': 'Images_from_Metropolitan_Museum_of_Art', 'count': 388831},
{'cat_title': 'Library_of_Congress-no_known_copyright_restrictions',  'count': 388388},
{'cat_title': 'Media_contributed_by_Abilene_Library_Consortium',  'count': 382345},
{'cat_title': 'Media_contributed_by_Columbus_Metropolitan_Library',  'count': 337246},
{'cat_title': 'Images_from_Paris_Musées', 'count': 326270},
{'cat_title': 'Media_contributed_by_National_Archives_at_College_Park_-_Textual_Reference',  'count': 326237},
{'cat_title': 'Maps_in_the_Library_of_Congress', 'count': 323319},
{'cat_title': 'Images_from_the_New_York_Public_Library', 'count': 318317},
{'cat_title': 'Files_from_the_Historic_American_Buildings_Survey',  'count': 309986},
{'cat_title': 'Files_from_the_Biodiversity_Heritage_Library',  'count': 304090},
{'cat_title': 'Media_donated_by_Naturalis_Biodiversity_Center',  'count': 277678},
{'cat_title': 'Media_contributed_by_Indiana_Memory', 'count': 267105},
{'cat_title': 'Scans_from_the_China_Academic_Digital_Associative_Library',  'count': 224845},
{'cat_title': 'Images_from_the_Swedish_National_Heritage_Board',  'count': 185584},
NOTE: these ones may overlap.
{'cat_title': 'Artworks_with_known_accession_number', 'count': 1916375}, --> scans of illustrations, book pages, drawings, old pictures
{'cat_title': 'Artworks_with_Wikidata_item', 'count': 691731},

Total: 16.4 M (16,443,403) - 17.2 % of all images

Landscape
{'cat_title': 'Images_from_Geograph_Britain_and_Ireland', 'count': 6192018} --> landscapes, monuments like castles & churches
{'cat_title': 'Photos_from_Panoramio', 'count': 2314923} --> similar to first cat

Total: 8.5 M (8,506,941) - 8.9 %

Space
NOTE: all files below seem to be uploaded by Askeuhd and their bot AskeBot.
NOTE: the count is a lower bound, since several other space missions are available. See bot's table.
{'cat_title': 'Images_from_the_Earth_Science_and_Remote_Sensing_Unit,_Lyndon_B._Johnson_Space_Center', 'count': 3271990}, --> satellite pics
{'cat_title': 'ISS_Expedition_53_Crew_Earth_Observations_(dump)', 'count': 421630},
{'cat_title': 'ISS_Expedition_65_Crew_Earth_Observations_(dump)',  'count': 375243},
{'cat_title': 'Earth_at_night_seen_from_space', 'count': 302472},
{'cat_title': 'Daytime_Earth_viewed_from_space', 'count': 300470},
{'cat_title': 'ISS_Expedition_30_Crew_Earth_Observations_(dump)',  'count': 261039},
{'cat_title': 'ISS_Expedition_67_Crew_Earth_Observations_(dump)',  'count': 253049},
{'cat_title': 'ISS_Expedition_42_Crew_Earth_Observations_(dump)',  'count': 212592},

Total: 5.4 M (5,398,485) - 5.6 %

Monuments
WLM
NOTE: the following categories should hopefully not overlap.
{'cat_title': 'Images_from_Wiki_Loves_Monuments_2012', 'count': 363123},
{'cat_title': 'Images_from_Wiki_Loves_Monuments_2013', 'count': 368468},
{'cat_title': 'Images_from_Wiki_Loves_Monuments_2014', 'count': 320471},
{'cat_title': 'Images_from_Wiki_Loves_Monuments_2015', 'count': 230214},
{'cat_title': 'Images_from_Wiki_Loves_Monuments_2016', 'count': 276014},
{'cat_title': 'Images_from_Wiki_Loves_Monuments_2017', 'count': 243192},
{'cat_title': 'Images_from_Wiki_Loves_Monuments_2018', 'count': 258989},
{'cat_title': 'Images_from_Wiki_Loves_Monuments_2019', 'count': 211605},
{'cat_title': 'Images_from_Wiki_Loves_Monuments_2020', 'count': 230408},
{'cat_title': 'Images_from_Wiki_Loves_Monuments_2023', 'count': 217110},

Total: 2.7 M (2,719,594) - 2.8 %

Cultural heritage
NOTE: these may overlap with the above ones.
{'cat_title': 'Cultural_heritage_monuments_in_Russia_with_known_IDs',  'count': 330567},
{'cat_title': 'Cultural_heritage_monuments_in_France_with_known_IDs',  'count': 258774},
{'cat_title': 'Cultural_heritage_monuments_in_Italy_with_known_IDs',  'count': 243580},
{'cat_title': 'Cultural_heritage_monuments_in_Spain_with_known_IDs',  'count': 198330},
{'cat_title': 'Cultural_heritage_monuments_in_Austria_with_known_IDs',  'count': 186456},
{'cat_title': 'Buildings_with_addresses', 'count': 746835},

Total: 1.9 M (1,964,542) - 2 %

Grand total: 4.6 M (4,684,136) - 4.8 %
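
Since the cultural heritage categories may overlap with the WLM ones, summing per-category counts gives an upper bound on the number of distinct monument files. A minimal sketch of the difference, using made-up page IDs; `distinct_file_count` is a hypothetical helper. On the data lake tables, the equivalent would presumably be a distinct count of `page_id` after filtering on the category titles.

```python
def distinct_file_count(category_to_page_ids):
    """Number of distinct files across categories (size of the set union)."""
    distinct = set()
    for page_ids in category_to_page_ids.values():
        distinct.update(page_ids)
    return len(distinct)


# Toy data (made up): files 2 and 3 sit in both categories
toy = {
    'Images_from_Wiki_Loves_Monuments_2012': {1, 2, 3},
    'Cultural_heritage_monuments_in_Italy_with_known_IDs': {2, 3, 4},
}
naive_sum = sum(len(ids) for ids in toy.values())  # counts files 2 and 3 twice
distinct = distinct_file_count(toy)
print(naive_sum, distinct)
```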

Other
{'cat_title': 'Scans_from_the_Internet_Archive', 'count': 1580935}, --> book scans
{'cat_title': 'Images_uploaded_by_Fæ', 'count': 3632767}, --> misc projects, a lot of book scans
{'cat_title': 'Images_with_watermarks', 'count': 603229},
{'cat_title': 'Uploaded_with_Mobile/Android', 'count': 382066},
{'cat_title': 'Flickr_images_reviewed_by_trusted_users', 'count': 352658},
{'cat_title': 'Flickr_images_uploaded_by_Flickr_upload_bot', 'count': 251280},
{'cat_title': 'Media_lacking_author_information', 'count': 216763},
{'cat_title': 'With_trademark', 'count': 192469}

Ranking
  1. GLAM: 17.2 %
  2. Landscape: 8.9 %
  3. Space: 5.6 %
  4. Monuments: 4.8 %
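
The ranking can be recomputed from the class totals and the count of all images quoted above. A minimal sketch; `rank_classes` is a hypothetical helper, and recomputed shares may differ slightly from the listed ones depending on how the percentages were rounded.

```python
ALL_IMAGES = 95_179_760  # all images, from Special:MediaStatistics as quoted above

# Class totals as reported in the sections above
class_totals = {
    'GLAM': 16_443_403,
    'Landscape': 8_506_941,
    'Space': 5_398_485,
    'Monuments': 4_684_136,
}


def rank_classes(totals, denominator):
    """Sort classes by file count (descending) and attach their percentage share."""
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(name, count, 100 * count / denominator) for name, count in ranked]


ranking = rank_classes(class_totals, ALL_IMAGES)
for position, (name, count, share) in enumerate(ranking, start=1):
    print(f"{position}. {name}: {count:,} files, {share:.2f} % of all images")
```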

Pasting the high-level takeaway from the category analysis here for readability:

Ranking
  1. GLAM: 17.2 %
  2. Landscape: 8.9 %
  3. Space: 5.6 %
  4. Monuments: 4.8 %

mfossati moved this task from Doing to Blocked on the Structured-Data-Backlog (Current Work) board.

Moving to blocked: waiting for image downloads to complete before running the third analysis task.

Moving to the backlog: the team agreed that the spike isn't directly actionable now, but we may want to pick it up later to do something with the complete dataset of Commons images.

Outcome of a quick investigation on available pre-trained models that may fit our use case:

  • it seems that pre-training is generally done on standard benchmark datasets, check out this list
  • keras offers models pre-trained on the following datasets:
    dataset | tasks | # classes | fit
    ImageNet-1k [1, 2] | image classification | 1,000 |
    COCO | object detection, segmentation | 80 (objects) + 91 (stuff) | TODO try out a model
    SA1B | segmentation | none | unlikely
  • it may be worth looking for models trained on CIFAR-100, which has 100 classes grouped into 20 super-classes
  • need to explore Hugging Face's models
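
Whichever pre-trained model is picked, its fine-grained labels (1,000 for ImageNet-1k) would still need to be collapsed into a handful of general classes for the Describe step. A minimal sketch of that aggregation step, assuming a hand-made label-to-class mapping; the label strings, the mapping, and the `predictions` sample are all illustrative, not output from a real model.

```python
from collections import Counter

# Hypothetical mapping from fine-grained model labels to general classes
LABEL_TO_CLASS = {
    'castle': 'place',
    'church': 'place',
    'valley': 'place',
    'tabby cat': 'animal',
    'golden retriever': 'animal',
    'book jacket': 'document',
}


def general_class_counts(predicted_labels, mapping, fallback='other'):
    """Count images per general class; unmapped labels fall into `fallback`."""
    return Counter(mapping.get(label, fallback) for label in predicted_labels)


# Made-up per-image predictions, one label per image
predictions = ['castle', 'church', 'tabby cat', 'book jacket', 'castle', 'espresso']
counts = general_class_counts(predictions, LABEL_TO_CLASS)
print(counts.most_common())
```

The `fallback` bucket makes it visible how many predictions the mapping does not cover yet.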

Would it be possible to provide a downloadable dump of the thumbnail images? Even better if they were not small thumbnails but at least 1024px, still JPEG-compressed so that they stay reasonably sized.

Aklapper changed the task status from In Progress to Open. Apr 11 2025, 10:19 PM

Resetting task status from "In Progress" to "Open" as this task has been "in progress" for more than one year (see T380300). Feel free to set that status again, or rather break down into smaller subtasks.