Currently only 3.8M of Wikimedia Commons' 100M photos have P571 (inception), i.e. a creation date, in machine-readable form. There are other date-related properties besides P571, but this still shows that the majority of photos don't have date information in structured data. For this reason we need to compile our own page-year dataset from the category links data.
Step 1 - parse the category names data
Parse the year from category names using a Python script. The easiest way is to parse the data as a stream, line by line from stdin, and write the result to stdout.
List of category names can be downloaded from database backup dumps.
*Todo:*
- Parse years from the category names in category.sql.gz. Simple cases are:
- "YYYY*in*" (example: "1933_in_cricket")
- "YYYY-MM-DD" (example: "Wikimeetup_in_Vologda_2009-07-19")
Wanted output format
One category per line, with a tab as the separator between values (i.e. TSV)
cat_id	year	cat_title
866799	1999	"1999_in_Spain"
878558	2009	"Wikimeetup_in_Vologda_2009-07-19"
Example for getting the data:
curl -o - "https://dumps.wikimedia.org/commonswiki/20250320/commonswiki-20250320-category.sql.gz" | gzip -dc | grep "INSERT INTO" | sed "s/),(/\n/g" | head -n 1000
1119,'1347',7,4,3
1120,'1348',8,7,1
1121,'1348_deaths',69,65,4
1122,'1349',19,7,12
...
Accessing the data using Quarry: [https://quarry.wmcloud.org/query/92397 92397]
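A minimal sketch of the step 1 script could look like the following. It assumes the input is one `cat_id,'cat_title',pages,subcats,files` row per line, as produced by the curl/gzip/grep/sed pipeline above; the script name and the exact regexes are illustrative choices, not a fixed specification.

```python
import re
import sys

# Assumed row format from the dump pipeline: cat_id,'cat_title',pages,subcats,files
ROW_RE = re.compile(r"^(\d+),'((?:[^'\\]|\\.)*)'")
# The two simple cases from the Todo list:
YEAR_IN_RE = re.compile(r"^(1\d{3}|20\d{2})_in_")      # e.g. "1933_in_cricket"
DATE_RE = re.compile(r"(1\d{3}|20\d{2})-\d{2}-\d{2}")  # e.g. "..._2009-07-19"

def year_of(title):
    """Return a 4-digit year string parsed from a category title, or None."""
    m = YEAR_IN_RE.match(title) or DATE_RE.search(title)
    return m.group(1) if m else None

def parse(lines):
    """Yield (cat_id, year, cat_title) for rows whose title contains a year."""
    for line in lines:
        m = ROW_RE.match(line.strip())
        if not m:
            continue
        cat_id, title = m.groups()
        year = year_of(title)
        if year:
            yield cat_id, year, title

if __name__ == "__main__":
    # Stream usage: <dump pipeline> | python3 parse_category_years.py > category_years.tsv
    for cat_id, year, title in parse(sys.stdin):
        sys.stdout.write(f'{cat_id}\t{year}\t"{title}"\n')
```

Titles like `1348_deaths` are deliberately skipped here; extending the regex list is where most of the tuning work would happen.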
Step 2 - combine the year data with pages
Read the links between categories and pages from the categorylinks table dump (WARNING: it is a big file)
*Todo:*
Write a Python script which parses the categorylinks data and combines it with the year information obtained from parsing the category table in step 1. The easiest way is to parse the data as a stream, line by line from stdin, and write the result to stdout.
Wanted output format
One page per line, with a tab as the separator between values (i.e. TSV)
page_id	year
291	1999
292	1935
293	2004
Example for getting the data:
curl -o - "https://dumps.wikimedia.org/commonswiki/20250320/commonswiki-20250320-categorylinks.sql.gz" | gzip -dc | grep "INSERT INTO" | sed "s/),(/\n/g" | head -n 1000
293,'Interior_of_the_Kyoto_Station_building_(1997)','KYOTOSTATION1.JPG','2025-03-06 13:32:35','','uppercase','file',0,106531043
293,'License_migration_completed','KYOTOSTATION1.JPG\nKYOTOSTATION1.JPG','2025-03-06 13:32:15','Kyotostation1.jpg','uppercase','file',0,4339362
293,'Media_missing_infobox_template','KYOTOSTATION1.JPG','2025-03-06 13:32:23','','uppercase','file',0,5133438
293,'Photographs_taken_on_2004-07-29','KYOTOSTATION1.JPG','2025-03-06 13:32:35','','uppercase','file',0,26307660
Accessing data using Quarry: [https://quarry.wmcloud.org/query/92414 92414]
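A sketch of the step 2 combiner is below. It assumes the step 1 output format (`cat_id\tyear\t"cat_title"`) and categorylinks rows where the first two fields are `cl_from` (page id) and `cl_to` (category title), as in the sample above; function names and the invocation are illustrative.

```python
import re
import sys

# Assumed row prefix from the dump pipeline: cl_from,'cl_to',...
ROW_RE = re.compile(r"^(\d+),'((?:[^'\\]|\\.)*)'")

def load_years(tsv_lines):
    """Build {category_title: year} from the step 1 TSV (cat_id, year, "title")."""
    years = {}
    for line in tsv_lines:
        cat_id, year, title = line.rstrip("\n").split("\t")
        years[title.strip('"')] = year
    return years

def pages_with_years(link_lines, years):
    """Yield (page_id, year) for pages that belong to a category with a year."""
    for line in link_lines:
        m = ROW_RE.match(line.strip())
        if not m:
            continue
        page_id, cat_title = m.groups()
        year = years.get(cat_title)
        if year:
            yield page_id, year

if __name__ == "__main__" and len(sys.argv) > 1:
    # Stream usage: <dump pipeline> | python3 combine.py category_years.tsv > page_years.tsv
    with open(sys.argv[1]) as f:
        years = load_years(f)
    for page_id, year in pages_with_years(sys.stdin, years):
        print(f"{page_id}\t{year}")
```

Note that one page may yield several (page_id, year) rows if it belongs to multiple dated categories; deduplication or conflict resolution is left to the analysis step.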
Step 3 - more complex analysis
For more complex analysis of the year extraction, the data would need to be imported into MariaDB. Those familiar with it can try to write a script which actually follows the category structure and uses it for detecting years, instead of just stream parsing. The benefit of this approach is that we could separate real years from random numbers which happen to be part of category names.
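One way such a category-structure walk could look, sketched in Python rather than SQL: walk upward from a page's direct categories through a subcategory-to-parent mapping (which could be built from categorylinks rows with cl_type = 'subcat', or queried from MariaDB) until a category with a known year is found. All names and the depth limit here are hypothetical.

```python
from collections import deque

def find_year(start_cats, parents, cat_years, max_depth=3):
    """Breadth-first walk up the category graph from a page's direct
    categories, returning the first year found on an ancestor category,
    or None. `parents` maps category -> list of parent categories;
    `cat_years` is the category-to-year mapping from step 1."""
    seen = set(start_cats)
    queue = deque((cat, 0) for cat in start_cats)
    while queue:
        cat, depth = queue.popleft()
        if cat in cat_years:
            return cat_years[cat]
        if depth < max_depth:
            for parent in parents.get(cat, []):
                if parent not in seen:
                    seen.add(parent)
                    queue.append((parent, depth + 1))
    return None
```

Because the walk only accepts years attached to categories, a number that is merely part of a category name (and never recognized as a year in step 1) cannot leak into the result; the depth limit guards against Commons' very deep and occasionally cyclic category graph.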
Other info: