Currently only 3.8M of Wikimedia Commons' 100M photos have a P571 (inception), i.e. creation date, value in machine-readable form. There are other date properties in addition to P571, but it still shows that the majority of the photos don't have date information in structured data form.
However, the majority of photos which have an applicable date documented in the photo's information template also have one or more year categories. These year categories can be parsed to get normalized year information for filtering purposes.
**Step 1 - parse the category names data**
Parse the year from category names using a Python script. The easiest way would be to parse the names as a stream, line by line from `stdin`, and output the result to `stdout` (a sketch is given at the end of this step).
A list of category names can be downloaded from the [[ https://dumps.wikimedia.org/backup-index.html | database backup dumps ]].
* https://dumps.wikimedia.org/commonswiki/20250320/commonswiki-20250320-category.sql.gz (260 MB)
*Todo:*
- Parse years from the category names in `category.sql.gz`. Simple cases are:
-- "YYYY*in*" (example: "1933_in_cricket")
-- "YYYY-MM-DD" (example: "Wikimeetup_in_Vologda_2009-07-19")
**Wanted output format**
One category per line, with `tab` as the separator between values (i.e. `tsv`):
```
cat_id year cat_title
866799 1999 "1999_in_Spain"
878558 2009 "Wikimeetup_in_Vologda_2009-07-19"
```
Example for getting the data:
```
curl -o - "https://dumps.wikimedia.org/commonswiki/20250320/commonswiki-20250320-category.sql.gz" |gzip -dc |grep "INSERT INTO" |sed "s/),(/\n/g" |head -n 1000
```
```
1119,'1347',7,4,3
1120,'1348',8,7,1
1121,'1348_deaths',69,65,4
1122,'1349',19,7,12
...
```
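A minimal sketch of such a parser is below (a hypothetical `parse_category_years.py`, not an existing tool). It assumes the tuples arrive one per line from the pipeline above, covers only the two simple cases from the Todo list, and handles SQL escaping in titles only superficially:
```
#!/usr/bin/env python3
# Hypothetical sketch: parse_category_years.py
# Reads category table tuples, one per line, from stdin (as produced by the
# curl | gzip | grep | sed pipeline above) and writes
# "cat_id<TAB>year<TAB>cat_title" to stdout for the two simple cases.
import re
import sys

# cat_id plus the single-quoted, SQL-escaped cat_title at the start of a tuple
ROW_RE = re.compile(r"(\d+),'((?:[^'\\]|\\.)*)'")
# Case 1: "YYYY*in*", e.g. "1933_in_cricket"
YEAR_IN_RE = re.compile(r"^(1[0-9]{3}|20[0-9]{2}).*_in_")
# Case 2: "YYYY-MM-DD" anywhere, e.g. "Wikimeetup_in_Vologda_2009-07-19"
DATE_RE = re.compile(r"(1[0-9]{3}|20[0-9]{2})-[01][0-9]-[0-3][0-9]")

def extract_year(title):
    m = YEAR_IN_RE.match(title) or DATE_RE.search(title)
    return m.group(1) if m else None

print("cat_id\tyear\tcat_title")
for line in sys.stdin:
    line = line.strip()
    # the first tuple of each statement still carries the INSERT prefix
    if line.startswith("INSERT INTO"):
        line = line.split("VALUES (", 1)[-1]
    m = ROW_RE.match(line)
    if not m:
        continue
    cat_id, cat_title = m.group(1), m.group(2)
    year = extract_year(cat_title)
    if year:
        print(f'{cat_id}\t{year}\t"{cat_title}"')
```
It could replace `head -n 1000` at the end of the pipeline above, with the output redirected to e.g. `category_years.tsv` for use in step 2.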
**Step 2 - combine year data with pages**
Read the links between categories and pages from the categorylinks table dump (WARNING: it is a big file):
- https://dumps.wikimedia.org/commonswiki/20250320/commonswiki-20250320-categorylinks.sql.gz (13 GB)
*Todo:*
Write a Python script which parses the categorylinks data and combines it with the year information obtained from parsing the `category` table data in step 1. The easiest way would be to parse it as a stream, line by line from `stdin`, and output the result to `stdout` (see the sketch at the end of this step).
**Wanted output format**
One page per line, with `tab` as the separator between values (i.e. `tsv`):
```
page_id year
291 1999
292 1935
293 2004
```
Example for getting the data:
curl -o - "https://dumps.wikimedia.org/commonswiki/20250320/commonswiki-20250320-categorylinks.sql.gz" |gzip -dc |grep "INSERT INTO" |sed "s/),(/\n/g" |head -n 1000
```
293,'Interior_of_the_Kyoto_Station_building_(1997)','KYOTOSTATION1.JPG','2025-03-06 13:32:35','','uppercase','file',0,106531043
293,'License_migration_completed','KYOTOSTATION1.JPG\nKYOTOSTATION1.JPG','2025-03-06 13:32:15','Kyotostation1.jpg','uppercase','file',0,4339362
293,'Media_missing_infobox_template','KYOTOSTATION1.JPG','2025-03-06 13:32:23','','uppercase','file',0,5133438
293,'Photographs_taken_on_2004-07-29','KYOTOSTATION1.JPG','2025-03-06 13:32:35','','uppercase','file',0,26307660
```
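A minimal sketch of the combining script (a hypothetical `combine_page_years.py`). It assumes the step 1 output is available as a TSV file and that the categorylinks tuples arrive one per line on stdin; the join is done on the category title (`cl_to`), only the first matching year per page is kept, and the set of seen page ids is held in memory, which is only acceptable if it fits in RAM:
```
#!/usr/bin/env python3
# Hypothetical sketch: combine_page_years.py
# Usage (file names are placeholders):
#   ... |sed "s/),(/\n/g" |python3 combine_page_years.py category_years.tsv > page_years.tsv
# Joins categorylinks tuples read from stdin with the category->year TSV
# produced in step 1 and writes "page_id<TAB>year" to stdout.
import re
import sys

# cl_from (page_id) plus the single-quoted cl_to (category title)
ROW_RE = re.compile(r"(\d+),'((?:[^'\\]|\\.)*)'")

# Load the step 1 output: cat_id <TAB> year <TAB> cat_title
year_by_category = {}
with open(sys.argv[1], encoding="utf-8") as f:
    for row in f:
        parts = row.rstrip("\n").split("\t")
        if len(parts) != 3 or parts[0] == "cat_id":
            continue  # skip the header and malformed rows
        year_by_category[parts[2].strip('"')] = parts[1]

print("page_id\tyear")
seen_pages = set()
for line in sys.stdin:
    line = line.strip()
    # the first tuple of each statement still carries the INSERT prefix
    if line.startswith("INSERT INTO"):
        line = line.split("VALUES (", 1)[-1]
    m = ROW_RE.match(line)
    if not m:
        continue
    page_id, cat_title = m.group(1), m.group(2)
    year = year_by_category.get(cat_title)
    # keep only the first year category seen for each page
    if year and page_id not in seen_pages:
        seen_pages.add(page_id)
        print(f"{page_id}\t{year}")
```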
**Step 3 - follow the category structure**
If we would like to do more complex analysis for extracting year data, the dumps would need to be imported into MariaDB. Anyone familiar with it could try to make a script which actually follows the [[ https://commons.wikimedia.org/wiki/Category:Categories_by_year | category structure ]] and uses it for detecting years instead of just stream parsing. The benefit of this approach is that we could separate real years from random numbers which happen to be part of category names (a rough sketch is given below).
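As an illustration only, here is a rough sketch of the walk. All names are hypothetical: it assumes a `subcats.tsv` file with one `parent_title<TAB>child_title` row per subcategory link, which could be produced by joining `categorylinks` rows with `cl_type='subcat'` against the `page` table, or by querying MariaDB directly. In practice the walk would also need a depth limit or a whitelist of intermediate categories so it does not drift into topical subtrees:
```
#!/usr/bin/env python3
# Rough sketch of the category-structure approach (file name is hypothetical).
import re
from collections import defaultdict, deque

# parent_title <TAB> child_title, one row per subcategory link
children = defaultdict(list)
with open("subcats.tsv", encoding="utf-8") as f:
    for line in f:
        parent, child = line.rstrip("\n").split("\t")
        children[parent].append(child)

YEAR_RE = re.compile(r"(1[0-9]{3}|20[0-9]{2})")

# Breadth-first walk down from the root of the year hierarchy; a 4-digit
# number in the title of a category reached this way is treated as a real
# year rather than a random number.
queue = deque(["Categories_by_year"])
seen = set(queue)
while queue:
    cat = queue.popleft()
    m = YEAR_RE.search(cat)
    if m:
        print(f"{cat}\t{m.group(1)}")
    for child in children.get(cat, []):
        if child not in seen:
            seen.add(child)
            queue.append(child)
```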