
Parse year information for the pages from database dumps
Open, Needs TriagePublic

Description

Currently only 3.8M of Wikimedia Commons' 100M photos have a P571 (inception), i.e. creation date, value in machine-readable form. There are other date properties in addition to P571, but it still shows that the majority of photos don't have date information in structured data form. For this reason we need to compile our own page-year dataset from the category links data.

Step 1 - parse the category names data
Parse the year from category names using a Python script. The easiest way would be to parse the input as a stream, line by line from stdin, and write the result to stdout.
List of category names can be downloaded from database backup dumps.

*Todo:*

  • Parse years from category names from category.sql.gz. Simple cases are:
    • "YYYY*in*" (example: "1933_in_cricket")
    • "YYYY-MM-DD" (example: "Wikimeetup_in_Vologda_2009-07-19")

Wanted output format
One category per line, with tab as the separator between values (i.e. TSV)

cat_id year cat_title
866799 1999 "1999_in_Spain"
878558 2009 "Wikimeetup_in_Vologda_2009-07-19"
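The two simple patterns from the todo above can be sketched as a small helper (a minimal sketch; the function name `extract_year` and the ordering of the two patterns are our choices, not part of the task):

```python
import re

# Pattern 1: category starts with "YYYY_in_", e.g. "1933_in_cricket"
YEAR_IN = re.compile(r"^(\d{4})_in_")
# Pattern 2: category contains a full "YYYY-MM-DD" date,
# e.g. "Wikimeetup_in_Vologda_2009-07-19"
FULL_DATE = re.compile(r"(\d{4})-\d{2}-\d{2}")

def extract_year(cat_title):
    """Return the year as a string for the two simple cases, else None."""
    m = YEAR_IN.match(cat_title) or FULL_DATE.search(cat_title)
    return m.group(1) if m else None
```

Note that a category like "1348_deaths" is deliberately not matched here; broader patterns can be added once the simple cases work.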

Example for getting the data:

curl -o - "https://dumps.wikimedia.org/commonswiki/20250320/commonswiki-20250320-category.sql.gz"  |gzip -dc |grep "INSERT INTO" |sed "s/),(/\n/g" |head -n 1000
1119,'1347',7,4,3
1120,'1348',8,7,1
1121,'1348_deaths',69,65,4
1122,'1349',19,7,12
...

Accessing data using Quarry: [https://quarry.wmcloud.org/query/92397# 92397]

Step 2 - combine year data to pages
Read the links between categories and pages from the categorylinks table dump (WARNING: it is a big file)

*Todo*
Write a Python script which parses the categorylinks data and combines it with the year information produced by parsing the category table in step 1. The easiest way would be to parse the input as a stream, line by line from stdin, and write the result to stdout.

Wanted output format
One page per line, with tab as the separator between values (i.e. TSV)

page_id year
291 1999
292 1935
293 2004

Example for getting the data:

curl -o - "https://dumps.wikimedia.org/commonswiki/20250320/commonswiki-20250320-categorylinks.sql.gz" |gzip -dc |grep "INSERT INTO" |sed "s/),(/\n/g" |head -n 1000

293,'Interior_of_the_Kyoto_Station_building_(1997)','KYOTOSTATION1.JPG','2025-03-06 13:32:35','','uppercase','file',0,106531043
293,'License_migration_completed','KYOTOSTATION1.JPG\nKYOTOSTATION1.JPG','2025-03-06 13:32:15','Kyotostation1.jpg','uppercase','file',0,4339362
293,'Media_missing_infobox_template','KYOTOSTATION1.JPG','2025-03-06 13:32:23','','uppercase','file',0,5133438
293,'Photographs_taken_on_2004-07-29','KYOTOSTATION1.JPG','2025-03-06 13:32:35','','uppercase','file',0,26307660

Accessing data using Quarry: [https://quarry.wmcloud.org/query/92414 92414]
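The step 2 join can be sketched as: load the cat_title → year map from the step 1 TSV, then stream the categorylinks rows and emit page_id/year pairs. This is a minimal sketch under the assumption that rows arrive one per line in the `cl_from,'cl_to',...` shape shown above; the function names are illustrative, and the plain comma split is naive (a robust version must handle commas and escaped quotes inside values):

```python
def load_year_map(lines):
    """Build cat_title -> year from step 1 TSV lines (cat_id, year, cat_title)."""
    years = {}
    for line in lines:
        cat_id, year, cat_title = line.rstrip("\n").split("\t", 2)
        years[cat_title.strip('"')] = year
    return years

def combine(year_map, lines):
    """Yield "page_id<TAB>year" for categorylinks rows (cl_from,'cl_to',...)."""
    for line in lines:
        parts = line.rstrip("\n").split(",", 2)
        if len(parts) < 2:
            continue
        page_id = parts[0]
        cat_title = parts[1].strip("'")
        year = year_map.get(cat_title)
        if year:
            yield f"{page_id}\t{year}"
```

A page belonging to several year categories will be emitted once per matching category; deduplication policy (first match, min year, etc.) is left open here.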

Step 3
If we want to do more complex analysis for extracting years, the data would need to be imported into MariaDB. Someone familiar with it could try to write a script which actually follows the category structure and uses it for detecting years, instead of just stream parsing. The benefit of this approach is that we could separate real years from random numbers which are part of the category names.
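Following the category structure could be sketched as a bounded breadth-first walk up the category graph: if a page's direct categories carry no year, look at their parent categories. This is only a sketch; `parents` and `year_of` are hypothetical lookups that would be built from the categorylinks and category dumps (or queried from MariaDB), and the depth limit is an arbitrary safeguard against Commons' deep category tree:

```python
from collections import deque

def year_via_parents(cat, parents, year_of, max_depth=3):
    """Walk child -> parent categories breadth-first and return the first
    year found, or None. `parents` maps a category to its parent categories;
    `year_of` maps year-bearing categories to their year (from step 1)."""
    seen = {cat}
    queue = deque([(cat, 0)])
    while queue:
        current, depth = queue.popleft()
        if current in year_of:
            return year_of[current]
        if depth < max_depth:
            for parent in parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    queue.append((parent, depth + 1))
    return None
```

Breadth-first order means the year closest to the page in the category tree wins, which helps avoid picking up unrelated numbers from distant ancestor categories.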

Other info:

Event Timeline

Zache updated the task description. (Show Details)
# parse_categories.py
import re
import gzip
import sys

def extract_year(cat_title):
    # Rough year filter: a 4-digit number starting with 1 or 2, either
    # standalone (covers "YYYY_in_...") or as part of a "YYYY-MM-DD" date
    match = re.search(r"\b([12]\d{3})(?:-\d{2}-\d{2})?\b", cat_title)
    if match:
        return match.group(1)
    return None

# Matches one row "(cat_id,'cat_title',...": tolerates commas and escaped
# quotes inside the title, which a plain split(",") would break on
ROW_RE = re.compile(r"\((\d+),'((?:[^'\\]|\\.)*)'")

def process_insert_line(line):
    line = line.strip()
    if not line.startswith("INSERT INTO"):
        return []
    results = []
    for cat_id, cat_title in ROW_RE.findall(line):
        year = extract_year(cat_title)
        if year:
            results.append(f"{cat_id}\t{year}\t{cat_title}")
    return results

def main():
    with gzip.open(sys.argv[1], "rt", encoding="utf-8", errors="ignore") as f:
        for line in f:
            results = process_insert_line(line)
            if results:
                print("\n".join(results))

if __name__ == "__main__":
    main()

I was able to produce the parsed dataset using the code in this file.

head -n 10 categories_with_years.tsv

152	1988	Ana_Maria_Ventura_como_Altisidora_en_\"La_enamorada_del_rey\"_(1988).jpg',NULL
691	2003	Ao_Thong_Nai_Pan,_Koh_Phangan,_Thailand_02-2003.jpg',NULL
906	1892	Aivazovsky, Ivan; Maria; 1892',NULL
907	1848	Aivazovsky, Ivan; Brig Mercury 1848; Russian Museum',NULL
918	1855	Aivazovsky, Ivan; Moonlit Odessa 1855; Kaluga Museum of Fine Arts',NULL
986	2013	Afghanistan_-_Location_Map_(2013)_-_AFG_-_UNOCHA.svg',NULL
1216	2018	Amrumer_Windmühle_(2018).jpg',NULL
1324	1805	1805, Joséphine de Beauharnais',NULL
1354	1817	Cornelis_Springer_(1817-1891),_by_Nicolaas_Pieneman.jpg',NULL
1410	2004	Grimsdalen_Rondane_Dovre_Innlandet_Norway_(2004.06.28).jpg',NULL