Utilize ChatGPT for categorizing and extracting metadata from files on Commons
Closed, Resolved (Public)

Description

Hi!

I am attempting to use ChatGPT to categorize files uploaded to Commons by PencakeBot and to extract unstructured metadata from their filenames and description texts.

It has shown promising results for both categorization and metadata extraction, although its output is not always accurate. Because I am processing only files uploaded by PencakeBot, I believe this approach is acceptable, and preferable to manually writing complicated extraction rules and regular expressions, which is time-consuming and also error-prone. Below, I outline some specific use cases:

  • a. Extract book name from file name (for categorizing):
    • Input: 昌黎先生集四十卷外集十卷遺文一卷 (Mr. Changli's Collected Works, Forty Volumes, with an Additional Ten Volumes of Posthumous Writings.)
    • Output: 昌黎先生集 (Mr. Changli's Collected Works)
    • Input: 永嘉集內編四十八卷外編二十六卷 (Yongjia Collection, Compiled in Forty-Eight Volumes, with an Additional Twenty-Six Volumes of Supplementary Material.)
    • Output: 永嘉集 (Yongjia Collection)
    • Input: 皇朝經世文三編八十卷 (The Imperial Dynasty's Monumental Works in Three Sections, Eighty Volumes.)
    • Output: 皇朝經世文 (The Imperial Dynasty's Monumental Works)
  • b. Sort filenames in semantic order (for the next/prev navigation bar):
    • Input:
"诚意伯卷之十六":Volume Sixteen of the Sincere Duke's Collection.
"诚意伯文集卷二":Sincere Duke's Collected Works Vol.2.
"诚意伯文集卷六":Sincere Duke's Collected Works Vol.6.
"诚意伯文集卷四":Sincere Duke's Collected Works Vol.4.
"诚意伯文集卷一":Sincere Duke's Collected Works Vol.1.
"诚意伯文集卷之九":Volume Nine of the Sincere Duke's Collected Works.
"诚意伯文集卷之三十":Volume Thirty of the Sincere Duke's Collected Works.
"诚意伯文集卷之十八":Volume Eighteen of the Sincere Duke's Collected Works.
    • Output: [4, 1, 3, 2, 5, 0, 7, 6]
  • c. Extract author names from a byline:
    • Input: (梁)蕭統撰;(唐)李善注;(淸)何焯評; (淸)葉樹蕃訂撰 ((Liang) Xiao Tong, authored; (Tang) Li Shan, annotated; (Qing) He Zhuo, commented; (Qing) Ye Shufan, edited.)
    • Output:
[
    {"author": "蕭統", "dynasty": "梁"},
    {"author": "李善", "dynasty": "唐"},
    {"author": "何焯評", "dynasty": "清"},
    {"author": "葉樹蕃", "dynasty": "清"}
]
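
Because the model's output may not always be accurate, it could be checked before it is used to edit files. The following is only a minimal sketch of such a check for use cases b and c, not part of the actual bot: the function names `apply_order` and `parse_authors` are hypothetical, and no language model API is called here. The sample values are the ones listed above.

```python
import json


def apply_order(filenames, order):
    """Reorder filenames by a model-returned list of indices (use case b).

    Raises ValueError if `order` is not a permutation of 0..len(filenames)-1.
    """
    if (not all(isinstance(i, int) for i in order)
            or sorted(order) != list(range(len(filenames)))):
        raise ValueError(f"not a valid permutation: {order!r}")
    return [filenames[i] for i in order]


def parse_authors(raw_json):
    """Parse a model-returned author list (use case c) and check its shape."""
    records = json.loads(raw_json)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array")
    for record in records:
        # Each record is assumed to carry exactly these two keys.
        if not isinstance(record, dict) or set(record) != {"author", "dynasty"}:
            raise ValueError(f"unexpected record: {record!r}")
    return records


filenames = [
    "诚意伯卷之十六", "诚意伯文集卷二", "诚意伯文集卷六", "诚意伯文集卷四",
    "诚意伯文集卷一", "诚意伯文集卷之九", "诚意伯文集卷之三十", "诚意伯文集卷之十八",
]
ordered = apply_order(filenames, [4, 1, 3, 2, 5, 0, 7, 6])
authors = parse_authors('[{"author": "蕭統", "dynasty": "梁"}]')
```

A check like this only catches malformed output (a repeated index, a missing key). It cannot tell whether a well-formed answer is semantically right, so spot checks by a human would still be needed.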

While I am committed to contributing to Wikimedia projects, I am not in a position to personally cover the cost of using ChatGPT or other language model APIs. I would therefore like to ask whether the Foundation could offer free access to ChatGPT or other language model APIs, similar to the access it provides to Toolforge and other cloud services.

I am happy to discuss the use cases in more detail and to provide any additional information needed to evaluate the feasibility of this request.

Thanks.

Event Timeline

Reedy renamed this task from "Utilize ChatGPT for categorizing and extract metadata from files on Commons" to "Utilize ChatGPT for categorizing and extractinb metadata from files on Commons". (Sep 8 2023, 9:18 AM)
Reedy updated the task description.
Hoi renamed this task from "Utilize ChatGPT for categorizing and extractinb metadata from files on Commons" to "Utilize ChatGPT for categorizing and extracting metadata from files on Commons". (Sep 8 2023, 9:18 AM)
Hoi updated the task description.

Hi Hoi! Unfortunately, we can't provide free access to ChatGPT. However, we are working on hosting large language models on WMF's infrastructure, though that will take a few months.