
Tool to bulk retrieve Wikimedia Commons image metadata
Open, Needs Triage, Public

Description

The need

I often get requests to retrieve metadata and information for Wikimedia Commons images in bulk. Typically, someone has a list of URLs or file titles and wants the description, author, and license information for each. I usually run a script to do that, but this could make a good tool.

Proposed solution

Create a simple Flask app where users can upload a TSV or CSV of files through the tool's UI, with the option to select various data fields such as:

  • Description
  • Creation date
  • Author
  • License information, etc.

The tool will fetch the information using MediaWiki's API and produce an output TSV file for the user to download.
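The lookup described above can be sketched roughly as follows. This is only an illustration, not the implementation from either linked repo: it queries the MediaWiki Action API on Commons with `prop=imageinfo&iiprop=extmetadata`, and the `extmetadata` field names used here (`ImageDescription`, `Artist`, `LicenseShortName`, `DateTimeOriginal`) are common keys returned by CommonsMetadata, though any of them may be absent for a given file.

```python
import csv
import json
import urllib.parse
import urllib.request

COMMONS_API = "https://commons.wikimedia.org/w/api.php"
FIELDS = ["ImageDescription", "Artist", "LicenseShortName", "DateTimeOriginal"]

def build_params(titles):
    """Query parameters for a batch of 'File:...' titles."""
    return {
        "action": "query",
        "format": "json",
        "titles": "|".join(titles),
        "prop": "imageinfo",
        "iiprop": "extmetadata",
    }

def parse_response(data):
    """Flatten an API response into {title: {field: value}} rows."""
    rows = {}
    for page in data.get("query", {}).get("pages", {}).values():
        meta = (page.get("imageinfo") or [{}])[0].get("extmetadata", {})
        rows[page.get("title", "")] = {
            f: meta.get(f, {}).get("value", "") for f in FIELDS
        }
    return rows

def fetch_batch(titles):
    """Perform one API request for a batch of titles (network call)."""
    url = COMMONS_API + "?" + urllib.parse.urlencode(build_params(titles))
    req = urllib.request.Request(url, headers={
        # API etiquette asks for a descriptive User-Agent with contact info;
        # this value is a placeholder.
        "User-Agent": "CommonsMetaFetch-sketch/0.1 (example@example.org)",
    })
    with urllib.request.urlopen(req, timeout=30) as resp:
        return parse_response(json.load(resp))

def write_tsv(rows, path):
    """Write the flattened rows out as the TSV the user downloads."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, delimiter="\t")
        writer.writerow(["title"] + FIELDS)
        for title, meta in rows.items():
            writer.writerow([title] + [meta[f] for f in FIELDS])
```

A Flask view would then just call `fetch_batch` over the uploaded titles and stream the result of `write_tsv` back to the user.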

Event Timeline

Thank you for tagging this task with good first task for Wikimedia newcomers!

Newcomers often may not be aware of things that may seem obvious to seasoned contributors, so please take a moment to reflect on how this task might look to somebody who has never contributed to Wikimedia projects.

A good first task is a self-contained, non-controversial task with a clear approach. It should be well described, with pointers to help a completely new contributor: for example, it should clearly point to the codebase URL and provide clear steps to help a contributor get set up for success. We've included some guidelines at https://phabricator.wikimedia.org/tag/good_first_task/ !

Thank you for helping us drive new contributions to our projects <3

Hey @KCVelaga, @Aklapper, I've created an initial version of this project. Let me know how I can improve it so that other people can use this tool. I'm happy to work on it and volunteer, thanks!

Working link: https://demo.cc.reimg.cfd/
Github Repo: https://github.com/Mr-Sunglasses/CommonsMetaFetch

Here is the final build version. Like the initial version, it can be expanded to process images in batches for bulk retrieval. A potential issue is the API's rate limits.

Github repo: https://github.com/kavs1123/wikimedia_hack

Why does it fetch metadata from English Wikipedia and not from Wikimedia Commons?

> Potential issues might be the limiting rate of the api.

See https://www.mediawiki.org/wiki/API:Ratelimit/Wikimedia_sites and https://www.mediawiki.org/wiki/API:Etiquette#Request_limit
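The etiquette guidance linked above mostly comes down to batching titles into fewer requests, making requests serially rather than in parallel, and pausing between them. A minimal sketch of that pattern (the 50-title batch size reflects the Action API's usual per-query limit for non-bot clients; the one-second delay is a conservative assumption, not a documented limit):

```python
import time

BATCH_SIZE = 50  # the Action API accepts up to 50 titles per query for most clients

def batches(titles, size=BATCH_SIZE):
    """Yield successive batches of file titles."""
    for i in range(0, len(titles), size):
        yield titles[i:i + size]

def fetch_all(titles, fetch_batch, delay=1.0):
    """Fetch batches serially, sleeping between requests.

    `fetch_batch` is whatever function performs the actual API call and
    returns a dict of results; requests are made one at a time, per the
    etiquette guidelines, with a pause between them.
    """
    results = {}
    for group in batches(titles):
        results.update(fetch_batch(group))
        time.sleep(delay)
    return results
```

Batching alone already reduces a 1,000-file job from 1,000 requests to 20.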

Hey, I have started working on this task. Following are the design mockups and user flow for the tool. I will be implementing a message queue with RabbitMQ to sustain high request volumes, and caching with Redis. One question: does the application have to be built with Flask?

@Aklapper @KCVelaga

Desktop - 1.png (70 KB)

Desktop - 2.png (108 KB)

Desktop - 3.png (78 KB)
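The Redis caching idea from the comment above could be sketched as a small decorator. This is a hypothetical helper, not code from any linked repo: `client` only needs `get(key)` and `setex(key, ttl, value)`, which matches redis-py's interface, so any stand-in with those two methods works, and the key prefix and TTL are arbitrary choices.

```python
import json
from functools import wraps

def cached(client, ttl=3600):
    """Cache a title -> metadata lookup in a Redis-like store.

    Repeated lookups for the same file title hit the cache instead of
    the MediaWiki API, which helps stay within rate limits.
    """
    def decorator(fn):
        @wraps(fn)
        def wrapper(title):
            key = "commonsmeta:" + title  # arbitrary key prefix
            hit = client.get(key)
            if hit is not None:
                return json.loads(hit)
            value = fn(title)
            client.setex(key, ttl, json.dumps(value))
            return value
        return wrapper
    return decorator
```

With a real deployment this would wrap the per-title API lookup, e.g. `lookup = cached(redis.Redis())(fetch_title_metadata)`.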

Hi, please don't ping for no reason - thanks a lot! :)

I am new to contributing to Wikimedia; I apologize for any inconvenience caused by the ping. I will make sure to ping only when absolutely necessary in the future.

Izno added a subscriber: AyushShukla1807.

We need a new Gerrit repository for a Flask-based tool that extracts metadata
(description, author, license, etc.) in bulk from Wikimedia Commons.

Proposed repo name: labs/tools/wikimedia-commons-metadata-extractor
Purpose: A web tool where users can upload a TSV/CSV of file titles or URLs,
and download enriched metadata as output.

Access: Standard Gerrit permissions, with me (tejashxv) as the initial maintainer.