
Tool to upload from Panjab Digital Library
Open, Needs Triage · Public

Description

Panjab Digital Library has 1791 manuscripts and 8996 books on its website. All of the manuscripts and many of the books are in the public domain. Most are in Punjabi, but some are in English, Hindi and Persian. Everything has been digitized as page images, so the text is not searchable, and the images are served in a way that makes them quite difficult to download. I think a tool should be created to download all the manuscripts and books that are in the public domain. This would help develop Punjabi Wikisource as well as Punjab-related content on other Wikisources, which in turn would help improve other projects.

http://www.panjabdigilib.org/webuser/searches/mainpage.jsp

--Satdeep Gill (talk) 07:31, 13 November 2015 (UTC)

This card tracks a proposal from the 2015 Community Wishlist Survey: https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey

This proposal received 0 support votes, and was ranked last out of 107 proposals. https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey/Wikisource#Tool_to_upload_from_Panjab_Digital_Library

Event Timeline

DannyH created this task. · Dec 8 2015, 12:19 AM
DannyH raised the priority of this task to Needs Triage.
DannyH updated the task description.
DannyH moved this task to Wishlist 51-on on the Community-Wishlist-Survey-2015 board.
DannyH added a subscriber: DannyH.
Restricted Application added subscribers: StudiesWorld, Aklapper. · Dec 8 2015, 12:19 AM

Is anyone working on this?

I am interested in creating a tool for this.

The plan is as follows:

  1. Get the URL of a book and its number of pages from the user
  2. A web scraper goes through all the pages and downloads the images
  3. Combine all the images into a PDF file
  4. Upload to Punjabi Wikisource
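
A minimal Python sketch of steps 1 to 3, under stated assumptions: the page images are assumed to be reachable through a URL pattern containing a page number, which has not been checked against the Panjab Digital Library site. The template `https://example.org/book/{page}.jpg`, the function names, and the file naming are all hypothetical. Step 3 relies on the third-party Pillow package; step 4 (the upload) is left out.

```python
import urllib.request
from pathlib import Path
from typing import Callable, List


def page_urls(url_template: str, page_count: int) -> List[str]:
    """Step 1: expand a URL template into one URL per page (1-based).

    The "{page}" placeholder is an assumption about how pages are addressed.
    """
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    return [url_template.format(page=n) for n in range(1, page_count + 1)]


def fetch_bytes(url: str) -> bytes:
    """Download the raw bytes of one page image over HTTP."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def download_pages(
    url_template: str,
    page_count: int,
    out_dir: str,
    fetch: Callable[[str], bytes] = fetch_bytes,
) -> List[Path]:
    """Step 2: save every page image into out_dir, returning paths in page order."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, url in enumerate(page_urls(url_template, page_count), start=1):
        target = out / f"page_{number:04d}.jpg"  # hypothetical naming scheme
        target.write_bytes(fetch(url))
        paths.append(target)
    return paths


def images_to_pdf(image_paths: List[Path], pdf_path: str) -> None:
    """Step 3: combine the images into one PDF (needs the Pillow package)."""
    from PIL import Image  # imported lazily so steps 1 and 2 work without Pillow

    if not image_paths:
        raise ValueError("no images to combine")
    pages = [Image.open(p).convert("RGB") for p in image_paths]
    # Pillow writes a multi-page PDF when save_all=True and append_images is given.
    pages[0].save(pdf_path, save_all=True, append_images=pages[1:])
```

For example, `download_pages("https://example.org/book/{page}.jpg", 3, "pages")` would try to fetch pages 1 to 3 into a `pages` directory. Passing a custom `fetch` callable makes it possible to add throttling or to test without network access.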
DannyH updated the task description. · Jan 13 2016, 1:04 AM
DannyH set Security to None.

@Tshrinivasan -- This card is open, go ahead and work on it. I know that it will be much appreciated. :)

IMPORTANT: If you are a community developer interested in working on this task: The Wikimedia Hackathon 2016 (Jerusalem, March 31 - April 3) focuses on #Community-Wishlist-Survey projects. There is some budget for sponsoring volunteer developers. THE DEADLINE TO REQUEST TRAVEL SPONSORSHIP IS TODAY, JANUARY 21. Exceptions can be made for developers focusing on Community Wishlist projects until the end of Sunday 24, but not beyond. If you or someone you know is interested, please REGISTER NOW.
Restricted Application added a subscriber: JEumerus. · Jan 21 2016, 2:51 PM

The library could be added to the existing tool BUB (http://tools.wmflabs.org/bub/; source: https://github.com/rohit-dua/BUB).
I can add it to the existing list of libraries. The scraping code should be similar to the code used for Google Books.

BUB downloads the images from the library, then uploads the combined images to archive.org with the corresponding metadata. archive.org handles the OCR of the book, which can then easily be uploaded to Wikimedia Commons with the IA-Upload tool.
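
As a rough illustration of the archive.org step (this is not BUB's actual code), the sketch below builds an item identifier and metadata and hands a PDF to the third-party `internetarchive` Python package. The `pdl` prefix, the function names, the 80-character cap, and the choice of metadata fields are assumptions; an upload also needs archive.org credentials configured beforehand.

```python
import re
from typing import Dict


def make_identifier(title: str, prefix: str = "pdl") -> str:
    """Derive an ASCII item identifier from a title (hypothetical naming scheme)."""
    slug = re.sub(r"[^A-Za-z0-9]+", "-", title).strip("-").lower()
    if not slug:
        # Titles written only in Gurmukhi or another non-Latin script end up here.
        raise ValueError("title has no ASCII letters or digits; supply an identifier by hand")
    # Keep the identifier short; the cap of 80 characters is an arbitrary choice.
    return f"{prefix}-{slug}"[:80].rstrip("-")


def build_metadata(title: str, language: str, source_url: str) -> Dict[str, str]:
    """Collect the metadata sent along with the upload."""
    return {
        "title": title,
        "language": language,
        "mediatype": "texts",  # the media type archive.org uses for books
        "source": source_url,
    }


def upload_book(pdf_path: str, title: str, language: str, source_url: str):
    """Upload one PDF as a new archive.org item (needs the internetarchive package)."""
    import internetarchive  # imported lazily so the helpers above work without it

    return internetarchive.upload(
        make_identifier(title),
        files=[pdf_path],
        metadata=build_metadata(title, language, source_url),
    )
```

Since many titles in this collection are presumably not in Latin script, a real integration would more likely build the identifier from a catalogue or accession number than from the title.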

DannyH updated the task description. · Feb 6 2016, 12:35 AM