
Create an Importer for Hathi Trust
Open, Needs Triage, Public

Description

Hathi Trust is one of the most important sources of scans. It holds many works that are not hosted on the Internet Archive (IA) or other sites, and it hosts higher-quality scans of works that are on Google Books. However, adding a text from Hathi Trust takes multiple steps, which makes the job better suited to a Toolforge project. These are the steps needed to import from Hathi Trust without an API key.

Three Variables to Input

  1. URL, e.g. https://babel.hathitrust.org/cgi/pt?id=uc1.b3315479&view=1up&seq=11
  2. "Number of Pages", e.g.
  3. PDF Name

Step 1) Create a folder named "PDF Name".
Step 2) Extract the "Hathi Trust ID" from the URL, e.g. uc1.b3315479. Note that the Hathi Trust ID can contain slashes.
Step 3) Download the page images, one per page number, from URLs following the pattern

https://babel.hathitrust.org/cgi/imgsrv/image?id="Hathi Trust ID";seq=1;size=10000;rotation=0
....
https://babel.hathitrust.org/cgi/imgsrv/image?id="Hathi Trust ID";seq="Number of Pages";size=10000;rotation=0

Name the files sequentially with four digits, e.g. 0001, 0002, 0003.
Step 4) Check for failed downloads and retry them.
Step 5) Create a text file "PDF Name".txt listing the files in folder "PDF Name".
Step 6) Run tesseract "PDF Name".txt "PDF Name" pdf (Tesseract appends .pdf to the output name itself, and the config name is lowercase).
Step 7) Upload the resulting PDF to Commons under Username.
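
The non-network parts of Steps 2, 3, 5 and 6 could be sketched roughly as follows. This is only a minimal illustration, not a finished tool: the function names are hypothetical, the .jpg extension is an assumption (the task does not say which image format is returned), and the Hathi Trust ID is inserted into the image URL unescaped exactly as in the pattern above, on the assumption that IDs with slashes are accepted in that form. Downloading with retries (Step 4) and the Commons upload (Step 7) are left out.

```python
from pathlib import Path
from urllib.parse import parse_qs, urlsplit

# Endpoint taken from the URL pattern in Step 3.
IMAGE_ENDPOINT = "https://babel.hathitrust.org/cgi/imgsrv/image"


def extract_htid(page_url: str) -> str:
    """Step 2: read the Hathi Trust ID from the `id` query parameter."""
    query = parse_qs(urlsplit(page_url).query)
    if "id" not in query:
        raise ValueError(f"no id parameter in {page_url!r}")
    return query["id"][0]


def image_url(htid: str, seq: int) -> str:
    """Step 3: build the image URL for one page, following the task's pattern."""
    # Assumption: the ID is used as-is, even when it contains slashes.
    return f"{IMAGE_ENDPOINT}?id={htid};seq={seq};size=10000;rotation=0"


def page_filename(seq: int, ext: str = ".jpg") -> str:
    """Step 3: four-digit sequential file name; the extension is an assumption."""
    return f"{seq:04d}{ext}"


def download_plan(page_url: str, num_pages: int, pdf_name: str):
    """Pair every image URL with its target path inside the "PDF Name" folder."""
    htid = extract_htid(page_url)
    folder = Path(pdf_name)
    return [
        (image_url(htid, seq), folder / page_filename(seq))
        for seq in range(1, num_pages + 1)
    ]


def file_list_text(paths) -> str:
    """Step 5: contents of "PDF Name".txt, one image path per line."""
    return "".join(f"{p}\n" for p in paths)


def tesseract_command(pdf_name: str) -> list:
    """Step 6: argument list for Tesseract.

    Tesseract adds the .pdf extension to the output base itself,
    so the bare name is passed as the output base.
    """
    return ["tesseract", f"{pdf_name}.txt", pdf_name, "pdf"]
```

A caller would iterate over download_plan(...) to fetch each URL into its target path, write file_list_text(...) to "PDF Name".txt, and pass tesseract_command(...) to subprocess.run. Returning the command as a list avoids shell-quoting problems when the PDF name contains spaces.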

Event Timeline

Xover renamed this task from Create an Importer for Haithi Trust to Create an Importer for Hathi Trust. May 29 2023, 7:41 AM
Xover updated the task description.