
Upload/import wizard for Wikisource works
Open, Needs Triage · Public

Description

This task tracks a wish from the 2016 Community Wishlist Survey: https://meta.wikimedia.org/wiki/2016_Community_Wishlist_Survey/Categories/Wikisource#Upload_Wikisource_text_wizard

Original proposal

Upload Wikisource text wizard:
Problem: The text upload process is complex and spans multiple projects.
Who would benefit: Uploaders
Proposed solution: Create a wizard that covers the whole text upload process: search the Internet Archive, use the IA uploader to Commons, set up the Index page at Wikisource to match Commons, and adjust the 'page offset' on the Index page.
Phabricator tickets: T49561: Book upload customization — to integrate book-specific features into UploadWizard
Proposer: Slowking4 02:46, 8 November 2016 (UTC)

Background

A common process for getting a work into Wikisource and ready for proofreading is as follows:

  1. Find the scan on the Internet Archive and download the DjVu version (which is no longer created by IA but still exists for older items)
  2. Upload the DjVu to Commons and populate the {{book}} template with the metadata (see the sketch after this list)
  3. Create a matching Index page on Wikisource and populate it with the metadata
  4. Set up the 'pagelist', which maps scan page numbers (starting from 1 for the cover of the book) to book page numbers (which can include independently numbered sections such as front matter, as well as unnumbered pages)
  5. Create a Wikidata item, again with all the above metadata as well as: Wikisource index page (P1957), Internet Archive ID (P724), and scanned file on Wikimedia Commons (P996)
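
Steps 1 and 2 lend themselves to partial automation. Below is a minimal, hypothetical sketch (not part of any existing tool) that pulls an item's metadata from the Internet Archive's public metadata endpoint and renders it as Commons {{book}} wikitext; the field mapping is an assumption and would need checking against the template documentation.

```python
"""Hypothetical sketch: Internet Archive metadata -> Commons {{book}} wikitext.

Uses only the public https://archive.org/metadata/<identifier> endpoint;
the field mapping below is illustrative, not a definitive {{book}} mapping.
"""
import requests


def fetch_ia_metadata(identifier: str) -> dict:
    # The metadata endpoint returns JSON containing a "metadata" object.
    resp = requests.get(f"https://archive.org/metadata/{identifier}", timeout=30)
    resp.raise_for_status()
    return resp.json().get("metadata", {})


def as_text(value) -> str:
    # Some IA fields (creator, language, ...) can be lists; normalise to text.
    if isinstance(value, list):
        return "; ".join(str(v) for v in value)
    return str(value) if value is not None else ""


def book_template(meta: dict, identifier: str) -> str:
    # Map a few common IA fields onto {{book}} parameters (assumed mapping).
    fields = {
        "Author": as_text(meta.get("creator")),
        "Title": as_text(meta.get("title")),
        "Publisher": as_text(meta.get("publisher")),
        "Date": as_text(meta.get("date")),
        "Language": as_text(meta.get("language")),
        "Source": f"https://archive.org/details/{identifier}",
    }
    lines = ["{{Book"] + [f" | {k} = {v}" for k, v in fields.items()] + ["}}"]
    return "\n".join(lines)


if __name__ == "__main__":
    ia_id = "someiaidentifier"  # hypothetical IA identifier
    print(book_template(fetch_ia_metadata(ia_id), ia_id))
```

The same metadata could then be reused to prefill the Index page and the Wikidata item in steps 3 and 5.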

Possible solutions

  • Extend the UploadWizard extension to specifically handle books (this is what T49561 is about, and was the subject of a GSoC project)
  • Create a new MediaWiki extension
  • Extend the Book Uploader Bot tool (cf. T59813, which is about the creation of BUB and is still open)
  • Create a new tool

Requirements

Starting from one of:

  1. a set of scan files,
  2. PDF file,
  3. DjVu file,
  4. Internet Archive identifier, or
  5. other online library identifier (perhaps BUB makes this redundant?)

we want to end up with:

  • the original files uploaded to Commons (each with {{book}}) and to the Internet Archive (from where there will be a link back to the Commons category, added as a review if we're unable to edit the original item)
  • a generated DjVu file on Commons, also with {{book}}
  • a category on Commons for the above
  • a Wikisource Index page (as an index to the DjVu on Commons)
  • the pagelist on the Index page (a first-draft generator is sketched after this list)
  • a Wikidata item linking to all of the above
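
On the pagelist, a wizard could at least produce a first draft from a single page offset (the common case where a run of unnumbered front matter precedes the arabic numbering); anything more irregular still needs manual adjustment. A minimal sketch, assuming the standard ProofreadPage <pagelist /> syntax:

```python
"""Hypothetical sketch: generate a first-draft ProofreadPage <pagelist /> tag.

Assumes the simple case: the first `offset` scan pages are unnumbered
front matter, and book page 1 starts at scan page offset + 1.
Plates, roman-numbered sections, etc. still need manual editing.
"""


def draft_pagelist(offset: int) -> str:
    if offset <= 0:
        # No unnumbered front matter: default numbering already matches.
        return "<pagelist 1=1 />"
    # e.g. offset=8 -> '<pagelist 1to8=- 9=1 />'
    return f"<pagelist 1to{offset}=- {offset + 1}=1 />"


if __name__ == "__main__":
    print(draft_pagelist(8))
```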

Of course, many works will start with some of these resources already in place (e.g. we don't need to generate a DjVu file if one already exists on IA, or a Wikidata item may exist and link to IA while there's nothing on Commons or Wikisource yet), so the system needs to be able to work with partially imported works.
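
To handle partially imported works, the wizard first has to discover what already exists. A minimal sketch, assuming the work is identified by its Internet Archive ID and that only Wikidata is consulted (a real tool would also check IA and Commons directly): query the Wikidata Query Service for an item with that P724 value and report which of P996 and P1957 are already set.

```python
"""Hypothetical sketch: report which pieces of a work already exist,
starting from an Internet Archive identifier, by querying Wikidata
(P724 = Internet Archive ID, P996 = scanned file on Wikimedia Commons,
P1957 = Wikisource index page).
"""
import requests

SPARQL = """
SELECT ?item ?commonsFile ?indexPage WHERE {
  ?item wdt:P724 "%s" .
  OPTIONAL { ?item wdt:P996 ?commonsFile . }
  OPTIONAL { ?item wdt:P1957 ?indexPage . }
}
"""


def import_state(ia_identifier: str) -> dict:
    # Naive string substitution is fine for a sketch; a real tool would
    # escape the identifier properly.
    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": SPARQL % ia_identifier, "format": "json"},
        headers={"User-Agent": "wikisource-import-sketch/0.1"},
        timeout=60,
    )
    resp.raise_for_status()
    rows = resp.json()["results"]["bindings"]
    if not rows:
        return {"wikidata_item": None, "commons_file": None, "index_page": None}
    row = rows[0]
    return {
        "wikidata_item": row["item"]["value"],
        "commons_file": row.get("commonsFile", {}).get("value"),
        "index_page": row.get("indexPage", {}).get("value"),
    }
```

Whatever is missing from the result would then determine which steps the wizard offers.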

Event Timeline


Just a small list of "it can be done" ideas.

  1. add a field to the {{book}} template to hold the contents of a good pagelist tag (both so it can be imported into the Index page, and so the important DjVu-page-to-book-page mapping is available to Commons tools);
  2. standardize the content of a basic Summary field for the Index page, with the minimal data needed to build ns0 pages and subpages, and allow these data to be posted into the Commons {{book}} template as well;
  3. run a convenient, customizable post-OCR list of replacements on the DjVu text layer just before the DjVu is uploaded to Commons (see a draft script here; a minimal sketch of the general approach follows this list);
  4. build an hOCR→dsed conversion tool, allowing Tesseract output to be mounted into DjVu files (personal work in progress).
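
For the post-OCR replacements idea (item 3), here is a minimal sketch of the general approach, assuming the text layer has already been dumped to plain text (e.g. with djvutxt) and that the rules are stored in a simple tab-separated file; this is not the draft script linked above.

```python
"""Hypothetical sketch of a customizable post-OCR cleanup pass.

Assumes the DjVu text layer has already been extracted to plain text
(e.g. with djvutxt) and that rules are stored one per line as
<regex><TAB><replacement>.
"""
import re
import sys


def load_rules(path: str) -> list:
    rules = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            pattern, replacement = line.split("\t", 1)
            rules.append((re.compile(pattern), replacement))
    return rules


def clean(text: str, rules) -> str:
    # Apply every replacement rule in order.
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    # Usage: python postocr.py rules.tsv < extracted.txt > cleaned.txt
    sys.stdout.write(clean(sys.stdin.read(), load_rules(sys.argv[1])))
```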
Nemo_bis subscribed.

A kind of "book upload wizard" is BHL's Macaw, which uploads to the Internet Archive and produces records/viewers like https://www.biodiversitylibrary.org/item/137363

See docs linked from https://github.com/gbhl/macaw-book-metadata-tool