
Upload/import wizard for Wikisource works
Open, Needs Triage · Public

Description

This task tracks a wish from the 2016 Community Wishlist Survey: https://meta.wikimedia.org/wiki/2016_Community_Wishlist_Survey/Categories/Wikisource#Upload_Wikisource_text_wizard

Original proposal

Upload Wikisource text wizard:
Problem: The text upload process is complex and spans multiple projects.
Who would benefit: Uploaders
Proposed solution: Create a wizard that covers the whole text upload process: search the Internet Archive, use the IA uploader to Commons, set up the Index page at Wikisource to match Commons, and adjust the 'page offset' on the Index page.
Phabricator tickets: T49561: Book upload customization — to integrate book-specific features into UploadWizard
Proposer: Slowking4 02:46, 8 November 2016 (UTC)

Background

A common process for getting a work into Wikisource and ready for proofreading is as follows:

  1. Find the scan on the Internet Archive and download the DjVu version (which is no longer created by IA but still exists for older items)
  2. Upload the DjVu to Commons and populate the {{book}} template with the metadata (see the sketch after this list)
  3. Create a matching Index page on Wikisource and populate it with the metadata
  4. Set up the 'pagelist', which maps scan page numbers (starting from 1 for the cover of the book) to book page numbers (which can include independently numbered sections such as front matter, as well as unnumbered pages)
  5. Create a Wikidata item, again with all the above metadata as well as: Wikisource index page (P1957), Internet Archive ID (P724), and scanned file on Wikimedia Commons (P996)
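
Steps 1 and 2 lend themselves to partial automation. Below is a minimal, hypothetical sketch (not part of any existing tool) that pulls an item's metadata from the Internet Archive's public metadata endpoint and renders it as Commons {{book}} wikitext; the field mapping is an assumption and would need checking against the template documentation.

```python
"""Hypothetical sketch: Internet Archive metadata -> Commons {{book}} wikitext.

Uses only the public https://archive.org/metadata/<identifier> endpoint;
the field mapping below is illustrative, not a definitive {{book}} mapping.
"""
import requests


def fetch_ia_metadata(identifier: str) -> dict:
    # The metadata endpoint returns JSON containing a "metadata" object.
    resp = requests.get(f"https://archive.org/metadata/{identifier}", timeout=30)
    resp.raise_for_status()
    return resp.json().get("metadata", {})


def as_text(value) -> str:
    # Some IA fields (creator, language, ...) can be lists; normalise to text.
    if isinstance(value, list):
        return "; ".join(str(v) for v in value)
    return str(value) if value is not None else ""


def book_template(meta: dict, identifier: str) -> str:
    # Map a few common IA fields onto {{book}} parameters (assumed mapping).
    fields = {
        "Author": as_text(meta.get("creator")),
        "Title": as_text(meta.get("title")),
        "Publisher": as_text(meta.get("publisher")),
        "Date": as_text(meta.get("date")),
        "Language": as_text(meta.get("language")),
        "Source": f"https://archive.org/details/{identifier}",
    }
    lines = ["{{Book"] + [f" | {k} = {v}" for k, v in fields.items()] + ["}}"]
    return "\n".join(lines)


if __name__ == "__main__":
    ia_id = "someiaidentifier"  # hypothetical IA identifier
    print(book_template(fetch_ia_metadata(ia_id), ia_id))
```

The same metadata could then be reused to prefill the Index page and the Wikidata item in steps 3 and 5.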

Possible solutions

  • Extend the UploadWizard extension to specifically handle books (this is what T49561 is about, and was the subject of a GSoC project)
  • Create a new MediaWiki extension
  • Extend the Book Uploader Bot tool (cf. T59813, which is about the creation of BUB and is still open)
  • Create a new tool

Requirements

Starting from one of:

  1. a set of scan files,
  2. PDF file,
  3. DjVu file,
  4. Internet Archive identifier, or
  5. other online library identifier (perhaps BUB makes this redundant?)

we want to end up with:

  • the original files uploaded to Commons (each with {{book}}) and to the Internet Archive (from where there will be a link back to the Commons category, added as a review if we're unable to edit the original item)
  • a generated DjVu file on Commons, also with {{book}}
  • a category on Commons for the above
  • a Wikisource Index page (as an index to the DjVu on Commons)
  • the pagelist on the Index page (a first-draft generator is sketched after this list)
  • a Wikidata item linking to all of the above
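
On the pagelist, a wizard could at least produce a first draft from a single page offset (the common case where a run of unnumbered front matter precedes the arabic numbering); anything more irregular still needs manual adjustment. A minimal sketch, assuming the standard ProofreadPage <pagelist /> syntax:

```python
"""Hypothetical sketch: generate a first-draft ProofreadPage <pagelist /> tag.

Assumes the simple case: the first `offset` scan pages are unnumbered
front matter, and book page 1 starts at scan page offset + 1.
Plates, roman-numbered sections, etc. still need manual editing.
"""


def draft_pagelist(offset: int) -> str:
    if offset <= 0:
        # No unnumbered front matter: default numbering already matches.
        return "<pagelist 1=1 />"
    # e.g. offset=8 -> '<pagelist 1to8=- 9=1 />'
    return f"<pagelist 1to{offset}=- {offset + 1}=1 />"


if __name__ == "__main__":
    print(draft_pagelist(8))
```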

Of course, many works will start with some of these resources already in place (e.g. we don't need to generate a DjVu file if one already exists on IA, or a Wikidata item may exist and link to IA while there's nothing on Commons or Wikisource yet), so the system needs to be able to work with partially imported works.
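
To handle partially imported works, the wizard first has to discover what already exists. A minimal sketch, assuming the work is identified by its Internet Archive ID and that only Wikidata is consulted (a real tool would also check IA and Commons directly): query the Wikidata Query Service for an item with that P724 value and report which of P996 and P1957 are already set.

```python
"""Hypothetical sketch: report which pieces of a work already exist,
starting from an Internet Archive identifier, by querying Wikidata
(P724 = Internet Archive ID, P996 = scanned file on Wikimedia Commons,
P1957 = Wikisource index page).
"""
import requests

SPARQL = """
SELECT ?item ?commonsFile ?indexPage WHERE {
  ?item wdt:P724 "%s" .
  OPTIONAL { ?item wdt:P996 ?commonsFile . }
  OPTIONAL { ?item wdt:P1957 ?indexPage . }
}
"""


def import_state(ia_identifier: str) -> dict:
    # Naive string substitution is fine for a sketch; a real tool would
    # escape the identifier properly.
    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": SPARQL % ia_identifier, "format": "json"},
        headers={"User-Agent": "wikisource-import-sketch/0.1"},
        timeout=60,
    )
    resp.raise_for_status()
    rows = resp.json()["results"]["bindings"]
    if not rows:
        return {"wikidata_item": None, "commons_file": None, "index_page": None}
    row = rows[0]
    return {
        "wikidata_item": row["item"]["value"],
        "commons_file": row.get("commonsFile", {}).get("value"),
        "index_page": row.get("indexPage", {}).get("value"),
    }
```

Whatever is missing from the result would then determine which steps the wizard offers.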

Event Timeline


Just a small list of "it can be done" ideas.

  1. add a field to the {{book}} template to hold the contents of a good pagelist tag (both so it can be imported into the Index page, and so the important DjVu-page-to-book-page mapping is available to Commons tools);
  2. standardize the content of a basic Summary field for the Index page, with the minimal data needed to build ns0 pages and subpages, and allow these data to be posted into the Commons {{book}} template as well;
  3. run a convenient, customizable post-OCR list of replacements on the DjVu text layer just before the DjVu is uploaded to Commons (see a draft script here; a minimal sketch of the general approach follows this list);
  4. build an hOCR→dsed conversion tool, allowing Tesseract output to be mounted into DjVu files (personal work in progress).
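
For the post-OCR replacements idea (item 3), here is a minimal sketch of the general approach, assuming the text layer has already been dumped to plain text (e.g. with djvutxt) and that the rules are stored in a simple tab-separated file; this is not the draft script linked above.

```python
"""Hypothetical sketch of a customizable post-OCR cleanup pass.

Assumes the DjVu text layer has already been extracted to plain text
(e.g. with djvutxt) and that rules are stored one per line as
<regex><TAB><replacement>.
"""
import re
import sys


def load_rules(path: str) -> list:
    rules = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            pattern, replacement = line.split("\t", 1)
            rules.append((re.compile(pattern), replacement))
    return rules


def clean(text: str, rules) -> str:
    # Apply every replacement rule in order.
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    # Usage: python postocr.py rules.tsv < extracted.txt > cleaned.txt
    sys.stdout.write(clean(sys.stdin.read(), load_rules(sys.argv[1])))
```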
Nemo_bis subscribed.

A kind of "book upload wizard" is BHL's Macaw, which uploads to the Internet Archive and produces records/viewers like https://www.biodiversitylibrary.org/item/137363

See docs linked from https://github.com/gbhl/macaw-book-metadata-tool