ProofreadPage should support jpg/png as single-page documents
Open, Needs Triage, Public


There is ongoing work on Balinese palm-leaf manuscripts on Wikimedia sites. This started with a project grant to add manuscripts to Wikisource and is expanding with a Wikicite grant to catalog privately held manuscripts in Bali. As part of the Wikicite grant, manuscripts are being added to Wikisource as a single image containing the first and last leaf. See more background here and an example manuscript here.

The reason these manuscripts are being added to Wikisource in partial form is that there are many manuscripts to catalog and it's important to make information about all of the manuscripts available first before proceeding with further digitization. The text content on the first and last leaf is still very useful.

Unfortunately, ProofreadPage does not fully support JPG/PNG files as documents. If you put the standard <pagelist /> tag under Pages on the Index page, ProofreadPage returns "Error: file not found" instead of a page list. You can work around this by manually entering wikitext in the Pages field, such as (in the example linked above) [[Page:Bali-lontar-Tojan-Candra Graha.jpg|1]]. But this is not ideal, because you lose the visual indication of proofread status that <pagelist /> provides.
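For anyone applying the workaround to many files, here is a minimal sketch of how the manual Pages entry could be generated. The helper name manual_pages_wikitext is hypothetical and not part of ProofreadPage; it only formats the link shown above.

```python
def manual_pages_wikitext(file_name: str, label: str = "1") -> str:
    """Build the manual Pages-field entry for a single-image Index page.

    Hypothetical helper: it only formats a [[Page:...|label]] link and
    does not talk to MediaWiki or ProofreadPage.
    """
    # Drop a leading "File:" prefix if the caller passed a full file title.
    if file_name.startswith("File:"):
        file_name = file_name[len("File:"):]
    return f"[[Page:{file_name}|{label}]]"


if __name__ == "__main__":
    # Prints the wikitext used in the example manuscript from this task.
    print(manual_pages_wikitext("Bali-lontar-Tojan-Candra Graha.jpg"))
```

The result still has to be pasted into the Pages field by hand, and it does not restore the proofread-status colouring that <pagelist /> gives.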

This is a very simple fix code-wise and I'll submit a patch shortly. What I'm less certain of is whether it's a good idea, or whether there are unintended consequences. Feedback appreciated!

Event Timeline

Change 652408 had a related patch set uploaded (by David Kamholz; owner: David Kamholz):
[mediawiki/extensions/ProofreadPage@master] Support single-page file formats

Change 652408 merged by jenkins-bot:
[mediawiki/extensions/ProofreadPage@master] Support single-page file formats

Not a big deal, but I'll note the task description does not provide a use case: which users are currently unable to use multi-page formats? It's trivial to convert a JPG into a PDF (I recommend img2pdf), and doing so makes it easier to upgrade the source document in the future, e.g. by adding OCR (I know, not likely for these specific documents) or images in a different format (e.g. JPEG2000, as IA is doing nowadays).
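As a rough illustration of the img2pdf route, here is a minimal Python sketch. It assumes the third-party img2pdf package is installed (pip install img2pdf); the function names pdf_path_for and convert_image_to_pdf are made up for this example. img2pdf is documented as embedding JPEG data without re-encoding, so the resolution should be unchanged, but that is worth checking on a sample file.

```python
from pathlib import Path


def pdf_path_for(image_path: str) -> Path:
    """Return the output path: same name as the image, with a .pdf suffix."""
    return Path(image_path).with_suffix(".pdf")


def convert_image_to_pdf(image_path: str) -> Path:
    """Wrap a single JPG/PNG in a one-page PDF next to the original file.

    Hypothetical helper for illustration only.
    """
    # Imported lazily so the path helper above works without img2pdf installed.
    import img2pdf  # third-party dependency, assumed to be available

    target = pdf_path_for(image_path)
    # img2pdf.convert returns the PDF as bytes when no output stream is given.
    target.write_bytes(img2pdf.convert(image_path))
    return target
```

Calling convert_image_to_pdf("Bali-lontar-Tojan-Candra Graha.jpg") would write Bali-lontar-Tojan-Candra Graha.pdf in the same directory.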

What's the point of an index for a single-page document? Should we then support conversion from a single page to more than one (e.g. if a scan is found to have been incomplete)?

If it works, fine, let's have one more feature. Let's make sure that users understand their options though.

Joseagush explains the use case on the talk page linked in the task description. The contributors in Bali are cataloging a large number of manuscripts, and converting to PDF is an additional step that is not trivial for them; their attempts so far have resulted in decreased image resolution. I'm not going to say there is no way to get around this, but I also think it's important to make Wikisource accessible to smaller communities like this. They've made a lot of progress already in learning the technical aspects of Wikimedia sites. Just keep in mind that their capacity is still limited, so if they need to learn any new processes (such as img2pdf, which would require them to use Python on Windows, use the command line, etc.) then we should be sure it's really needed.

These manuscripts all have more than one page, but in this phase they'll just be cataloging the first and last page. So I think by definition, there can be no pages missing from this first-and-last-page format. But I think it's important for them to consider what will happen in the future if/when someone wants to fully scan one of the manuscripts. It's possible it would have to be done as two separate documents (catalog entry and full document) but perhaps there's a better way.

Re: OCR, that's also a good thing to keep in mind. It doesn't exist for Balinese script currently, but with a large enough corpus of manually transcribed text (such as from these very documents) someone may be able to develop something.