
Create a Section for Numerically Sequencing Images on Index ns
Open, Needs Triage, Public

Description

Currently, when attempting to create an Index from individual images, the user replaces the <pagelist /> tag in the Pages section of the Index ns with the list of images. This has the following negative consequences:

  1. Breaks the usage of <pages /> for transcluding
  2. Does not allow a number to be used to set the "Cover image" on the Index ns. Instead, [[File:Cover Page.ext|thumb]] has to be used.
  3. Does not allow for the renumbering of pages in "Pages" on the Index ns.
  4. Leads to the creation of Page:Image Name instead of Page:Index/Image Sequence

The root cause seems to be that while DJVU and PDF files have a numerical index for all the images in them, individual image files do not. Therefore, the solution would be to create this numerical index.

Here is a proposed solution:

  1. On the Index ns, in the Scans section, instead of "jpg", "gif", "tiff", or "png", there would be a single option "Images"
  2. The selection of "Images" would trigger the activation of a new section called "Image List"
  3. In the "Image List" the user would input the list of images and their corresponding numerical sequence, e.g.

[[Lippincotts Monthly Magazine_45_mdp.39015086882035_001.jpg|1]]
...
[[Lippincotts Monthly Magazine_45_mdp.39015086882035_958.jpg|958]]

  3. (V2) Alternatively, ProofreadPage could treat this as an ordered list, load it into an array, and use a loop to automatically number the list:

Lippincotts Monthly Magazine_45_mdp.39015086882035_001.jpg
...
Lippincotts Monthly Magazine_45_mdp.39015086882035_958.jpg

  4. ProofreadPage would then feed this numerical sequence into all the other components that normally use the page sequence built into PDF and DJVU.
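The V1 "Image List" format above could be parsed along these lines; a minimal sketch, assuming the `[[name|number]]` entry format from the example (the field and the function are hypothetical, not existing ProofreadPage code):

```python
import re

# Sketch only: parse the proposed "Image List" field, where each line pairs
# an image file name with its page number, e.g. "[[Name_001.jpg|1]]".
ENTRY = re.compile(r"\[\[([^|\]]+)\|(\d+)\]\]")

def parse_image_list(text):
    """Map page number -> image file name from the field's text."""
    pages = {}
    for line in text.splitlines():
        m = ENTRY.fullmatch(line.strip())
        if m:
            pages[int(m.group(2))] = m.group(1)
    return pages

# parse_image_list("[[Mag_001.jpg|1]]") -> {1: "Mag_001.jpg"}
```

The resulting mapping is the numerical index that DJVU/PDF files carry internally and that loose images lack.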

Event Timeline

Update: while reading the documentation for ProofreadPage (1), I discovered that you can transclude an image sequence using the format <pages index="Index Name" from="Start Image" to="End Image"/> even if the images are in a different format. This implies that, at some point, the code does numerically sequence the image files. However, on the Index ns, <pagelist /> does not work. This implies that the Index ns is not aware of, or does not make use of, the numerical sequence that exists.

Therefore, it seems that a code refactoring is needed to make the Index ns aware of, and able to leverage, the numerical sequence that is created for an image-based Index.

  1. https://wikisource.org/wiki/Wikisource:ProofreadPage

The basic issue appears to be that the page variable is never set for images, and this is cascading downwards.

> Update: while reading the documentation for ProofreadPage (1), I discovered that you can transclude an image sequence using the format <pages index="Index Name" from="Start Image" to="End Image"/> even if the images are in a different format. This implies that, at some point, the code does numerically sequence the image files. However, on the Index ns, <pagelist /> does not work. This implies that the Index ns is not aware of, or does not make use of, the numerical sequence that exists.

The page sequence is manually specified in the Pages list on the Index page. If you tried to put <pagelist /> there it wouldn't work, because that's where the page list itself is defined. The from/to thing just looks up the pages in that list. If you change the Pages list order or content, the output of <pages /> should change too.

What do you mean when you say that you can't renumber pages in "Pages"? Do you mean using roman numerals or whatever else is in the original source as the page number? I'm pretty sure that should work.

It may be possible to parse the "Pages" list and use it to create Page namespace pages like Page:File Name/Number, just like for PDF and DJVU. I'm not sure offhand how much work that would be.

The issue seems to be that if you create an Index from individual images, many of the scripts and gadgets break because image-based Indexes are handled differently from PDF/DJVU ones. A simple task such as numbering pages becomes far more difficult.

For example, https://en.wikisource.org/wiki/Index:Lippincotts_Monthly_Magazine_51 has 1042 pages that you would need to number individually instead of being able to set the numbering using page ranges.

To fix these issues requires either making sure that an Index created from images is functionally the same as one created from PDF/DJVU, or writing special code everywhere. In addition to what I mentioned above, these are also broken:

  • https://en.wikisource.org/wiki/User:Inductiveload/index_preview.js
  • book2scroll
  • Phe's Match and Split
  • The creation of a link to the actual page in the transcluded page

The more I look, the more it seems as if the page variable is not set for image-based Indexes. While PDF/DJVU have a numerical sequence to order the images stored in them, images do not. (If they did, it would all be 1, 1, etc., except for multipage TIFFs.) Perhaps the solution may be as simple as running a loop to set this variable.


> For example, https://en.wikisource.org/wiki/Index:Lippincotts_Monthly_Magazine_51 has 1042 pages that you would need to number individually instead of being able to set the numbering using page ranges.

Do you mean the page ranges that <pagelist /> lets you use for roman and arabic numerals?

> The more I look, the more it seems as if the page variable is not set for image-based Indexes. While PDF/DJVU have a numerical sequence to order the images stored in them, images do not. (If they did, it would all be 1, 1, etc., except for multipage TIFFs.) Perhaps the solution may be as simple as running a loop to set this variable.

The fundamental issue is that there is no defined, ordered set of images for the source outside of the "Pages" list in the index. What would you loop over? You would need to have a way to define the set of images -- for example, the base filename and where the numbers go, so you could say that the pattern is "Lippincotts Monthly Magazine_51_uiug.30112108101111_????.jpg" and the numbers range from 1 to 1042. It would be possible to add support to ProofreadPage for this but it would take some thought for how to design it, and it's probably only worth doing if we're sure there's a need for it. In the case of the sources you mentioned, another solution of course is just to convert them to PDF -- is that not viable for some reason?
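The pattern-based definition suggested above could be sketched like this (the `????` placeholder syntax and the function are assumptions for illustration, not existing ProofreadPage behaviour):

```python
# Sketch: expand a filename pattern, with "?" marking zero-padded digit
# positions, plus a numeric range, into an ordered list of image names.
def expand_pattern(pattern, start, end):
    width = pattern.count("?")          # number of digit positions
    template = pattern.replace("?" * width, "{:0%dd}" % width)
    return [template.format(n) for n in range(start, end + 1)]

# expand_pattern("Magazine_????.jpg", 1, 1042) yields
# "Magazine_0001.jpg" ... "Magazine_1042.jpg", in order.
```

The expanded list gives the defined, ordered set of images that the comment notes is otherwise missing.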

There are multiple issues with PDF.

  1. Loss of image quality due to overzealous compression. At times, this can make texts unreadable. It can also reduce the quality of OCR. Lastly, it means that users have to download the original image from an external website to crop illustrations.
  2. Losslessly recompressing JPEG files significantly increases their size. I've seen 1 GB of JPEGs turn into 2.7 GB in a PDF. Chunked Uploader does not handle files over 2 GB.
  3. Increases loading time on the Page ns in proportion to the PDF size.
  4. Adds an unnecessary layer of complexity, because we go from images -> PDF -> image on the Page ns. In other words, ProofreadPage has to extract the image from the PDF to render it.
  5. The IA upload tool has trouble with files of over 600 pages, requiring manual uploading.

I think the easiest solution would be to treat the file list as an ordered list. The loop would break once the end of the list is reached.

File 1 = Page 1
...
File 300 = Page 300
...
File N = Page N

Once this is set, then we can treat an image based Index in the same way as a PDF based index.
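The "File N = Page N" rule above is just positional numbering; a minimal sketch (the function name is hypothetical):

```python
# Sketch: number an ordered file list by position, so file N becomes page N.
# The loop naturally stops once the end of the list is reached.
def number_files(files):
    return {page: name for page, name in enumerate(files, start=1)}

# number_files(["a.jpg", "b.jpg"]) -> {1: "a.jpg", 2: "b.jpg"}
```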

Numbering the file list is better than trying to parse the list of names, because it is sometimes necessary to patch in a replacement or missing image, so you can theoretically have:

Source 1 page 1
...
Source 2 page 17
...
Missing
....
Source 3 page 27
...
Source 1 page n

Other times, it makes sense to remove images (Google covers, microfilm test patterns, etc.):
File 2 = Page 1
...
File 94 = Page 93
File 97 = Page 94
...
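Skipping removed scans while numbering, as in the "File 2 = Page 1" example above, is only a small extension of the positional loop; a sketch with hypothetical names:

```python
# Sketch: number an ordered file list while skipping unwanted scans
# (Google covers, microfilm test patterns, etc.), so the remaining
# files get consecutive page numbers.
def number_files_skipping(files, skip):
    mapping = {}
    page = 1
    for name in files:
        if name in skip:
            continue
        mapping[page] = name
        page += 1
    return mapping

# number_files_skipping(["cover", "p1", "p2"], {"cover"})
# -> {1: "p1", 2: "p2"}
```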

Thanks for explaining the issues with PDF. I expected it would be something like that but it's good to see it all spelled out.

Logically I think there is only really one good way to solve this, which is creating a new Index field called "Images" which would have a list like:

[[File:Lippincotts Monthly Magazine_51_uiug.30112108101111_0001.jpg]]
[[File:Lippincotts Monthly Magazine_51_uiug.30112108101111_0002.jpg]]
[[File:Lippincotts Monthly Magazine_51_uiug.30112108101111_0003.jpg]]

These could then be referenced from 1 to N from <pagelist /> and <pages>. Does that seem like it could work?
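Assuming the "Images" field holds plain `[[File:...]]` links as above, extracting the ordered sequence is straightforward; a hypothetical sketch, not actual ProofreadPage code:

```python
import re

# Sketch: pull the ordered file names out of a wikitext "Images" field made
# of [[File:...]] links, so that page N maps to the Nth entry.
FILE_LINK = re.compile(r"\[\[File:([^|\]]+)\]\]")

def ordered_images(wikitext):
    return FILE_LINK.findall(wikitext)

# ordered_images("[[File:Mag_0001.jpg]]\n[[File:Mag_0002.jpg]]")
# -> ["Mag_0001.jpg", "Mag_0002.jpg"]; page 1 is entry [0], page N is [N-1].
```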

That sounds like a great approach! Probably, to make life a bit easier for users, we should add the "[[File:" and "]]" in the code.

So the user would enter in the new Index field called "Images":

Lippincotts Monthly Magazine_51_uiug.30112108101111_0001.jpg
Lippincotts Monthly Magazine_51_uiug.30112108101111_0002.jpg
Lippincotts Monthly Magazine_51_uiug.30112108101111_0003.jpg

I'm not sure if that would work because that would require parsing the text in a new format that we would have to define -- for example, newlines might be significant in this format whereas they weren't previously. There are probably also multiple possible ways to resolve the file name -- could be on another wiki, for example. I think it makes more sense to just use the existing wikitext facility for making links. Someone could make a tool to generate the wikitext from a list like you gave, though.

Ok, I see your point. That was the most minor of quibbles. Your original proposal is great. Hope to see this happen.

> There are multiple issues with PDF.

There certainly are, but when you struggle with generating your own PDF files in this way you could also ask the community for help since several of us have tools and experience generating DjVu files from scan images and with the issues involved. As an alternative to insisting on using image-based indexes in contravention of community practice, I mean.

Image based indexes were never particularly well supported, and supporting them well will require a lot more than a half-baked extra field that <pagelist /> can pull from.

It is also highly unlikely that such a change would somehow magically make third-party tools start working with image-based indexes (Match & Split, for example, definitely won't), and conversely there is nothing preventing such tools from working with image-based indexes today (I happen to know that the developer of "index preview.js" is both active and responsive).

I don't think that things will magically begin to work. However, there are multiple cases where an image-based Index would make sense: the Balinese Leaf Project, or any Index that is suffering from display issues. See the following Phabricator tickets: T224355, T256848, T257025, T184867. Also, see https://en.wikisource.org/wiki/User:Inductiveload/jump_to_file

I know that we can create a DJVU-based Index, but DJVU is a dying format. That is why the community has requested support for JP2 (T161934).

This ticket is a start to a longer effort to properly support image-based Indexes. If you have any ideas, I would greatly appreciate your feedback.

@Languageseeker there's also a side-issue that while it's very nice to have the images on hand in their fullest glory possible until T161934 is possible, it's a real pain to actually manage those files because:

  • If you want to move them, it's a bot job touching hundreds of files
  • If you want to update them or their file info, ditto
  • If you want to view them offline (e.g. doing a pagelist using an external viewer or otherwise slinging the data about) you need to download hundreds of files (and that's a lot of data too, possibly >1GB, not an issue for Commons as much as user bandwidth and storage)

These make general maintenance such a PITA that I'm not sure we should recommend using images as the basis for Index pages in general just because of some bugs.

There is also nowhere to store a text layer (the WS-OCR project may or may not help here), and I'm not sure how you could point at a file sequence like that using Wikidata's "document file on Wikimedia Commons": https://www.wikidata.org/wiki/Property:P996

What we need is a container format that can encapsulate n images in a defined sequence along with OCR and perhaps other metadata. But! We have that, it's called DjVu and PDF (and multi-page TIFF for that matter). Most of the problems you cite with document formats are:

  • Slow PDF decoding for IA PDFs in particular due to the JP2 compression of the MRC masks
  • Badly implemented PDF encoding that inflates the file size
  • MediaWiki bugs with the detected image size within PDFs specifically, causing low-quality thumbnails
  • IA-upload limitations, which could probably be addressed
  • Heavy compression, which is not the fault of any format - it's just that the default IA PDFs (and DJVUs) are compressed to death upstream.

@Inductiveload While I agree that individual files make managing the files a real pain, it's probably the only way to do it. I proposed allowing Commons to accept a book scan as one zip in T277921 and AntiCompositeNumber wrote "Nope nope nope, land of 10,000 nopes."

As far as I'm aware, DJVU does not support JP2 files, and we should probably begin to slowly phase it out, as its development has largely come to a halt.

For multi-page TIFF, I don't think there is a lossless way to convert JPEG either.

For Acrobat, the problem is that it can use lossless PNG, but that increases the JPEG size. Uploading a 2.7 GB PDF and trying to wrangle it has presented problems. I did a lossless 934 MB PDF at https://commons.wikimedia.org/wiki/File:Lippincotts_Monthly_Magazine_45.pdf so that we can test lossless PDF support. Also, The Star in the Window.pdf is another example. Perhaps, if PDF is the way that you want to go, then making Chunked Uploader handle files of over 2 GB would be the best solution. However, I would still question the wisdom of inflating 1.1 GB to 2.7 GB and creating layers of unnecessary encapsulation.

If we do create Indexes from single image files, then we can write a script to download them as a zip.

PDF can absolutely contain JP2, it's how the IA does it.

They use an MRC compressor with JPX:

$ pdfimages conspiracietrago00chap_0.pdf -f 1 -l 1 -list
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     808  1198  rgb     3   8  jpx    no       756  0   167   167 8234B 0.3%
   1     1 image    2422  3593  rgb     3   8  jpx    no       757  0   501   500 15.0K 0.1%
   1     2 mask     2422  3593  -       1   1  jpx    no       757  0   501   500 15.0K 1.4%

or with JBIG2 masks:

$ pdfimages westernmandarino00grairich.pdf -f 26 -l 26 -list
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
  26     0 image     838  1356  gray    1   8  jpx    yes      121  0   167   167 2610B 0.2%
  26     1 image    2514  4068  gray    1   8  jpx    no       122  0   500   501 20.1K 0.2%
  26     2 smask    2514  4068  gray    1   1  jbig2  no       122  0   500   501 24.1K 1.9%

Encoding a compressed JP2 as any lossless format like PNG is insane, as you're slavishly encoding all the pseudo-random compression noise. It's no wonder the images come out so huge. The IA has already compressed those files a lot, there's no benefit in treating them like raw camera data.

You're right about DJVU being a less-favoured format (especially since the IA stopped making them). However, it does have some nice features, like good OCR layer support, complete with paragraphs and columns, and excellent random access and decode speeds. On the other hand, the JPX/JBIG2-masked IA PDFs are horrible in terms of decode speed. It's really, really bad on devices like an e-reader, where 10 s/page is how fast you can flip pages, but it's also several seconds on the MW thumbnailer and very roughly 0.5 seconds/page on a dedicated core of my machine.

For your conversion of Lippincot's v45 from Hathi, you can do a lot better:

  • The PNGs are functionally bitonal*, so you can reduce them a fair bit (or JBIG2 if you hate your CPU and the CPU of the decoder)
  • The JPGs can be directly encoded rather than losslessly encoding a JPG as a Flate'd raw image. Using p79 as an example: the original JPG is < 1MB, but your re-encoding of it is 3.5MB.

The total size of the Hathi images is 712MB, so a naive use of img2pdf gives a PDF of the same size - the only increase is the PDF wrapper overhead.

If you first leverage the fact that the PNGs are bitonal, using mogrify -monochrome *.png, then img2pdf produces a PDF of 280MB. Which is still big, but not atrocious considering that it's nearly 1000 images of (mostly) >15MP each, and we haven't thrown away a byte of the information from Hathi except in the Google watermarks.

For comparison, a DjVu with c44 for the JPGs and cjb2 for the PNGs comes out at 89MB. c44 is lossy, but cjb2 is lossless by default.

(*) The only non-black/non-white pixels in a Hathi PNG like that are in the Google watermark.

> For your conversion of Lippincot's v45 from Hathi, you can do a lot better:

Indeed, and if you compile Lippincott into a DjVu with no special tweaking you end up with a 92MB file with no appreciable degradation. This is a Google scan so the "pristine originals" are actually heavily crushed and post-processed. Re-encoding into, for example, DjVu is going to have negligible impact on quality for a dramatic (tenfold) impact on file size.

Especially once you realise that our purpose isn't archival preservation but transcription. We don't need perfect scans, just ones good enough for OCR and manual validation. The reasons for wanting a long term solution for using original scan images are all about reducing complicated steps in the process and making reuse easier, with avoidance of generational loss a distant second.

@Xover and @Inductiveload Thank you both for your feedback and comments. I'm glad that there is a way to reduce the size of the PDF generated from HathiTrust images. However, I feel that there are two separate issues being discussed.

  1. Should image-based Indexes be functionally the same as PDF/DJVU-based Indexes, or should they be handled as special cases?

On this question, I'm of two minds: either we should fix the creation of image-based Indexes, or we should deprecate the code. At this point, the existing code creates more problems than it solves. Nearly every gadget and tool that touches Indexes fails to work with image-based ones. It's unreasonable to ask developers to add workarounds for the fact that ProofreadPage does a terrible job of setting up image-based Indexes.

  2. How should works from HathiTrust be imported into Wikisource? The answer is probably a dedicated Toolforge tool, for which there should be a separate ticket incorporating your suggestions for how to losslessly reduce the size.