Bad text layer extraction from PDFs
Open, Needs Triage · Public · Bug Report

Description

If a scanned PDF has a text layer, MediaWiki extracts it very poorly, even when the text layer itself is very good. DJVU files do not suffer from this problem: their text layer is extracted well. If the PDF is converted into DJVU, the text extracted from its text layer usually improves too. If the text is copied from the PDF document and pasted into a word processor, it comes out well too. This suggests that the text layer is good and that the problem lies in how MediaWiki extracts it from PDFs.

Example of text layer extraction from a PDF here: https://en.wikisource.org/w/index.php?title=Page:The_Hussite_Wars,_by_the_Count_L%C3%BCtzow.pdf/70&action=edit&redlink=1

The same PDF scan was converted into DJVU and the result can be compared here: https://en.wikisource.org/w/index.php?title=Page:The_Hussite_wars,_by_the_Count_L%C3%BCtzow.djvu/70&action=edit&redlink=1

Most libraries, including the Internet Archive and HathiTrust, offer downloads as PDFs with text layers, not as DJVUs. Besides, handling DJVU files is difficult for many contributors, not only for newcomers. So we really do need to fix the text layer extraction from PDFs.

Event Timeline

By playing with the "pdftotext" options, the output can be made similar to the DJVU text layer.

For example, compare:

$ pdftotext -layout -f 70 -l 70 The_Hussite_Wars,_by_the_Count_Lützow.pdf tmp.txt && cat tmp.txt
48 THE HUSSITE WARS
passing safely through a country occupied by Sigismund's
troops, they arrived near Kralov6 Hradec. They called to
...

vs.

$ pdftotext -f 70 -l 70 The_Hussite_Wars,_by_the_Count_Lützow.pdf tmp.txt && cat tmp.txt
THE HUSSITE WARS

48

passing safely through a country occupied by Sigismund's
...

vs.

$ djvutxt -page=70 The_Hussite_wars,_by_the_Count_Lützow.djvu
48 
THE HUSSITE WARS 
passing safely through a country occupied by Sigismund's 
troops, they arrived near Kralov6 Hradec. They called to 
...

I was never able to get exactly the same result, but that may depend on which options were used by the program that converted the PDF to DJVU (@Jan.Kamenicek, what did you use?).
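To make the comparison above easier to repeat on other pages, here is a minimal sketch of a shell helper that prints the first lines of one page in each pdftotext ordering mode. The function name compare_modes and the 5-line preview are assumptions made for illustration; -f, -l, -layout and -raw are standard pdftotext options, and "-" as the output file sends the text to stdout.

```shell
# Print the first lines of one PDF page in each pdftotext ordering mode,
# so the outputs can be compared side by side.
# Usage: compare_modes FILE.pdf PAGE
# (compare_modes is a hypothetical helper name, not an existing tool.)
compare_modes() {
    pdf=$1
    page=$2
    # "" stands for the default mode; -layout keeps the physical layout,
    # -raw keeps the content-stream order.
    for mode in "" -layout -raw; do
        printf '===== pdftotext %s =====\n' "${mode:-(default)}"
        # $mode is left unquoted on purpose: when it is empty, no extra
        # argument is passed. "-" sends the extracted text to stdout.
        pdftotext -f "$page" -l "$page" $mode "$pdf" - | head -n 5
    done
}
```

It could be called as `compare_modes The_Hussite_Wars,_by_the_Count_Lützow.pdf 70`, assuming pdftotext is installed and the file is in the current directory.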

Flags for "pdftotext" are set here:
https://github.com/wikimedia/mediawiki-extensions-PdfHandler/blob/df484dbe704a5d4ee902e38defeba991dbe21fab/includes/PdfImage.php#L152

One option would be to change the flags there, but the impact compared with the current settings should be evaluated first.
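As a rough way to evaluate that impact, the sketch below counts how many pages in a range yield different text when one extra flag is added. The baseline here is pdftotext with no ordering flag, which is an assumption and may not match what PdfHandler actually passes; count_changed_pages is a hypothetical helper, not part of any extension.

```shell
# Count the pages in a range whose extracted text changes when one extra
# pdftotext flag is added to the baseline invocation.
# Usage: count_changed_pages FILE.pdf FIRST_PAGE LAST_PAGE FLAG
# Assumption: the baseline is pdftotext without any ordering flag.
count_changed_pages() {
    pdf=$1
    first=$2
    last=$3
    flag=$4
    changed=0
    page=$first
    while [ "$page" -le "$last" ]; do
        # Extract the same page twice: baseline, then with the extra flag.
        base=$(pdftotext -f "$page" -l "$page" "$pdf" -)
        alt=$(pdftotext -f "$page" -l "$page" "$flag" "$pdf" -)
        if [ "$base" != "$alt" ]; then
            changed=$((changed + 1))
        fi
        page=$((page + 1))
    done
    echo "$changed"
}
```

For instance, `count_changed_pages The_Hussite_Wars,_by_the_Count_Lützow.pdf 60 80 -layout` would print the number of pages between 60 and 80 whose text differs with -layout. A count says nothing about whether the change is an improvement, so the differing pages would still need to be checked by eye.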

@Aklapper, I re-added MediaWiki-extensions-PdfHandler, as it seems relevant to me.

@Mpaa I used an online PDF to DJVU converter. I do not remember exactly which one; it could have been https://pdf2djvu.com/ or https://www.djvu-pdf.com/.

However, I doubt that the text layer was improved by the DJVU conversion.

Above I have shown how the PDF text layer is extracted in MediaWiki. The beginning of the page is extracted thus:

...passing safely through a country occupied by Sigismund's
They called to
troops, they arrived near Kralov6 Hradec.

But when I open the PDF document on my computer and copy and paste the text into a word processor, it looks thus:

...passing safely through a country occupied by Sigismund's
troops, they arrived near Kralov6 Hradec. They called to...

This means that the original text layer of the PDF is good; it is only MediaWiki that extracts it badly.
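One way such a swap of lines can arise in general is when text fragments are emitted in an order that differs from their position on the page. The sketch below only illustrates that idea: the coordinates are invented, reading_order is a hypothetical helper, and it is not confirmed that this is what happens with this particular PDF. It sorts fragments top to bottom and then left to right, and joins fragments that share the same vertical position.

```shell
# Input on stdin: one fragment per line as "top<TAB>left<TAB>text".
# Output: fragments sorted top-to-bottom, then left-to-right; fragments
# with the same "top" value are joined onto one line.
# (reading_order is a hypothetical helper; the input format is invented.)
reading_order() {
    sort -t "$(printf '\t')" -k1,1n -k2,2n |
    awk -F '\t' '
        NR > 1 && $1 != prev { printf "\n" }   # new visual line
        NR > 1 && $1 == prev { printf " " }    # same line: join with a space
        { printf "%s", $3; prev = $1 }
        END { if (NR > 0) printf "\n" }
    '
}
```

Feeding it three fragments in the order shown in the extraction above, with made-up positions that put "They called to" to the right of the second line, would look like this:

```shell
printf '100\t50\tpassing safely through a country\n120\t300\tThey called to\n120\t50\ttroops, they arrived near Kralov6 Hradec.\n' | reading_order
```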

@Jan.Kamenicek, I didn't mean that the text layer was improved by the DJVU conversion.
The flags used with the pdftotext command matter, and they are set in MediaWiki; see my comparison above.

I was just curious to know what you used, as I cannot get the ''exact'' same DJVU text layer as in your file just by playing with the pdftotext options, so knowing the tool you used might have shed some light on the optimal settings for pdftotext.

@Mpaa I see. I apologize for not getting your point. Unfortunately, I do not understand the technical side of the problem at all; I just see that although the text layer of many PDF documents is very good, it turns out very poor when I try to work with it in the Wikisource Proofreading extension :-(

> But when I open the PDF document in my computer and copypaste the text into a word processor, it looks thus:
>
> ...passing safely through a country occupied by Sigismund's
> troops, they arrived near Kralov6 Hradec. They called to...
>
> This means that the original text layer of the PDF is good, only Mediawiki extracts it badly.

This is not conclusive; the result of copying and pasting can depend on a lot of things (OS, browser, PDF plugin used, etc.). I tried once on Linux and twice on Windows (once with Acrobat Reader, once reading the PDF in Edge), and I got three different results.

Well, I guess that the OS or browser may sometimes make a good text layer bad, but I doubt that they could turn a bad text layer into a good one. So if I get good results when reading the text layer outside MediaWiki and bad results in MediaWiki, I am convinced that the text layer is good and the problem must be on the MediaWiki side.