Bad text layer extraction from PDFs
Open, Needs Triage · Public · Bug Report
Assigned To

None

Authored By

	Jan.Kamenicek
	Jan 7 2020, 10:30 PM

Description

If a scanned PDF has a text layer, MediaWiki extracts it very poorly, even when the text layer is very good. DjVu files do not suffer from this problem: their text layer is extracted well. If the PDF is converted to DjVu, the extraction of the text from its text layer usually improves too. Text copied from the PDF and pasted into a word processor is good as well. This means that the text layer is good and that the problem lies in how MediaWiki extracts it from PDFs.

Example of text layer extraction from a PDF here: https://en.wikisource.org/w/index.php?title=Page:The_Hussite_Wars,_by_the_Count_L%C3%BCtzow.pdf/70&action=edit&redlink=1

The same PDF scan was converted into DJVU and the result can be compared here: https://en.wikisource.org/w/index.php?title=Page:The_Hussite_wars,_by_the_Count_L%C3%BCtzow.djvu/70&action=edit&redlink=1

Most libraries, including the Internet Archive and HathiTrust, offer PDFs with text layers for download, not DjVu files. Besides that, handling DjVu is difficult for many contributors, not only for newcomers. So we do need to fix the text layer extraction from PDFs.

Related Objects

Mentioned In: T363619: Remove option for PDF → DjVu conversion (phetools)
T135313: PDF file lost its resolution on proofreading edit mode
T298992: ProofreadPages: incorrect spacing between words in rendered PDF page

Event Timeline

Jan.Kamenicek created this task. · Jan 7 2020, 10:30 PM

Restricted Application added a project: Internet-Archive. · Jan 7 2020, 10:30 PM

Restricted Application added a subscriber: Aklapper.

Jan.Kamenicek added a project: ProofreadPage. · Jan 7 2020, 10:33 PM

Jan.Kamenicek added a project: MediaWiki-extensions-PdfHandler. · Jan 7 2020, 10:38 PM

Aklapper removed projects: MediaWiki-extensions-PdfHandler, Internet-Archive. · Jan 8 2020, 11:45 AM

By playing with the "pdftotext" options, the output can be made similar to the DjVu text layer.

For example, compare:

$ pdftotext -f 70 -l 70 -layout The_Hussite_Wars,_by_the_Count_Lützow.pdf tmp.txt && cat tmp.txt
48 THE HUSSITE WARS
passing safely through a country occupied by Sigismund's
troops, they arrived near Kralov6 Hradec. They called to
...

vs.

$ pdftotext -f 70 -l 70 The_Hussite_Wars,_by_the_Count_Lützow.pdf tmp.txt && cat tmp.txt
THE HUSSITE WARS

48

passing safely through a country occupied by Sigismund's
...

vs.

$ djvutxt -page=70 The_Hussite_wars,_by_the_Count_Lützow.djvu
48 
THE HUSSITE WARS 
passing safely through a country occupied by Sigismund's 
troops, they arrived near Kralov6 Hradec. They called to 
...

I was never able to get exactly the same result, but that might depend on which options were used by the program that converted the PDF to DjVu (@Jan.Kamenicek, what did you use?).

Flags for "pdftotext" are set here:
https://github.com/wikimedia/mediawiki-extensions-PdfHandler/blob/df484dbe704a5d4ee902e38defeba991dbe21fab/includes/PdfImage.php#L152

One option would be to change the flags there, but the impact relative to the current settings should be evaluated first.
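To make that evaluation easier to repeat, here is a minimal Python sketch that builds the same kind of single-page "pdftotext" call with or without a layout flag. The helper names pdftotext_cmd and extract_page are made up for this illustration, and the sketch does not claim to reproduce the flags that PdfHandler actually passes; only the -f, -l, -layout and -raw options and "-" as the output file (meaning stdout) are taken from pdftotext itself.

```python
import shutil
import subprocess


def pdftotext_cmd(pdf_path, page, mode=None):
    """Build a pdftotext command that prints a single page to stdout.

    mode is None (default reading-order heuristics), "layout" or "raw".
    This helper is hypothetical; it is not part of PdfHandler.
    """
    if mode not in (None, "layout", "raw"):
        raise ValueError("mode must be None, 'layout' or 'raw'")
    cmd = ["pdftotext", "-f", str(page), "-l", str(page)]
    if mode is not None:
        cmd.append("-" + mode)
    # "-" as the output file tells pdftotext to write to stdout.
    cmd += [pdf_path, "-"]
    return cmd


def extract_page(pdf_path, page, mode=None):
    """Run pdftotext for one page and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        pdftotext_cmd(pdf_path, page, mode),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling extract_page("The_Hussite_Wars,_by_the_Count_Lützow.pdf", 70, mode) for each of None, "layout" and "raw" would give three candidate outputs to set next to the djvutxt result.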

Mpaa added a project: MediaWiki-extensions-PdfHandler. · Jan 8 2020, 10:14 PM

This might be useful to understand pdftotext options.
https://github.com/EmpowermentZone/EdSharp/blob/master/Convert/Xpdf/pdftotext.txt

@Aklapper , I readded MediaWiki-extensions-PdfHandler as it seems relevant to me.

@Mpaa I used some online pdf to djvu converter. I do not remember which one exactly, it could have been https://pdf2djvu.com/ or https://www.djvu-pdf.com/ .

However, I doubt that the text layer was improved by the conversion to DjVu.

Above I have shown how the PDF text layer is extracted in MediaWiki. The beginning of the page is extracted thus:

...passing safely through a country occupied by Sigismund's
They called to
troops, they arrived near Kralov6 Hradec.

But when I open the PDF document on my computer and copy and paste the text into a word processor, it looks thus:

...passing safely through a country occupied by Sigismund's
troops, they arrived near Kralov6 Hradec. They called to...

This means that the original text layer of the PDF is good, only MediaWiki extracts it badly.
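The difference between the two outputs above looks like a reading-order problem: the same words are present, but the pieces of one printed line come out in a different sequence. The following Python sketch illustrates how that can happen. It is only an illustration with invented coordinates, not a description of how pdftotext or PdfHandler work internally; the function name reading_order and the (x, y, text) fragment format are assumptions made for this example.

```python
def reading_order(fragments, y_tolerance=2.0):
    """Group text fragments into lines by vertical position.

    fragments is a list of (x, y, text) tuples with y growing downward.
    Fragments whose y differs by at most y_tolerance from the first
    fragment of the current line are treated as the same printed line
    and ordered left to right.
    """
    lines = []  # each entry: [y of first fragment, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: f[1]):
        if lines and abs(y - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    return [" ".join(text for _, text in sorted(parts)) for _, parts in lines]


# Invented coordinates: the second printed line is stored as two
# fragments, and the right-hand one happens to come first in the file.
fragments = [
    (50, 100.0, "passing safely through a country occupied by Sigismund's"),
    (300, 120.5, "They called to"),
    (50, 120.0, "troops, they arrived near Kralov6 Hradec."),
]

stored_order = [text for _, _, text in fragments]
sorted_order = reading_order(fragments)
```

Here stored_order reproduces the sequence seen in the MediaWiki edit box, while sorted_order gives the two lines as they are printed on the page. Which of these an extractor produces depends on its options, which is why the flags matter even when the text layer itself is good.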

Jan.Kamenicek updated the task description. · Jan 8 2020, 11:35 PM

@Jan.Kamenicek, I didn't mean that the text layer was improved by the conversion to DjVu.
The flags used with the pdftotext command matter, and they are set in MediaWiki; see my comparison above.

I was just curious to know what you used, as I cannot get exactly the same DjVu text layer as in your file just by playing with the pdftotext options, so knowing the tool you used might have shed some light on the optimal settings for pdftotext.

@Mpaa I see. I apologize for not getting your point; unfortunately I do not understand the technical side of the problem at all. I just see that although the text layer of many PDF documents is very good, it turns out very poor when I try to work with it in the Wikisource Proofreading extension :-(

In T242169#5788350, @Jan.Kamenicek wrote:

But when I open the PDF document on my computer and copy and paste the text into a word processor, it looks thus:

...passing safely through a country occupied by Sigismund's
troops, they arrived near Kralov6 Hradec. They called to...

This means that the original text layer of the PDF is good, only MediaWiki extracts it badly.

This is not conclusive; the result can depend on a lot of things (OS, browser, PDF plugin used, etc.). I tried once on Linux and twice on Windows (once with Acrobat Reader, once reading the PDF in Edge), and I got three different results.

Well, I guess that the OS or browser may sometimes make a good text layer bad, but I doubt that they could turn a bad text layer into a good one. So if I get good results when reading the text layer outside of MediaWiki and bad results in MediaWiki, I am convinced that the text layer is good and the problem must be on the MediaWiki side.
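Since copy-and-paste results vary between viewers, it may help to compare extractions as text rather than by eye. Below is a small Python sketch using the standard difflib module; the helper names normalize, similarity and show_diff are invented for this example, and treating lines that differ only in whitespace as equal is an assumption about what counts as "the same" text layer.

```python
import difflib


def normalize(text):
    """Split into non-empty lines and collapse runs of whitespace."""
    return [" ".join(line.split()) for line in text.splitlines() if line.strip()]


def similarity(a, b):
    """Return a 0..1 ratio of matching lines between two extractions."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def show_diff(a, b, a_name="a", b_name="b"):
    """Return a unified diff of the normalized lines as one string."""
    return "\n".join(
        difflib.unified_diff(normalize(a), normalize(b), a_name, b_name, lineterm="")
    )
```

Feeding the djvutxt output and each pdftotext variant (or text pasted from different viewers) into similarity gives a rough number for how close they are, and show_diff points at the lines whose order or content differs.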

JAnD subscribed. · Jan 24 2020, 10:10 AM

MJL moved this task from Backlog to Backlog (Proofreader) on the All-and-every-Wikisource board. · Nov 8 2020, 6:23 PM

Xover mentioned this in T298992: ProofreadPages: incorrect spacing between words in rendered PDF page. · Jan 12 2022, 12:33 PM

TheDJ subscribed. · Jun 14 2022, 10:53 PM

TheDJ mentioned this in T135313: PDF file lost its resolution on proofreading edit mode. · Jun 14 2022, 11:11 PM

Arcorann subscribed. · Aug 26 2024, 12:05 AM

Arcorann mentioned this in T363619: Remove option for PDF → DjVu conversion (phetools). · Aug 26 2024, 12:18 AM