ProofreadPages: incorrect spacing between words in rendered PDF page
Closed, Resolved | Public | BUG REPORT

Description

The spacing between words in the rendered page may be incorrect. Compare the underlined passages.

Screenshot from 2022-01-11 18-53-41.png (1×3 px, 2 MB)

LEFT (expected):
https://commons.wikimedia.org/wiki/File:ISC_Russia_Report.pdf
https://upload.wikimedia.org/wikipedia/commons/3/3d/ISC_Russia_Report.pdf

RIGHT (actual):
https://en.wikisource.org/wiki/Page:ISC_Russia_Report.pdf/12

All on Firefox 95, Linux. This is a high-profile recent government report, so it is very unlikely that Firefox is wrong. This bug has already led to incorrect proofreading of pages on Wikisource.

Software version: 1.38.0-wmf.16 (rMW5b4a658ccd69)

Event Timeline

Removing Browser-support-print-media as this is not about printing content (see its description).

Last I heard (I could be wrong), MediaWiki-extensions-PdfHandler (ping @Tgr) uses Ghostscript to render PDFs to JPEG thumbnails, meaning this is most likely an upstream bug affecting certain born-digital PDFs. The best case for fixing it is probably a newer version of Ghostscript, which I'm guessing would be blocked on T289228. If it can be reproduced in the latest stock Ghostscript, it should probably be reported upstream, and a fix here would then also depend on when upstream makes a release with a fix. Alternatively, there is T38594; but I suspect it'd be fairly resource-intensive on the MediaWiki side, and I have no idea what the relative merits of Ghostscript and MuPDF are. A switch might conceivably have a positive effect on the problem described in T242169 (or it might not; or it might make it worse).
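For anyone trying to reproduce this in stock Ghostscript, here is a minimal sketch that builds a command line rendering a single PDF page to JPEG. The helper name `build_gs_render_command`, the output file name, and the 150 dpi default are assumptions for illustration; the exact options PdfHandler passes are not stated in this task and may differ.

```python
from typing import List


def build_gs_render_command(
    pdf_path: str, page: int, out_path: str, dpi: int = 150
) -> List[str]:
    """Build a Ghostscript argument list that renders one page to JPEG.

    Hypothetical helper: this is NOT PdfHandler's actual invocation,
    only a plain command for reproducing the rendering locally.
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    return [
        "gs",
        "-sDEVICE=jpeg",          # raster JPEG output device
        f"-r{dpi}",               # resolution in dpi (assumed value)
        f"-dFirstPage={page}",    # restrict rendering to a single page
        f"-dLastPage={page}",
        "-o", out_path,           # -o also implies -dBATCH -dNOPAUSE
        pdf_path,
    ]
```

The returned list can be passed to `subprocess.run(...)` on a machine with Ghostscript installed, for example `build_gs_render_command("ISC_Russia_Report.pdf", 12, "page-12.jpg")` for the page linked in the description.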

PdfHandler is listed as stewarded by Web-Team-Backlog (I think, the list isn't very clear) and maintained by @Bawolff (yes, I know you're not …(WMF) anymore, but that's what the list says; apologies for the spam, again). It links on to the extensions list, which says PdfHandler is owned by #product-infrastructure-team-backlog. In any case, this isn't a problem in ProofreadPage, it's a problem in PdfHandler that just happens to affect (be visible in) ProofreadPage.

TheDJ subscribed.

Probably a font-specific problem, by the looks of it.

Yeah, so the console says:
Loading font Times-Roman (or substitute) from /usr/local/Cellar/ghostscript/9.56.1/share/ghostscript/9.56.1/Resource/Font/NimbusRoman-Regular
which it doesn't say for other conversions that I have done.
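A small sketch for spotting such substitutions in captured console output. The regular expression is modelled only on the single line quoted above; other Ghostscript versions may word the message differently, so treat the pattern and the helper name `find_font_substitutions` as assumptions.

```python
import re
from typing import List, Tuple

# Pattern based solely on the message quoted in this task (an assumption).
_SUBSTITUTION = re.compile(
    r"^Loading font (?P<name>.+?) \(or substitute\) from (?P<path>.+)$"
)


def find_font_substitutions(log_text: str) -> List[Tuple[str, str]]:
    """Return (requested font, file actually loaded) pairs found in the log."""
    found = []
    for line in log_text.splitlines():
        match = _SUBSTITUTION.match(line.strip())
        if match:
            found.append((match.group("name"), match.group("path").strip()))
    return found
```

An empty result for one PDF and a non-empty result for another is a quick hint that the second file relies on fonts it does not embed.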

Output from pdffonts

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
LHDRMR+ArialMT                       TrueType          WinAnsi          yes yes yes   2554  0
LHDRMR+TimesNewRomanPSMT             TrueType          WinAnsi          yes yes yes   2556  0
LHDRMR+TimesNewRomanPSMT             TrueType          WinAnsi          yes yes yes   2557  0
GFONUB+Calibri                       TrueType          WinAnsi          yes yes yes    227  0
QHMVEH+LucidaSansUnicode             TrueType          WinAnsi          yes yes yes    229  0
QHMVEH+TimesNewRomanPS-ItalicMT      TrueType          WinAnsi          yes yes yes    231  0
LFXMMR+TimesNewRomanPS-BoldMT        TrueType          WinAnsi          yes yes yes    233  0
VHVUWX+TimesNewRomanPS-BoldItalicMT  TrueType          WinAnsi          yes yes yes    235  0
FEYXGD+Calibri-Bold                  TrueType          WinAnsi          yes yes yes    236  0
AHJTON+SymbolMT                      CID TrueType      Identity-H       yes yes yes    241  0
Times-Roman                          Type 1            WinAnsi          no  no  yes    242  0

As can be seen, the font with the glyphs for the Times-Roman text is not embedded in the PDF. Ghostscript doesn't have it (it is a proprietary font) and thus substitutes NimbusRoman-Regular.
I'm not sure whether a better mapping can be made, but I doubt MuPDF is going to help with this, considering that we are running on Debian, which does not include proprietary fonts. Maybe there is a better replacement font? Resaving the PDF with fonts embedded, however, will likely fix this.

TheDJ claimed this task.

Resaved as PDF/A and uploaded.