Missing section links in the ToC of generated PDF
Open, Stalled, Lowest, Public

Description

The generated PDF from a wiki page has some page internal links to sections missing.

Steps to reproduce

  1. Go to https://fr.wikiquote.org/wiki/Michel_Henry
  2. Generate the PDF using the link "Télécharger comme PDF".
  3. Download and open the PDF.
  4. Observe that in the resulting PDF, some entries in the table of contents are not links.

Actual

These ToC entries (plus more) are not links (some problematic characters are emphasized):

  • Phénoménologie matérielle, 1990
  • Gabrielle Dufour-Kowalska, philosophe française (1939-2015)
  • Rolf Kühn, philosophe allemand (1944- )

Expected

All ToC entries should be links.

Initial investigation

We can reproduce this by opening the page in a Chromium-based browser (I've tried Chrome and Brave so far), right-clicking, choosing Print..., and printing as PDF. (Incidentally, Firefox generates no links at all.)
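The same reproduction can probably be scripted from the command line instead of the Print dialog. The sketch below only builds the command; it assumes a Chromium binary named `chromium` on the PATH (the name varies by system) and uses Chromium's `--headless` and `--print-to-pdf` switches. `build_print_command` is a hypothetical helper, and headless output may not match the dialog's output exactly.

```python
import shlex


def build_print_command(browser, url, output_pdf):
    """Build the argv for a headless print-to-PDF run (nothing is executed here)."""
    return [
        browser,
        "--headless",
        "--disable-gpu",
        f"--print-to-pdf={output_pdf}",
        url,
    ]


command = build_print_command(
    "chromium",  # assumption: the binary name differs per system
    "https://fr.wikiquote.org/wiki/Michel_Henry",
    "Michel_Henry.pdf",
)

# Print a shell-safe command line that can be pasted into a terminal.
print(shlex.join(command))
```

Running the printed command should write `Michel_Henry.pdf` to the current directory, which can then be checked for missing ToC links.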

There are issues with the ToC links containing accented characters and umlauts.
A reduced test case is available as a wiki page.

Minimal test case in HTML
<html>
<head>
    <meta charset="UTF-8">
</head>
<body>
<ul>
    <li><a href="#section1Ü">Section 1 Ü</a></li>
    <li><a href="#section2é">Section 2 é</a></li>
    <li><a href="#section3(">Section 3 (</a></li>
    <li><a href="#section4'">Section 4 '</a></li>
</ul>
<div style="height: 150px"></div>
<h2 id="section1Ü">Section 1 Ü</h2>
Missing link
<h2 id="section2é">Section 2 é</h2>
Missing link
<h2 id="section3(">Section 3 (</h2>
Has link
<h2 id="section4'">Section 4 '</h2>
Has link
</body>
</html>
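As a rough aid for spotting affected entries, the sketch below collects the in-page links from a snippet like the test case and separates fragments that contain non-ASCII characters from those that do not. The ASCII/non-ASCII split is only an assumption, taken from the "Missing link" / "Has link" notes in the test case; the actual upstream cause may be narrower or broader. `FragmentLinkCollector` and `split_by_ascii` are hypothetical names.

```python
from html.parser import HTMLParser


class FragmentLinkCollector(HTMLParser):
    """Collect the fragment part of every in-page <a href="#..."> link."""

    def __init__(self):
        super().__init__()
        self.fragments = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.startswith("#"):
            self.fragments.append(href[1:])


def split_by_ascii(html_text):
    """Return (ascii_only, non_ascii) lists of fragment identifiers."""
    collector = FragmentLinkCollector()
    collector.feed(html_text)
    ascii_only = [f for f in collector.fragments if f.isascii()]
    non_ascii = [f for f in collector.fragments if not f.isascii()]
    return ascii_only, non_ascii


# The ToC part of the minimal test case.
TEST_PAGE = """
<ul>
    <li><a href="#section1Ü">Section 1 Ü</a></li>
    <li><a href="#section2é">Section 2 é</a></li>
    <li><a href="#section3(">Section 3 (</a></li>
    <li><a href="#section4'">Section 4 '</a></li>
</ul>
"""

ascii_only, non_ascii = split_by_ascii(TEST_PAGE)
print("possibly affected:", non_ascii)
print("probably fine:", ascii_only)
```

For the test case, the two fragments with Ü and é land in the "possibly affected" list, which matches the entries marked "Missing link".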

Upstream bug: https://bugs.chromium.org/p/chromium/issues/detail?id=985254

Event Timeline

Restricted Application added a subscriber: Aklapper. May 7 2020, 12:42 PM

Hi @PhilippeAudinos, thanks for taking the time to report this and welcome to Wikimedia Phabricator!
This seems to be about https://fr.wikiquote.org/wiki/Michel_Henry and clicking "Télécharger comme PDF" in the side bar to get a PDF.

The content links are not active between two book titles (or sections) containing a single quote, like for example "L'essence de la manifestation" or "Voir l'invisible. Sur Kandinsky".

I don't know what a "content link" is. Can you please provide steps to reproduce, what you expect to happen, and what happens instead? Please see and follow https://www.mediawiki.org/wiki/How_to_report_a_bug - thanks!

PhilippeAudinos added a comment. Edited May 9 2020, 2:20 PM

Hello,

What I call "content links" are just the links available in the "table of contents" or "Sommaire" of the generated PDF file, used to reach the corresponding section in the PDF file.

When I open the Michel Henry PDF file generated by Wikiquote with Adobe Acrobat Reader, and click for example on the title "L'essence de la manifestation" in the "Sommaire", it works fine.

But the next link, for "Phénoménologie matérielle" in the "Sommaire", doesn't work. The link is not active when we hover over it with the mouse.

And there is the same problem for all books until "Marx II. Une philosophie de l'économie", which also contains a single quote.

The links work correctly for the books "Du communisme au capitalisme" to "Voir l'invisible. Sur Kandinsky", which also contains a single quote in the title.

And so on.

Each time a book title contains a single quote, the links that follow it switch between working and being unavailable.

I hope this description is clearer.

Best regards,
Philippe Audinos

Aklapper renamed this task from Problems with the content links in the PDF file generated, with the French article of wikiquote on Michel Henry to Lack of some internal anchor links in the ToC of a PDF file generated on French Wikiquote. May 10 2020, 3:16 PM

Thanks! I've changed the task summary a bit :)

Vorlod added a subscriber: Vorlod. May 12 2020, 8:58 PM
LGoto triaged this task as Lowest priority. Jul 15 2020, 3:50 PM
LGoto moved this task from Needs triage to Backlog on the Product-Infrastructure-Team-Backlog board.
PhilippeAudinos renamed this task from Lack of some internal anchor links in the ToC of a PDF file generated on French Wikiquote to Lack of some internal anchor links in the ToC of a PDF files generated on French Wikiquote and Wikipedia. Jul 26 2020, 1:36 PM
PhilippeAudinos added a comment. Edited Jul 26 2020, 1:42 PM

Hello,

This is in fact a general problem in the table of contents of all PDFs generated from French Wikiquote and Wikipedia articles.

All titles containing a single quote or a parenthesis (and perhaps also a comma) lead to problems with the following internal anchor links of the table of contents of the generated PDF file.

For example, in the PDF file generated for the article on "France" of the French Wikipedia, only half of the internal anchor links of the table of contents are actually available (which corresponds to about 30 missing links).

Best Regards,
Philippe Audinos

PhilippeAudinos renamed this task from Lack of some internal anchor links in the ToC of a PDF files generated on French Wikiquote and Wikipedia to Lack of some internal anchor links in the ToC of many PDF files generated on French Wikiquote and Wikipedia. Jul 26 2020, 1:43 PM

Assuming this is about the Proton codebase.

PhilippeAudinos added a comment. Edited Jul 27 2020, 1:44 PM

Hello,

The problem of missing anchor links in the PDF file, caused by parentheses or a dash in the titles, also affects the English Wikipedia. Examples are the articles on "France" (7 missing links), "United States" (1 missing link), "Abraham Lincoln" (3 missing links), "Winston Churchill" (25 missing links), "History of England" (2 missing links) and "History of Ireland" (13 missing links).

So I think that this is not really a "Lowest" priority task, but a "High" priority task instead. Or at least a "Medium" priority task, as it potentially concerns all the PDF files generated in all versions of Wikiquote and Wikipedia in all available languages...

Best regards,
Philippe Audinos

Aklapper renamed this task from Lack of some internal anchor links in the ToC of many PDF files generated on French Wikiquote and Wikipedia to Lack of some internal anchor links in the ToC (due to presence of parenthesis in section title?). Jul 27 2020, 6:23 PM
PhilippeAudinos renamed this task from Lack of some internal anchor links in the ToC (due to presence of parenthesis in section title?) to Lack of some internal anchor links in the ToC (due to presence of parenthesis, single quote or dash in section title?). Jul 28 2020, 2:36 PM
PhilippeAudinos raised the priority of this task from Lowest to Medium. Jul 31 2020, 11:05 AM
PhilippeAudinos added a comment. Edited Jul 31 2020, 11:13 AM

Hello,

I have changed the priority of this task to "Medium" priority.

I really think that this task is not a "Lowest" priority task, as the user "LGoto" triaged it, but a "Medium" or even a "High" priority task, because it potentially concerns all the PDF files generated in all versions of Wikiquote and Wikipedia in all available languages, as I explained in my previous comment.

Do not hesitate to change the priority of this task to "High" priority if you think this is more appropriate.

Best regards,
Philippe Audinos

PhilippeAudinos added a comment. Edited Jul 31 2020, 11:58 AM

Hello,

I have also tested for missing links in the articles on the last three Presidents of the United States of America, in the French and English versions of Wikipedia.

The results are as follows: 19 missing links in French and 3 in English for "George W. Bush"; 18 missing links in French and 4 in English for "Barack Obama"; 43 missing links in French and 0 in English for "Donald Trump".

So the favorite and the winner is apparently Donald Trump in all the categories... ;-)

Best regards,
Philippe Audinos

Aklapper raised the priority of this task from Medium to Needs Triage. Jul 31 2020, 1:23 PM

Resetting priority for a potential reevaluation.

PhilippeAudinos added a comment. Edited Aug 1 2020, 9:36 AM

Hello,

For fun, I have also tested for missing links in the PDF files generated for the articles on the last three Presidents of France, in the French and English versions of Wikipedia.

The results are as follows: 41 missing links in French and 8 in English for "Nicolas Sarkozy"; 24 missing links in French and 2 in English for "François Hollande"; 42 missing links in French and 1 in English for "Emmanuel Macron".

In fact, the missing links in the French version of Wikipedia come mainly from accented characters in the French titles and subtitles.

And parentheses in the titles and subtitles don't really seem to be a problem for the anchor links...

Best regards,
Philippe Audinos

PhilippeAudinos renamed this task from Lack of some internal anchor links in the ToC (due to presence of parenthesis, single quote or dash in section title?) to Lack of some internal anchor links in the ToC (due to presence of accentuated characters, parenthesis, single quote or dash in section title?). Aug 1 2020, 9:41 AM
PhilippeAudinos renamed this task from Lack of some internal anchor links in the ToC (due to presence of accentuated characters, parenthesis, single quote or dash in section title?) to Lack of some internal anchor links in the ToC (due to presence of accentuated characters, parenthesis, single quote or dash in section and subsection titles).
PhilippeAudinos renamed this task from Lack of some internal anchor links in the ToC (due to presence of accentuated characters, parenthesis, single quote or dash in section and subsection titles) to Lack of many internal anchor links in the ToC (due to presence of accentuated characters, single quote or dash in section and subsection titles). Aug 1 2020, 9:50 AM
PhilippeAudinos renamed this task from Lack of many internal anchor links in the ToC (due to presence of accentuated characters, single quote or dash in section and subsection titles) to Lack of many internal anchor links in the ToC (due to presence of accentuated characters, single quote or dash in section title). Aug 1 2020, 9:54 AM
bearND added a subscriber: bearND. Edited Aug 1 2020, 5:36 PM

Examples from printing as PDF of https://fr.wikiquote.org/wiki/Michel_Henry:

The single quotes in the ToC links seem fine to me.

  • L'Essence de la manifestation, 1963

There are issues with the ToC links containing accented characters:

  • Phénoménologie matérielle, 1990
  • Gabrielle Dufour-Kowalska, philosophe française (1939-2015)

I've extracted smaller test cases to clarify this further, see task description.

bearND renamed this task from Lack of many internal anchor links in the ToC (due to presence of accentuated characters, single quote or dash in section title) to Missing section links in the ToC of generated PDF. Aug 2 2020, 3:08 AM
bearND updated the task description.
TheDJ added a subscriber: TheDJ.

This is an upstream bug in Chrome. Even Chrome's Print to PDF on macOS seems to be affected.
The problem seems to be that the entire link area is missing (checked with PDF Expert).

Checking for existing upstream bugs...

TheDJ updated the task description. Aug 7 2020, 9:45 PM
TheDJ moved this task from Backlog to Reported Upstream on the Upstream board.
TheDJ changed the task status from Open to Stalled. Edited Aug 7 2020, 9:50 PM
TheDJ triaged this task as Lowest priority. Aug 7 2020, 9:50 PM

Stalled on upstream.

https://bugs.chromium.org/p/chromium/issues/detail?id=985254 says Status:Fixed (Closed) since August 27, and "This particular issue seems to be fixed now, should be in M87. Some follow up problems will need to be addressed in https://bugs.chromium.org/p/chromium/issues/detail?id=1117212 ."

https://chromiumdash.appspot.com/schedule states that the "Stable Release" of 87 is on 2020-11-17.

Jgiannelos added a subscriber: Jgiannelos. Edited Oct 7 2020, 2:15 PM

Tested on the latest Chrome Canary on Mac, and it looks like the fix is already there.

This will be fixed once Chromium 87 reaches stable release (Nov 17 according to the schedule) and Debian releases that version.

Jgiannelos added a comment. Edited Nov 24 2020, 6:48 PM

Although Chromium 87 has been released, Debian buster is still using an old version (83.0.4103.116-1~deb10u3).
The puppeteer package that we install from npm comes with pre-bundled browser binaries. Should we consider using that instead of the Debian package?
Usually we prefer software packaged for Debian, but it might take a while until packages for 87 reach buster.
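To check quickly whether a given Chromium build is new enough, the major version can be compared against 87, the milestone named in the upstream comment ("should be in M87"). This is a minimal sketch; `chromium_major` and `has_toc_link_fix` are hypothetical helpers, and the check assumes the fix was not backported to older packages.

```python
FIXED_IN_MAJOR = 87  # milestone from the upstream comment: "should be in M87"


def chromium_major(version):
    """Extract the major version from a string like '83.0.4103.116-1~deb10u3'."""
    return int(version.split(".", 1)[0])


def has_toc_link_fix(version):
    """Return True if the version is at or above the milestone carrying the fix."""
    return chromium_major(version) >= FIXED_IN_MAJOR


buster_version = "83.0.4103.116-1~deb10u3"  # Debian buster package at the time
print(buster_version, "has fix:", has_toc_link_fix(buster_version))
```

For the buster package this reports that the fix is not yet included, consistent with the comment above.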