
WS Export in hewikisource unreadable: Large fonts, translated (?) markup tags, many square boxes
Open, Needs Triage · Public

Description

As a Wikisource user, I want the bug in which boxes display instead of characters in Hebrew Wikisource to be examined and ideally fixed, so books can be properly exported and read.

Example Link: https://ws-export.wmcloud.org/?page=%D7%A7%D7%94%D7%9C%D7%AA_%D7%91/%D7%98%D7%A2%D7%9E%D7%99%D7%9D&lang=he&format=pdf
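For reference, the example link is the percent-encoded form of the page title קהלת_ב/טעמים. A minimal sketch of how such a link can be assembled, assuming only the query parameters visible in the link itself (page, lang, format); the function name is made up for illustration and is not part of WS Export:

```python
from urllib.parse import urlencode

# Base URL taken from the example link in this task.
WS_EXPORT_BASE = "https://ws-export.wmcloud.org/"


def build_export_url(page, lang="he", fmt="pdf"):
    """Build a WS Export link for a page title (hypothetical helper).

    Only the three query parameters seen in the example link are used.
    The slash is kept unescaped to match that link.
    """
    query = urlencode({"page": page, "lang": lang, "format": fmt}, safe="/")
    return f"{WS_EXPORT_BASE}?{query}"


print(build_export_url("קהלת_ב/טעמים"))
```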

Acceptance Criteria:

  • Investigate the bug in which boxes display in Hebrew Wikisource exports and:
    • Try to determine the cause of the issue
    • Share findings in comment in this ticket
    • If possible, issue fix or suggest next steps based on findings

Visual Example:

Event Timeline

Restricted Application added a subscriber: Aklapper.
Aklapper renamed this task from "WS Export in hewikisource" to "WS Export in hewikisource unreadable: Large fonts, translated (?) markup tags, many square boxes". Feb 11 2021, 1:30 PM

Attaching result for comparison:

I'm not sure what's going on here, but the output from Parsoid looks wrong: https://he.wikisource.org/api/rest_v1/page/html/%D7%A7%D7%94%D7%9C%D7%AA_%D7%91%2F%D7%98%D7%A2%D7%9E%D7%99%D7%9D

@Mooeypoo I wonder if you might have a brief look at this?

The generated PDF file is horrible in many ways. Basically, WS Export should create a PDF file that is visually equivalent to the one generated by the browser's "Print to PDF" option.

  1. WS Export must import all Common.css and Print.css styles by default. Those styles are used when a page is printed on paper using a physical printer, or to a PDF file using a virtual printer. There is no reason not to use those styles when printing via WS Export.
  2. It fails to load Epub.css styles.
  3. It seems the font size is calculated incorrectly, resulting in huge letters in the printable form.
  4. The Hebrew font is the default sans font of the system. The default font does not support Hebrew Cantillation marks (Unicode U+0591 to U+05AF).
  5. Webfonts are not loaded.
  6. WS Export fails to parse special tags like <קטע התחלה> and <קטע סוף>.
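To check whether a given page is affected by item 4, one can test its text for characters in the cantillation range mentioned there (U+0591 to U+05AF). This is a minimal sketch; the function names are made up for illustration, and it only inspects the text, not which glyphs a particular font actually provides:

```python
# Range of Hebrew cantillation marks as given in item 4 of the list above.
CANTILLATION_START = 0x0591
CANTILLATION_END = 0x05AF


def cantillation_marks(text):
    """Return the cantillation marks found in text, in order of appearance."""
    return [ch for ch in text
            if CANTILLATION_START <= ord(ch) <= CANTILLATION_END]


def has_cantillation(text):
    """True if text contains at least one character in the range."""
    return bool(cantillation_marks(text))


# Bet followed by U+0596 (inside the range) versus plain letters only.
print(has_cantillation("\u05D1\u0596"))              # with a cantillation mark
print(has_cantillation("\u05E9\u05DC\u05D5\u05DD"))  # letters only
```

A page for which this returns True needs a font that covers that range; otherwise the marks are likely to render as boxes.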

Another example: https://he.wikisource.org/wiki/חוק_זכות_יוצרים and the PDFs of https://ws-export.wmcloud.org/?page=חוק_זכות_יוצרים&lang=he&format=pdf and of https://he.wikisource.org/api/rest_v1/page/pdf/חוק_זכות_יוצרים
Attached are three PDF files. The first was created by WS Export, the second is the equivalent PDF created by "Save to PDF", and the third was created by the ElectronPdfService REST API.

We discussed this in estimation today. The boxes are due to the Parsoid work. We will be collecting a more comprehensive list of the Parsoid-related bugs, which we'll share with the Parsoid team soon. In short, work is in progress on reporting this issue to the relevant team.

@Samwilson It looks like the Parsoid bug here is item #6 in Fuzzy's comment above, where Parsoid is failing to parse some tags. But the CSS and other issues are probably WS-Export specific. Is that your understanding as well?

@Samwilson Do you know if this is a one-off problem on a single page or if this is more widespread?

@ifried – The boxes are Unicode characters not supported by whatever font is used (#4 in my list). This is not a parser bug.
@dmaza – The tags that failed to parse (#6 in my list) are the <section begin> and <section end> tags of the Labeled Section Transclusion extension. This extension is widely used within the Hebrew Wikisource.

General question: How can we configure the Hebrew Wikisource so "Download as PDF" will use PDFService API (see T274521#6822740) instead of WS-Export?

> @dmaza – The tags that failed to parse (#6 in my list) are the <section begin> and <section end> tags of the Labeled Section Transclusion extension. This extension is widely used within the Hebrew Wikisource.

This is definitely a Parsoid issue. We will have to address this on the Parsoid end.
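As an illustration of why a localized tag such as <קטע התחלה> could go unrecognized: a tag-name pattern restricted to ASCII letters does not match it, while a more permissive pattern does. Both patterns below are hypothetical and are not Parsoid's actual grammar; the attribute value "a" is also made up.

```python
import re

# Hypothetical strict pattern: tag names limited to ASCII letters and digits.
STRICT_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)(?:\s[^<>]*)?/?>")

# Hypothetical permissive pattern: a tag name is any run of characters
# other than whitespace, '/', '<' and '>'.
PERMISSIVE_TAG = re.compile(r"<([^\s<>/]+)(?:\s[^<>]*)?/?>")


def tag_name(pattern, text):
    """Return the first tag name the pattern finds in text, or None."""
    match = pattern.search(text)
    return match.group(1) if match else None


english = "<section begin=a />"
hebrew = "<קטע התחלה=a />"

print(tag_name(STRICT_TAG, english))     # tag found
print(tag_name(STRICT_TAG, hebrew))      # no tag found
print(tag_name(PERMISSIVE_TAG, hebrew))  # tag found
```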

Change 674705 had a related patch set uploaded (by Arlolra; author: Arlolra):
[mediawiki/services/parsoid@master] [WIP] Be more permissive for extension tag names

https://gerrit.wikimedia.org/r/674705

Change 674705 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Be more permissive for extension tag names

https://gerrit.wikimedia.org/r/674705

Arlolra added a subscriber: Arlolra.

>> @dmaza – The tags that failed to parse (#6 in my list) are the <section begin> and <section end> tags of the Labeled Section Transclusion extension. This extension is widely used within the Hebrew Wikisource.

> This is definitely a Parsoid issue. We will have to address this on the Parsoid end.

That part is done and should go out with the train next week.

Change 675310 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a30

https://gerrit.wikimedia.org/r/675310

Change 675738 had a related patch set uploaded (by C. Scott Ananian; author: Subramanya Sastry):
[mediawiki/vendor@wmf/1.36.0-wmf.37] Bump wikimedia/parsoid to 0.13.0-a30

https://gerrit.wikimedia.org/r/675738

Change 675310 merged by jenkins-bot:
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a30

https://gerrit.wikimedia.org/r/675310

Change 675738 merged by jenkins-bot:
[mediawiki/vendor@wmf/1.36.0-wmf.37] Bump wikimedia/parsoid to 0.13.0-a30

https://gerrit.wikimedia.org/r/675738

After the deploy and running ?action=purge, the section tags on the page in question are now recognized:
https://he.wikisource.org/api/rest_v1/page/html/%D7%A7%D7%94%D7%9C%D7%AA_%D7%91%2F%D7%98%D7%A2%D7%9E%D7%99%D7%9D

@ifried As mentioned in T274521#6921841, the remaining issues seem to have to do with the font, which isn't a parsing issue. Let us know if you need anything else from the Parsing Team here.

The remaining issues here look like they're to do with font support. I've installed the Culmus fonts (as recommended here) and with, for example, Yehuda CLM the above page looks like this:

Is this correct?

A default font can be set by modifying the Hebrew Wikisource's WS_Export.json with e.g.

{
    "defaultFont": "Yehuda CLM"
}