
WS Export in hewikisource unreadable: Large fonts, translated (?) markup tags, many square boxes
Closed, Resolved · Public

Description

As a Wikisource user, I want the bug in which boxes display instead of characters in Hebrew Wikisource to be examined and ideally fixed, so books can be properly exported and read.

Example Link: https://ws-export.wmcloud.org/?page=%D7%A7%D7%94%D7%9C%D7%AA_%D7%91/%D7%98%D7%A2%D7%9E%D7%99%D7%9D&lang=he&format=pdf

Acceptance Criteria:

  • Investigate bug in which boxes display in Hebrew Wikisource and:
    • Try to determine the cause of the issue
    • Share findings in comment in this ticket
    • If possible, issue fix or suggest next steps based on findings

Visual Example:

Screen Shot 2021-02-18 at 4.29.53 PM.png (602×425 px, 124 KB)

Event Timeline

Restricted Application added a subscriber: Aklapper.
Aklapper renamed this task from WS Export in hewikisource to WS Export in hewikisource unreadable: Large fonts, translated (?) markup tags, many square boxes. Feb 11 2021, 1:30 PM

Attaching result for comparison:

I'm not sure what's going on here, but the output from Parsoid looks wrong: https://he.wikisource.org/api/rest_v1/page/html/%D7%A7%D7%94%D7%9C%D7%AA_%D7%91%2F%D7%98%D7%A2%D7%9E%D7%99%D7%9D

@Mooeypoo I wonder if you might have a look at this briefly?

The generated PDF file is horrible in so many ways. Basically, WS Export should create a PDF file that is visually equivalent to the one generated by the browser's "Print to PDF" option.

  1. WS Export must import all Common.css and Print.css styles by default. Those styles are used when a page is printed on paper using a physical printer, or to a PDF file using a virtual printer. There is no reason not to use those styles when printing via WS Export.
  2. It fails to load Epub.css styles.
  3. It seems the font size is calculated incorrectly, resulting in huge letters in the printable form.
  4. The Hebrew font is the default sans font of the system. The default font does not support Hebrew cantillation marks (Unicode U+0591 to U+05AF).
  5. Webfonts are not loaded.
  6. WS Export fails to parse special tags like <קטע התחלה> and <קטע סוף>.
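
For anyone triaging similar reports, here is a minimal sketch (not part of WS Export) for checking whether a text contains cantillation marks at all, using the U+0591 to U+05AF range from item 4. The function name and the sample string are illustrative assumptions; a page with no such marks would not need a font that covers them.

```python
def cantillation_marks(text: str) -> list[str]:
    """Return the Hebrew cantillation marks (U+0591..U+05AF) found in text."""
    return [ch for ch in text if "\u0591" <= ch <= "\u05AF"]


# Hypothetical sample: a pointed Hebrew word carrying one cantillation mark
# (U+0596). The vowel points (U+05B0 and up) fall outside the range.
sample = "\u05D1\u05BC\u05B0\u05E8\u05B5\u05D0\u05E9\u05C1\u05B4\u0596\u05D9\u05EA"

marks = cantillation_marks(sample)
print([f"U+{ord(ch):04X}" for ch in marks])
```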

Another example: https://he.wikisource.org/wiki/חוק_זכות_יוצרים and the PDFs of https://ws-export.wmcloud.org/?page=חוק_זכות_יוצרים&lang=he&format=pdf and of https://he.wikisource.org/api/rest_v1/page/pdf/חוק_זכות_יוצרים
Attached you may find three PDF files. The first one was created by WS Export, the second one is the equivalent PDF file created by "Save to PDF", and the third was created by the ElectronPdfService REST API.

We discussed this in estimation today. The boxes are due to the Parsoid work. We will be collecting a more comprehensive list of the Parsoid-related bugs, which we'll share with the Parsoid team soon. In short, work is in progress on reporting this issue to the relevant team.

@Samwilson Looks like the Parsoid bug here is with item #6 in Fuzzy's comment above where Parsoid is failing to parse some tags. But, CSS and other issues are probably WS-Export specific. Is that your understanding as well?

@Samwilson Do you know if this is a one-off problem on a single page or if this is more widespread?

@ifried – Boxes are Unicode characters not supported by whatever font is used (#4 in my list). This is not a parser bug.
@dmaza – The tags that failed to parse (#6 in my list) are <section begin> and <section end> of the Labeled Section Transclusion extension. This extension is widely used within the Hebrew Wikisource.

General question: How can we configure the Hebrew Wikisource so "Download as PDF" will use PDFService API (see T274521#6822740) instead of WS-Export?

@dmaza – The tags that failed to parse (#6 in my list) are <section begin> and <section end> of the Labeled Section Transclusion extension. This extension is widely used within the Hebrew Wikisource.

This is definitely a Parsoid issue. We will have to address this on the Parsoid end.

Change 674705 had a related patch set uploaded (by Arlolra; author: Arlolra):
[mediawiki/services/parsoid@master] [WIP] Be more permissive for extension tag names

https://gerrit.wikimedia.org/r/674705

Change 674705 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Be more permissive for extension tag names

https://gerrit.wikimedia.org/r/674705

Arlolra subscribed.

@dmaza – The tags that failed to parse (#6 in my list) are <section begin> and <section end> of the Labeled Section Transclusion extension. This extension is widely used within the Hebrew Wikisource.

This is definitely a Parsoid issue. We will have to address this on the Parsoid end.

That part is done and should go out with the train next week

Change 675310 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a30

https://gerrit.wikimedia.org/r/675310

Change 675738 had a related patch set uploaded (by C. Scott Ananian; author: Subramanya Sastry):
[mediawiki/vendor@wmf/1.36.0-wmf.37] Bump wikimedia/parsoid to 0.13.0-a30

https://gerrit.wikimedia.org/r/675738

Change 675310 merged by jenkins-bot:
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a30

https://gerrit.wikimedia.org/r/675310

Change 675738 merged by jenkins-bot:
[mediawiki/vendor@wmf/1.36.0-wmf.37] Bump wikimedia/parsoid to 0.13.0-a30

https://gerrit.wikimedia.org/r/675738

After the deploy and running ?action=purge, the section tags on the page in question are being recognized:
https://he.wikisource.org/api/rest_v1/page/html/%D7%A7%D7%94%D7%9C%D7%AA_%D7%91%2F%D7%98%D7%A2%D7%9E%D7%99%D7%9D
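
A small sketch of how the REST URL above can be built from a page title, for re-checking other pages after a purge. The host and path are taken from the link; the helper name `parsoid_html_url` is hypothetical. Note that the slash in the subpage title has to be percent-encoded as %2F.

```python
from urllib.parse import quote


def parsoid_html_url(title: str, host: str = "he.wikisource.org") -> str:
    """Build the REST URL that returns the Parsoid HTML for a page title."""
    # safe="" also encodes "/", so a subpage title stays one path segment.
    return f"https://{host}/api/rest_v1/page/html/{quote(title, safe='')}"


print(parsoid_html_url("קהלת_ב/טעמים"))
```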

@ifried As mentioned in T274521#6921841, the remaining issues seem to have to do with the font, which isn't a parsing issue. Let us know if you need anything else from the Parsing Team here.

The remaining issues here look like they're to do with font support. I've installed the Culmus fonts (as recommended here) and with, for example, Yehuda CLM the above page looks like this:

hebrewtest-p2.png (2×1 px, 476 KB)

Is this correct?

A default font can be set by modifying the Hebrew Wikisource's WS_Export.json with e.g.

{
    "defaultFont": "Yehuda CLM"
}

@Fuzzy Does the screenshot posted by Sam above look correct? Have you tried modifying (creating) WS_Export.json in the way suggested (probably needs an admin to do) and checked if that fixes the problems?

The problem is not a specific font. The problem is that CSS styles are ignored. As said, in order for WS Export to work –

  • It must import all Common.css and Print.css styles. Those styles are used when a page is printed on paper using a physical printer, or to a PDF file using a virtual printer.
  • It should load Epub.css styles.
  • It should load and render Webfonts properly.

Still, the fundamental issue is that WS Export is not suitable for exporting single articles. We asked to continue with Electron-PDFs as default method of export (see T280637).

P.S., In the near future we have to implement conditional export links. We have an external service that converts legal texts to DOCX files (courtesy of the Israeli Ministry of Justice). When the service becomes mature, we will have to replace the "export to DOC" and "export to PDF" links for a distinctive set of articles. This is another issue, TBD, and irrelevant to the current one.

Hmm. Webfonts is, I think, T270743, and those are not currently supported it seems.

In terms of stylesheets, ws-export is documented to load MediaWiki:Epub.css. Is that not working for heWS? It wouldn't make sense to load all of Common.css and Print.css for this, especially since these sources of style information should ideally be empty or nearly so, since they are loaded unconditionally for all users on every single page, whether they are needed or not. (BTW, I took a look at heWS's Common.css/Print.css and those are really massive. Does heWS have special requirements that necessitate keeping all those styles in global stylesheets, or has the migration to features like TemplateStyles simply not been undertaken yet?)

I don't understand what you mean by "is not suitable for exporting single articles". Could you elaborate? Do you mean something like a single chapter of a book?

I'm aware of your site request to disable ws-export, I am just trying to follow up in the hopes of finding some way for all the Wikisourcen to use this tool together. That would let us share knowhow, documentation, etc.; and would make it easier to persuade the WMF to assign us scarce developer resources. Just as an example, support for ULS webfonts would benefit enWS as well.

I'd suggest limiting the discussion about the suitability of WS-Export for the Hebrew Wikisource to T280637, and keeping this discussion to rendering bugs.

Regarding the stylesheets, Common.css contains all styles used by any wikisource article. It includes, by design, many styles that are not used by a specific article. Print.css overrides some styles for printable version. Both stylesheets should be applied when exporting an article to any e-Book format. Of course, the exported book should not include unused styles. I think the solution is to analyze which styles are actually used by a specific article. This is also the case for embedding webfonts only when needed.
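
To illustrate the "analyze which styles are actually used" idea, here is a deliberately naive sketch. It is not how WS Export works; it only handles flat `selector { ... }` rules (no @media nesting) and only looks at class selectors. The function and class names are hypothetical.

```python
import re
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every class name that appears in an HTML document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def used_rules(css: str, html: str) -> list[str]:
    """Keep rules whose class selectors all occur in the HTML.

    Rules without any class selector (e.g. plain element rules) are kept.
    """
    collector = ClassCollector()
    collector.feed(html)
    kept = []
    for selector, body in re.findall(r"([^{}]+)\{([^{}]*)\}", css):
        wanted = re.findall(r"\.([\w-]+)", selector)
        if all(name in collector.classes for name in wanted):
            kept.append(f"{selector.strip()} {{{body.strip()}}}")
    return kept


# Hypothetical stylesheet and article fragment.
css = ".law-title { font-weight: bold; } .noprint { display: none; } p { margin: 0; }"
html = '<p class="law-title">x</p>'
print(used_rules(css, html))
```

A real implementation would need a proper CSS parser and would also have to account for IDs, attribute selectors and nested at-rules.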

As you wrote, WS-Export should load MediaWiki:Epub.css, but apparently it doesn't work.

The problem is not a specific font.

This issue is mostly about fonts (the other part of it, the markup tags, has been fixed). The description says "Investigate bug in which boxes display in Hebrew Wikisource", and those boxes appear exactly because of the font being used. At the moment, the easiest fix is, as I mentioned in T274521#7019327 above, to set a default font.

I totally understand that this is not the optimum fix though!

It must import all Common.css and Print.css styles. Those styles are used when a page is printed on paper using a physical printer, or to a PDF file using a virtual printer.

Regarding the stylesheets, Common.css contains all styles used by any wikisource article. It includes, by design, many styles that are not used by a specific article. Print.css overrides some styles for printable version. Both stylesheets should be applied when exporting an article to any e-Book format. Of course, the exported book should not include unused styles. I think the solution is to analyze which styles are actually used by a specific article. This is also the case for embedding webfonts only when needed.

I see where you're coming from, and I think for common.css there might be an argument to be made (although I also think the common.css on Hebrew Wikisource contains a lot of stuff that should probably be in templatestyles). I suggest we start a new task for discussing the inclusion of other stylesheets.

It should load Epub.css styles.

It does. Do you have an example of a page on which these styles are not working?

It should load and render Webfonts properly.

Definitely. I've created T294509 to track this.

...although I also think the common.css on Hebrew Wikisource contains a lot of stuff that should probably be in templatestyles

We don't have a problem with interface editors. The TemplateStyles extension seems to be a nice hack to split Common.css for different sub-projects, and I will check it in the future. However, it doesn't matter much, as WS-Export should handle both template styles and the inclusion of common.css and print.css styles.

It should load Epub.css styles.

It does. Do you have an example of a page on which these styles are not working?

I placed the relevant stylesheets at MediaWiki:Epub.css and yet –

Another example: https://he.wikisource.org/wiki/חוק_זכות_יוצרים and the PDFs of https://ws-export.wmcloud.org/?page=חוק_זכות_יוצרים&lang=he&format=pdf and of https://he.wikisource.org/api/rest_v1/page/pdf/חוק_זכות_יוצרים

I placed the relevant stylesheets at MediaWiki:Epub.css and yet –
Another example: https://he.wikisource.org/wiki/חוק_זכות_יוצרים and the PDFs of https://ws-export.wmcloud.org/?page=חוק_זכות_יוצרים&lang=he&format=pdf and of https://he.wikisource.org/api/rest_v1/page/pdf/חוק_זכות_יוצרים

Sorry, I'm not quite sure what parts here are demonstrating the absence of Epub.css, could you explain more? The CSS is definitely being included in the epub, which is then used to create the PDF. Perhaps the issues here are about Calibre's rendering of PDFs?

One difference I notice is that the Knesset_legislation_database.png image doesn't appear: this is because it has the class noprint, which has display: none in Epub.css.

That example might also look better on A4 instead of A5, and with a better font: https://ws-export.wmcloud.org/?page=%D7%97%D7%95%D7%A7_%D7%96%D7%9B%D7%95%D7%AA_%D7%99%D7%95%D7%A6%D7%A8%D7%99%D7%9D&lang=he&format=pdf-a4&fonts=Yehuda%20CLM
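
For reference, a sketch of how such an export link can be assembled. The parameter names (`page`, `lang`, `format`, `fonts`) are the ones visible in the links in this task; the helper name `ws_export_url` and its defaults are assumptions for illustration.

```python
from urllib.parse import quote, urlencode


def ws_export_url(page, lang="he", fmt="pdf-a4", font=None):
    """Build a ws-export link using only the parameters seen in this task."""
    params = {"page": page, "lang": lang, "format": fmt}
    if font:
        params["fonts"] = font
    # quote_via=quote encodes spaces as %20, matching the link above.
    return "https://ws-export.wmcloud.org/?" + urlencode(params, quote_via=quote)


print(ws_export_url("חוק_זכות_יוצרים", font="Yehuda CLM"))
```

Leaving `font` unset falls back to whatever default font is configured for the wiki.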

(@Fuzzy just pinging you in case you didn't see the above.)

This comment was removed by Fuzzy.

Here is a list of open bugs with WS Export. I'll update the list when I have time.

  • (block) Use specified font family and font size according to stylesheets. Import webfonts if necessary.
  • (low) Import common.css and print.css in addition to epub.css.
  • (high) Use @print { size: ... } as default page size.
  • (high) Invalid anchors with external links. External links such as [[חוק העונשין#סעיף 61]] are converted to the correct destination page (he.wikisource.org/wiki/חוק_העונשין) but with anchor #s_yp_61 instead of #סעיף_61. [Note: sometimes the anchors are kept; the conversion is inconsistent.]
  • (high) Keep internal links and anchors. Internal links are converted to the source page with invalid anchors.
  • (medium) Don't include title page and "About" page when .printfooter has display: none; property.

Thanks for the list, this is very useful! (Although please, rather than updating it when you have time, can you create separate tasks for each bug you find? That would make things easier to manage, and would keep this task on-topic.)

(block) Use specified font family and font size according to stylesheets. Import webfonts if necessary.

Tracked in T294509: Include webfonts specified in CSS

(low) Import common.css and print.css in addition to epub.css.

I don't think this is a good idea, for the reasons that Xover explained above.

(high) Use @print { size: ... } as default page size.

Tracked in T298226: Allow use of @page CSS for PDFs

(high) Invalid anchors with external links. External links such as [[חוק העונשין#סעיף 61]] are converted to the correct destination page (he.wikisource.org/wiki/חוק_העונשין) but with anchor #s_yp_61 instead of #סעיף_61. [Note: sometimes the anchors are kept; the conversion is inconsistent.]

I've tested this on betawikisource:Links and it appears to be working correctly. Could you link to an instance in which it's failing?

(high) Keep internal links and anchors. Internal links are converted to the source page with invalid anchors.

Do you have an example of this? It sounds a little bit like T275632 but perhaps a bit different.

(medium) Don't include title page and "About" page when .printfooter has display: none; property.

I don't understand the relationship between these pages and the .printfooter class (neither of them contain that class), but it does seem like an option to exclude the Title and About pages could be useful. There would probably still need to be something somewhere in the book that says where it's come from, but perhaps it could be smaller than the existing About page. I've created T298227 to discuss this.


To return to the topic at hand, I note that there is still no default font set in מדיה ויקי:WS Export.json. I think that adding a good Hebrew-supporting font will solve the main font problems that are described here. (I understand that you'd rather have full support for CSS-specified fonts, but it's a pretty simple fix to get something working better in the short term.)

Since this ticket has helped us scope out more specific work in the tickets @Samwilson has cut, we are resolving this one and using the others as more detailed work.