Consider word-breaks as a way to improve readability in languages with long words
Open, Medium · Public · Design

Description

There are well-established readability norms for characters per line (CPL), and during the recent typographic upgrades I've been paying close attention to CPL to make sure it doesn't get too high or too low, especially on mobile.

While I was doing this, I noticed that in languages like German that have many long words, the functional CPL is actually much lower than the max available on the screen. The result is that many lines only have 1 or 2 words on them with large, jagged spaces to their right.

For example:

Screenshot 2024-02-02 at 10.52.31 AM.png (1×966 px, 343 KB)

Using a simple word-break: break-all in the CSS is not an ideal solution either. It recovers most of the screen width for a better CPL, but it breaks words in non-optimal ways, often leaving a single letter on one line with the rest of the word on the following line, and it doesn't append a "-" character, as is normal practice in typography.

Screenshot 2024-02-02 at 11.17.55 AM.png (1×782 px, 352 KB)

Typically, typographers go through a book or manuscript line-by-line and manually adjust these kinds of things so that the ragged right edge of the text remains proportionally correct. Obviously we're not going to do that.

I'm wondering, could we write a script that would change the layout of the main article body content such that words would break only when at least 3 letters are on the upper line, followed by a "-" and the rest of the word on the line below?

From

Screenshot 2024-02-02 at 12.17.03 PM.png (298×782 px, 48 KB)

To

Screenshot 2024-02-02 at 12.16.49 PM.png (292×780 px, 48 KB)

This wouldn't get us to optimal typography, but it will go a long way in articles that have a lot of long words that are being read on mobile.

Event Timeline

@Pginer-WMF Have you folks thought about this issue at all? Seems like it would be up your alley.

JScherer-WMF triaged this task as Low priority.
ovasileva subscribed.

Discussed in sprint planning: @JScherer-WMF will review a number of these tickets and combine them into a single implementation ticket for sprint 6.

From a typography perspective it makes sense to use word breaks to improve readability. I'm not familiar with the current technology for properly supporting different languages. I added @Nikerabbit, @santhosh and @Amire80, who may have a better sense.

From my perspective I think it is important to consider:

  • Word breaks are applied in ways that respect the language rules. For example, in Spanish it is correct to break "causa" into "cau- / sa", but not into "ca- / usa". These tend to follow complex rules (e.g., you can check point 2, with its 15 sub-points, on this page for Spanish).
  • Do not create unexpected artifacts. I'm thinking about languages where multiple parts of a "character" get assembled as users type and are not expected to be divided. For example, in Korean, typing "ㄷ", "ㅏ" and "ㄹ" results in "달"; the word-breaking algorithm should treat "달" as a single character and not try to place the dash in between. I don't know Korean grammar, but I can imagine that breaking the above as "다-" and "ㄹ" is likely to be wrong.
  • Transfer unbroken words to the clipboard when copying content. Copying a paragraph should result in the contents copied without any layout-specific dashes. Otherwise users are forced to do tedious cleanup tasks if they want to copy content to make some slides.

Browsers natively support hyphenation (breaking the word at the proper position) these days, so there is no need to change the content for this. The following CSS example shows how to do it. I developed the hyphenation system for Indian languages, and that is what Chrome, Firefox, TeX, LibreOffice, InDesign, etc. use these days.

The example below is a hyphenation setting for Malayalam, but each property can be tweaked to get optimal rendering per language. The minimum number of characters before and after the break can be configured as well.

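.hyphenate {
    /* Let the browser hyphenate based on the element's declared language */
    hyphens: auto;
    /* Empty string: no visible hyphen is drawn at the break */
    hyphenate-character: "";
    /* Minimum word length 6; at least 3 characters before and 2 after the break */
    hyphenate-limit-chars: 6 3 2;
    /* Avoid hyphenating the last full line of the element */
    hyphenate-limit-last: always;
    /* Hyphenate only if the line would otherwise leave more than 8% unfilled */
    hyphenate-limit-zone: 8%;
}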

Here is a codepen example that uses the above style https://codepen.io/santhoshtr/pen/dyrBbxL
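
For the German case in the task description, the same properties could be scoped by language. The following is only a minimal sketch: the .article-body class is a hypothetical stand-in for the article content wrapper, not Vector's actual selector, and the limits are illustrative. It also assumes the content carries a correct lang attribute, since hyphens: auto relies on the declared language to pick hyphenation rules. Browser support for hyphenate-limit-chars varies, so this would need testing.

```css
/* Hypothetical selector; the real article wrapper may differ. */
.article-body:lang(de) {
  /* Automatic, language-aware hyphenation; hyphens are rendered only
     and are not added to the page content. */
  hyphens: auto;
  /* Word length: auto; at least 3 characters before the break
     (the "3 letters on the upper line" rule); after the break: auto. */
  hyphenate-limit-chars: auto 3 auto;
}
```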

Using hyphenation even in left aligned content is recommended for better readability. Here is my essay on web typography if interested https://docs.thottingal.in/web-typography

Recently, browsers shipped more detailed balancing for text wrapping. See https://developer.chrome.com/blog/css-text-wrap-pretty/
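
As a rough illustration of the feature described in that post, opting paragraphs in could look like the sketch below. The selector is hypothetical and not taken from Vector. Browsers that do not recognize the value drop the declaration, so the rule degrades to normal wrapping.

```css
/* Hypothetical selector for body paragraphs. */
.article-body p {
  /* Ask the browser for higher-quality line breaking where supported. */
  text-wrap: pretty;
}
```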

@JScherer-WMF thanks for raising this issue. There's one place in Vector where we're not using hyphens but we really should: the TOC. We're using word-break: break-word; instead of hyphens: auto; which breaks words without hyphens. I'd consider this a bug.

Screenshot 2024-04-05 at 11.19.49 AM.png (1×3 px, 1 MB): current
Screenshot 2024-04-05 at 11.20.05 AM.png (1×3 px, 1 MB): better
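
A possible shape for the TOC change, as a sketch only: the .toc-link-text selector is a hypothetical placeholder rather than the class Vector actually uses, and the fallback line is an assumption about how to handle strings that have no hyphenation opportunity.

```css
/* Hypothetical selector standing in for the TOC entry text. */
.toc-link-text {
  /* Replaces word-break: break-word, which breaks without showing a hyphen. */
  hyphens: auto;
  /* Assumed fallback: still break an overlong string (e.g. a URL)
     when no hyphenation point is available. */
  overflow-wrap: break-word;
}
```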

@santhosh Thanks for the context on this! I'll add it to the backlog and we can triage it.

I need to run through this with our devs, @ovasileva, to determine what they would need in order to estimate this in light of @Pginer-WMF's and @santhosh's comments above.

JScherer-WMF raised the priority of this task from Low to Medium. Mon, Apr 8, 5:23 PM

@JScherer-WMF - leaving this one for the backlog for now. Let me know if that's okay or if you feel like we should prioritize it within the next couple of sprints. If you get a chance - could you also convert the format to the task form?