
improve detectTofu algorithm so it can detect replacement characters in fixed-width glyphs
Open, Normal, Public

Description

The detectTofu function finds glyphs which are missing from a font (and so are replaced by a replacement character or "tofu").

The current algorithm works as follows:

  1. Measure the rendered width/height of each character in a test string.
  2. Compare to a character that is known to be replaced.
  3. If each character is the same size (including the replacement), then conclude that all characters are missing glyphs.

This works very well for many languages. However, it fails for Chinese, because typically all Han character glyphs in a font are the same size as the replacement character glyph. Also, there is no such thing as a 'complete Han font': there are always missing characters.
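
A minimal sketch of the size-comparison technique described above, not the actual ULS code. The names `looksLikeTofu`, `proportionalMeasure` and `fixedWidthMeasure` are hypothetical. `measure` is an assumed callback returning the rendered `{ width, height }` of one character; in a browser it could read `offsetWidth`/`offsetHeight` from an off-screen element. Passing it in keeps the sketch runnable outside a browser and makes the fixed-width failure easy to reproduce.

```javascript
// Hypothetical sketch of the size-based check; not the real ULS implementation.
// `measure( ch )` is an assumed callback returning { width, height } for one character.
function looksLikeTofu( text, knownMissing, measure ) {
    var ref = measure( knownMissing );
    // Array.from splits by code point, so characters outside the BMP stay intact.
    return Array.from( text ).every( function ( ch ) {
        var size = measure( ch );
        return size.width === ref.width && size.height === ref.height;
    } );
}

// Fake measurer for a proportional font: only missing glyphs get the 10px tofu box.
function proportionalMeasure( ch ) {
    var widths = { a: 7, b: 8, i: 3 };
    return { width: widths[ ch ] || 10, height: 16 };
}

// Fake measurer for a fixed-width (Han-like) font: every glyph is 16x16, tofu included.
function fixedWidthMeasure() {
    return { width: 16, height: 16 };
}
```

With `fixedWidthMeasure`, the check reports tofu for any text, whether or not the glyphs exist. That is the false positive described above for Han characters.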

Therefore, we should implement a more sophisticated approach:

  1. Start with the above algorithm for speed.
  2. Render a character to an HTML canvas.
  3. Compare its bitmap to the bitmap of the replacement character glyph.

This will allow us to detect exactly which characters are missing, regardless of width/height.
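
The bitmap comparison could look roughly like this. It is a sketch under assumptions, not the submitted patch: `renderGlyph`, `sameBitmap` and `findMissing` are hypothetical names, the canvas and font sizes are arbitrary, and `ctx` is assumed to be a `CanvasRenderingContext2D` (for example from `document.createElement( 'canvas' ).getContext( '2d' )`). It also assumes the browser draws every missing glyph identically, which may not hold in every browser.

```javascript
var SIZE = 32; // arbitrary canvas edge length in pixels (assumption)

// Draw one character and return its RGBA pixel data.
function renderGlyph( ctx, ch, fontFamily ) {
    ctx.clearRect( 0, 0, SIZE, SIZE );
    ctx.font = '24px ' + fontFamily;
    ctx.textBaseline = 'top';
    ctx.fillText( ch, 0, 0 );
    return ctx.getImageData( 0, 0, SIZE, SIZE ).data;
}

// Byte-wise comparison of two pixel buffers.
function sameBitmap( a, b ) {
    if ( a.length !== b.length ) {
        return false;
    }
    for ( var i = 0; i < a.length; i++ ) {
        if ( a[ i ] !== b[ i ] ) {
            return false;
        }
    }
    return true;
}

// Return the characters of `text` whose bitmap matches a known-missing character.
function findMissing( ctx, text, knownMissing, fontFamily ) {
    var tofu = renderGlyph( ctx, knownMissing, fontFamily );
    return Array.from( text ).filter( function ( ch ) {
        return sameBitmap( renderGlyph( ctx, ch, fontFamily ), tofu );
    } );
}
```

Because the result is a list of characters rather than a single boolean, a caller can tell exactly which glyphs are missing.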


Version: unspecified
Severity: enhancement
See Also:
T33791: Add web fonts for Chinese scripts

Details

Reference
bz63122

Event Timeline

bzimport raised the priority of this task from to Normal. Nov 22 2014, 3:04 AM
bzimport set Reference to bz63122.
bzimport added a subscriber: Unknown Object (MLST).
dchan created this task. Mar 26 2014, 5:55 PM
Qgil added a comment. Mar 26 2014, 6:00 PM

Please don't take this bug unless you are a GSoC student working on Bug 31791 - Add web fonts for Chinese scripts. Thank you.

xiaoxiangquan wrote:

Patch for the detect chinese tofu function

I hope it works. I have just finished the function, but I don't yet know how to integrate it with ULS.


xiaoxiangquan wrote:

A simple test page

It will show an alert for a tofu character and for a non-tofu character.


(In reply to Xiangquan Xiao from comment #2)

Created attachment 14941 [details]
Patch for the detect chinese tofu function

Thanks for your patch!
You are welcome to use Developer access

https://www.mediawiki.org/wiki/Developer_access

to submit this as a Git branch directly into Gerrit:

https://www.mediawiki.org/wiki/Git/Tutorial

Putting your branch in Git makes it easier to review it quickly. If you don't want to set up Git/Gerrit, you can also use https://tools.wmflabs.org/gerrit-patch-uploader/
Thanks again! We appreciate your contribution.


xiaoxiangquan wrote:

(In reply to Andre Klapper from comment #4)

Putting your branch in Git makes it easier to review it quickly. If you
don't want to set up Git/Gerrit, you can also use
https://tools.wmflabs.org/gerrit-patch-uploader/
Thanks again! We appreciate your contribution.

Thanks for the information. I've set up gerrit by following the tips.

Actually, it's an incomplete fix, so I'm just leaving the test page there to show how it works, as the microtask for my GSoC application.

A complete fix will be submitted soon using Gerrit.

dchan added a comment. Mar 27 2014, 6:00 PM

Thanks Xiangquan, that's an extremely good start!

When you submit to gerrit, I'll post more detailed comments there.

Be sure to put 'Bug: 63122' (without quotes) in your commit message, on a line of its own, immediately above the change ID, with no extra whitespace.
Then gerrit will post comments automatically to this bug.

xiaoxiangquan wrote:

(In reply to David Chan from comment #6)

Thanks Xiangquan, that's an extremely good start!
When you submit to gerrit, I'll post more detailed comments there.
Be sure to put 'Bug: 63122' (without quotes) in your commit message, on a
line of its own, immediately above the change ID, with no extra whitespace.
Then gerrit will post comments automatically to this bug.

Hi, I'd like to clarify something.

  1. Do we need a separate function, like detectChineseTofu(), as in my previous patch? If so, in which scenario would it be called?
  2. Or is this an improvement to the existing detectTofu() to make it work for Chinese? If so, may I simply replace the old solution, since the new one (comparing images) will work for almost all languages? Although it is slower than comparing only widths and heights, a unified solution looks much simpler.
dchan added a comment. Mar 28 2014, 4:35 PM

(2) is correct. Your method is more precise and works for more languages. However the old method is faster[*], and completely reliable if it returns false. Therefore we should do the following pseudo-code:

function detectTofu ( text ) {
    maybeTofu = <old technique>;
    if ( maybeTofu ) {
        isTofu = <new technique>;
    } else {
        isTofu = false;
    }
    return isTofu;
}

[*] I *presume* the old method is faster, but I have not actually tested this. Feel free to do so and to post actual numbers here!
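
A runnable version of the pseudo-code above might look like the following sketch. The name `detectTofuTwoStage` and the parameters `sizeCheck` and `bitmapCheck` are hypothetical stand-ins for the old and new techniques. They are passed in as functions so that the control flow can be exercised on its own.

```javascript
// Two-stage check: run the cheap test first, and only fall back to the
// expensive one when the cheap test cannot rule tofu out.
// `sizeCheck` and `bitmapCheck` are assumed callbacks: ( text ) => boolean.
function detectTofuTwoStage( text, sizeCheck, bitmapCheck ) {
    if ( !sizeCheck( text ) ) {
        // A false result from the size-based check is treated as reliable.
        return false;
    }
    return bitmapCheck( text );
}
```

The expensive check runs only when the size-based check says "maybe", so text that already renders at varying sizes never reaches the canvas comparison.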

xiaoxiangquan wrote:

(In reply to David Chan from comment #8)

(2) is correct. Your method is more precise and works for more languages.
However the old method is faster[*], and completely reliable if it returns
false. Therefore we should do the following pseudo-code:
function detectTofu ( text ) {
    maybeTofu = <old technique>;
    if ( maybeTofu ) {
        isTofu = <new technique>;
    } else {
        isTofu = false;
    }
    return isTofu;
}

Hi, what about a sentence that contains only one tofu character, which is common in languages like Chinese?
detectTofu(text) will return true in such a situation. Is that correct?

BTW, I'll test the performance of both techniques and post the results here :)

Change 122277 had a related patch set uploaded by Xiaoxiangquan:
uls: Improve detectTofu algorithm to detect fixed-width glyphs

https://gerrit.wikimedia.org/r/122277

xiaoxiangquan wrote:

(In reply to Gerrit Notification Bot from comment #10)

Change 122277 had a related patch set uploaded by Xiaoxiangquan:
uls: Improve detectTofu algorithm to detect fixed-width glyphs
https://gerrit.wikimedia.org/r/122277

Sorry, I haven't set up a proper testing environment yet (I'm currently trying Vagrant), so it's not well tested. I tried to make it bug-free.

Change 153375 had a related patch set uploaded (by Nemo bis):
Detect tofu with the specified font family, and display popup when click the tofu

https://gerrit.wikimedia.org/r/153375

Change 122277 had a related patch set uploaded (by Nemo bis):
uls: Improve detectTofu algorithm to detect fixed-width glyphs

https://gerrit.wikimedia.org/r/122277

(Removing myself from reviewers.) See the notes about generic font detection experiments at https://www.mediawiki.org/wiki/Universal_Language_Selector/WebFonts#Font_detection

I'm not sure I understand this comment, but it seems to imply that the approach attempted by the patch may not be interesting to the ULS maintainers. If so, perhaps the patch needs to be abandoned? The linked page says that the approach to use is:

A special blank font named Tofu is being attempted by Behdad Esfahbod with a few bytes size. But the technology used for that is very advanced and current browsers do not support it.

However, the linked discussion is quite messy and I don't understand what technology is being referred to. Does someone know? It would be nice to check what the browser support for it is now, one year later.

Qgil removed a subscriber: Qgil. Dec 28 2015, 1:31 PM

@xiaoxiangquan, ping. Do you plan to work on this further? If so, do you need help from us? If not, I recommend abandoning the patch, since it is too old and we are trying to minimize the webfonts features going forward and gradually retire them where they are not really necessary nowadays.

Ping @dchan too, since he was mentoring this.

@xiaoxiangquan, ping. Do you plan to work on this further?

That username no longer exists, you probably have to write on their user talk page.

we are trying to minimize the webfonts features going forward and gradually retire them where they are not really necessary

Isn't the purpose of tofu detection precisely to determine when webfonts are needed? How is this report in contrast with your stated goal?

Change 122277 abandoned by Nikerabbit:
uls: Improve detectTofu algorithm to detect fixed-width glyphs

Reason:
This patch has been stale for a long time. Language team does not have time to work on this patch; we are trying to reduce the long term maintenance efforts needed for the web fonts feature. If someone wants to work on this, please ask to unabandon or submit as a new patch.

https://gerrit.wikimedia.org/r/122277

Change 153375 abandoned by Nikerabbit:
Detect tofu with the specified font family, and display popup when click the tofu

Reason:
This patch has been stale for a long time. Language team does not have time to work on this patch; we are trying to reduce the long term maintenance efforts needed for the web fonts feature. If someone wants to work on this, please ask to unabandon or submit as a new patch.

https://gerrit.wikimedia.org/r/153375

So both patches are abandoned; maybe it's not worth declining?

The detectTofu function only works in Chrome; it does not work in Firefox. Firefox renders tofu glyphs in a different way.