
Investigation: Create new Han characters with IDS extension for Wikisource
Closed, Resolved | Public | 3 Estimated Story Points

Description

Investigation card for wish #20: Create new Han characters with IDS extension for Wikisource

The main ticket is T137786: Deploy IDS extension to zh.wikisource.

The investigation is -- what still needs to be done? Several people have been working on it. How can we help to get it finished?

See also https://meta.wikimedia.org/wiki/2016_Community_Wishlist_Survey/Categories/Wikisource#Create_new_Han_Characters_with_IDS_extension_for_WikiSource


Investigation

Background:

  • Written Chinese involves "inventing" new characters all the time by combining a bunch of standard characters. For example this ideograph is a combination of 10 individual Chinese characters. Together they make sense as a new phrase or sentence. Since such new ideographs are created all the time (like how new phrases are created using standard words), it's not feasible to think about adding them all to Unicode.
  • Adding IDS support to wikis means that rather than using manually created images to represent these ideographs, contributors will be able to add them directly from the edit interface. IDS operates via 12 special 'operator' Unicode characters that prefix other characters to create new ideographs.
  • The IDS extension works by sending these character combinations (defined within an <ids>…</ids> element) to a web service that returns a PNG of the resulting ideograph.
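As a rough illustration of how the operator prefixes described above nest, here is a minimal sketch of an IDS parser. It assumes the 12 operators are the Unicode Ideographic Description Characters U+2FF0 to U+2FFB, of which U+2FF2 and U+2FF3 take three operands and the rest take two. The function names `parse_ids` and `_parse` are hypothetical and are not taken from the extension or from han3_ji7_tsoo1_kian3.

```python
# Assumption: the 12 operators are the Ideographic Description Characters
# U+2FF0..U+2FFB. U+2FF2 and U+2FF3 take three operands; the others take two.
TERNARY = {"\u2ff2", "\u2ff3"}
OPERATORS = {chr(cp) for cp in range(0x2FF0, 0x2FFC)}


def parse_ids(text):
    """Parse one IDS string into a nested tuple (operator, operand, ...)."""
    node, end = _parse(text, 0)
    if end != len(text):
        raise ValueError("trailing characters after a complete sequence")
    return node


def _parse(text, pos):
    """Parse the sequence starting at pos; return (node, next position)."""
    if pos >= len(text):
        raise ValueError("sequence ends before all operands are given")
    ch = text[pos]
    pos += 1
    if ch not in OPERATORS:
        return ch, pos  # a plain component character
    arity = 3 if ch in TERNARY else 2
    operands = []
    for _ in range(arity):
        operand, pos = _parse(text, pos)
        operands.append(operand)
    return (ch, *operands), pos
```

For example, `parse_ids("⿱艹⿰木木")` yields an outer top-to-bottom node whose second operand is itself a left-to-right node. Because each operator is a prefix, no brackets are needed to express the nesting.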

Current situation:

Still required:

Event Timeline

kaldari set the point value for this task to 3. Dec 20 2016, 10:29 PM
kaldari moved this task from Needs Discussion to Up Next (May 6-17) on the Community-Tech board.
Samwilson updated the task description. (Show Details)
Samwilson edited projects, added Community-Tech-Sprint; removed Community-Tech.
Samwilson moved this task from Ready to In Development on the Community-Tech-Sprint board.
Samwilson added subscribers: Shoichi, awight.

To explain a little more about what the extension is actually trying to do: Written Chinese involves "inventing" new characters all the time by combining a bunch of standard characters. For example this ideograph is a combination of 4 individual Chinese characters. Together they make sense as a new phrase or sentence. Since such new ideographs are created all the time (like how new phrases are created using standard words), it's not feasible to think about adding them all to Unicode. Hence the idea is to generate the ideograph images on the fly using an <ids>../..</ids> element which takes Unicode-compatible characters as input.

Regarding translating han3_ji7_tsoo1_kian3 into English for security (etc.) review: I will set up a team this week. After discussing it with the upstream author, my plan is to provide English translations as comments for each function name and variable name that uses Han characters. (What makes this a little difficult is that those Han characters include not only Mandarin but also Taiwanese, a bit like English mixed with German.) Then the security review can go ahead. The translation will start from the core of the web service, the servlet.

That sounds great @Shoichi! So it sounds like it isn't going to be necessary to maintain a separate fork of han3_ji7_tsoo1_kian3; is that correct? (That'd be best, of course!)

To explain a little more about what the extension is actually trying to do: Written Chinese involves "inventing" new characters all the time by combining a bunch of standard characters. For example this ideograph is a combination of 4 individual Chinese characters.

Excuse me, this one is a combination of 10 individual characters. ^^!!

That sounds great @Shoichi! So it sounds like it isn't going to be necessary to maintain a separate fork of han3_ji7_tsoo1_kian3; is that correct? (That'd be best, of course!)

Yes. After discussing it with him, we found that it's too complicated. I have an agreement with him: he keeps using function names and variable names written in Han characters, and I set up a team to translate their meaning into English in comments (and also to translate the other Mandarin or Taiwanese comments).

I will start by making a branch in upstream and then cloning it. After our work is done, we will merge it back to upstream.

Actually, one thing you don't know: originally all of its source file names were in Chinese too. I translated them to English for compatibility with non-Chinese file systems (some Python 2 script problems). After this discussion and testing, he just gave up on reverting the English file names to Chinese names. XD

To explain a little more about what the extension is actually trying to do: Written Chinese involves "inventing" new characters all the time by combining a bunch of standard characters. For example this ideograph is a combination of 4 individual Chinese characters.

Excuse me, this one is a combination of 10 individual characters. ^^!!

Apologies! I recalled differently from my conversation with Liang earlier but I'm probably confusing this with another character he showed me. :)

@Niharika I added your summary to the investigation text.

@Shoichi Translation of comments sounds good. (By the way, what's the title of the tool? 'han3_ji7_tsoo1_kian3' sounds like a transliteration or abbreviation or something?)

If the tool is to be maintained by you and other people upstream then I can't see it being too much of a problem if it's harder work for English-only developers. Getting it deployed to production perhaps will require more in-depth security review, but as it doesn't need to store any data or anything maybe it'll be easy. Anyway, certainly it can live on Tool Labs as it currently is (which is what we're doing, for example, with the Google OCR system for Indic proofreading).

To explain a little more about what the extension is actually trying to do: Written Chinese involves "inventing" new characters all the time by combining a bunch of standard characters. For example this ideograph is a combination of 4 individual Chinese characters.

Excuse me, this one is a combination of 10 individual characters. ^^!!

Apologies! I recalled differently from my conversation with Liang earlier but I'm probably confusing this with another character he showed me. :)

That's ok, @Niharika. What I showed you is the glyph composed of four Unicode characters, "招財進寶" (zhāo cái jìn bǎo). There is a detailed explanation of this glyph on the Adobe blog; please see this link for your reference: https://blogs.adobe.com/CCJKType/2009/01/diy.html .

@Shoichi Translation of comments sounds good. (By the way, what's the title of the tool? 'han3_ji7_tsoo1_kian3' sounds like a transliteration or abbreviation or something?)

@Samwilson: "han3 ji7 tsoo1 kian3" is the Taiwanese pronunciation of the Han characters "漢字組件" (composites of Han characters): "han3 ji7" (漢字) means Han characters, and "tsoo1 kian3" (組件) means composites.

Still required:

  • How do screen-readers (etc.) handle IDS? The extension adds the input string (including the operator prefix character) as the alt attribute of the image.

@Samwilson: A bit confused by this last required item. Is there any action that needs to be taken or was this just a question that has been answered?

Sorry, yes, that was just a question and answer. All good as is.
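To make the screen-reader point above concrete, here is a minimal sketch of rendering an IDS string as an image tag whose alt attribute carries the input string. The service URL and its path layout are assumptions made only for illustration (the thread does not give the real endpoint), and `ids_to_img` is a hypothetical helper, not the extension's code.

```python
from html import escape
from urllib.parse import quote

# Hypothetical endpoint; the real web service URL is not given in this task.
SERVICE_URL = "https://ids-service.example/render"


def ids_to_img(ids_text, service_url=SERVICE_URL):
    """Build an <img> tag whose alt text is the raw IDS input string."""
    # Percent-encode the IDS string so it is safe inside a URL path.
    src = f"{service_url}/{quote(ids_text)}.png"
    # Escape both attributes so quotes or angle brackets cannot break the tag.
    return f'<img src="{escape(src)}" alt="{escape(ids_text)}">'
```

With this shape, a screen reader that encounters the image would be handed the operator character followed by the component characters, which is the behaviour the question above describes.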

That sounds great @Shoichi! So it sounds like it isn't going to be necessary to maintain a separate fork of han3_ji7_tsoo1_kian3; is that correct? (That'd be best, of course!)

Yes. After discussing it with him, we found that it's too complicated. I have an agreement with him: he keeps using function names and variable names written in Han characters, and I set up a team to translate their meaning into English in comments (and also to translate the other Mandarin or Taiwanese comments).

I will start by making a branch in upstream and then cloning it. After our work is done, we will merge it back to upstream.

Actually, one thing you don't know: originally all of its source file names were in Chinese too. I translated them to English for compatibility with non-Chinese file systems (some Python 2 script problems). After this discussion and testing, he just gave up on reverting the English file names to Chinese names. XD

I organized the translation team on 12/22. We have 5 people. We will start soon; the network service code of han3_ji7_tsoo1_kian3 has high priority.

Still required:

  • How do screen-readers (etc.) handle IDS? The extension adds the input string (including the operator prefix character) as the alt attribute of the image.

@Samwilson: A bit confused by this last required item. Is there any action that needs to be taken or was this just a question that has been answered?

The input string just uses IDS as an "encoding"; you can think of an IDS as a new "Unicode code point". As for screen readers, if you know how a character is pronounced, then in the future you can simply assign a pronunciation to its IDS. But, honestly, Han characters are ideographic and very visual. Not every one can be pronounced. Even among the Han characters already encoded in Unicode, tens of thousands have no pronunciation. The Han characters that can be pronounced number perhaps 5,000-20,000 (depending on the language). I think deciding how they should be pronounced is another giant subject. (Counting Han characters both in and out of Unicode, there are now around 400,000 or more in the world.)