Description

Investigation card for wish #20: Create new Han characters with IDS extension for Wikisource

The main ticket is T137786: Deploy IDS extension to zh.wikisource.

The investigation question is: what still needs to be done? Several people have been working on it. How can we help to get it finished?

Investigation

Background:

Written Chinese involves "inventing" new characters all the time by combining a bunch of standard characters. For example this ideograph is a combination of 10 individual Chinese characters. Together they make sense as a new phrase or sentence. Since such new ideographs are created all the time (like how new phrases are created using standard words), it's not feasible to think about adding them all to Unicode.
Adding IDS (Ideographic Description Sequence) support to wikis means that rather than using manually-created images to represent these ideographs, contributors will be able to add them directly from the edit interface. IDS operates via 12 special 'operator' Unicode characters that prefix other characters to create new ideographs.
The IDS extension works by sending these character combinations (defined within an <ids>…</ids> element) to a web service that returns a PNG of the resulting ideograph.
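
A minimal sketch of what a well-formed IDS looks like, assuming the standard Unicode Ideographic Description Characters U+2FF0 to U+2FFB (U+2FF2 and U+2FF3 take three operands, the other ten take two). The function names are illustrative and do not come from the extension or the web service, which may accept a different or wider syntax.

```python
# Ideographic Description Characters (IDCs): the 12 operators U+2FF0..U+2FFB.
# U+2FF2 and U+2FF3 take three operands; the other ten take two.
TERNARY = {"\u2ff2", "\u2ff3"}
OPERATORS = {chr(cp) for cp in range(0x2FF0, 0x2FFC)}


def parse_ids(text, pos=0):
    """Parse one IDS starting at text[pos]; return the index just past it."""
    if pos >= len(text):
        raise ValueError("unexpected end of sequence")
    ch = text[pos]
    pos += 1
    if ch in OPERATORS:
        # An operator prefixes its operands, each of which is itself an IDS.
        arity = 3 if ch in TERNARY else 2
        for _ in range(arity):
            pos = parse_ids(text, pos)
    return pos


def is_valid_ids(text):
    """True if text is exactly one well-formed IDS that starts with an operator."""
    if not text or text[0] not in OPERATORS:
        return False
    try:
        return parse_ids(text) == len(text)
    except ValueError:
        return False
```

Because each operator is a prefix, sequences nest: an operand can itself begin with an operator, which is how larger combinations are described.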

Current situation:

Active development is underway (most recent activity a couple of weeks ago) by users including @Shoichi and @awight
The (GPL-2.0) extension code is at https://github.com/MGdesigner/Mediawiki-IDSextension — it's only about 25 lines of code
The (Java, AGPL-3.0) web service tool is at https://github.com/sih4sing5hong5/han3_ji7_tsoo1_kian3 (27 open issues, but most are from within about the last 18 months)
- A fork of this is maintained by Wikimedia Taiwan, but it's currently the same as upstream
- Much of han3_ji7_tsoo1_kian3 is in Chinese, which is a barrier to non-Sinophone developers — however, it sounds like the upstream author has agreed that things should be translated to English instead
A test installation of the web service is running at https://tools.wmflabs.org/idsgen/
The mediawiki/extensions/Ids repository has been requested by @awight
A test wiki has been set up at http://ids-testing.wmflabs.org/wiki/

Still required:

T153989: Get mirror of IDS Extension repository set up in Gerrit/Diffusion
Set up an IDS Phabricator project — IDS-extension
Security (and code style etc.) review of the extension
Add ability to configure the web service endpoint to the extension (hardcoded at the moment)
Translate han3_ji7_tsoo1_kian3 into English
Security (etc.) review of han3_ji7_tsoo1_kian3
T137786: Deploy IDS extension to zh.wikisource
T148693: Deploy IDS rendering engine to production (possibly)
Add a caching layer as part of the extension (if T148693 isn't practical)
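
The configurable endpoint and the caching layer listed above could be sketched roughly as follows. This is only an illustration under stated assumptions: the default URL, the path layout (`<endpoint>/<percent-encoded IDS>.png`) and all names are hypothetical, since the real service's URL scheme is not described here, and the extension itself is not written in Python.

```python
from urllib.parse import quote

# Hypothetical default; the real service's URL scheme is not specified here.
DEFAULT_ENDPOINT = "https://ids.example.org/render"


def image_url(ids, endpoint=DEFAULT_ENDPOINT):
    """Build the image URL for an IDS string against a configurable endpoint."""
    return endpoint.rstrip("/") + "/" + quote(ids, safe="") + ".png"


class CachedRenderer:
    """Fetch each distinct IDS image at most once, keeping the bytes in memory."""

    def __init__(self, fetch, endpoint=DEFAULT_ENDPOINT):
        self._fetch = fetch        # callable: URL -> PNG bytes (injected)
        self._endpoint = endpoint  # configurable instead of hardcoded
        self._cache = {}

    def render(self, ids):
        """Return PNG bytes for ids, calling the web service only on a miss."""
        if ids not in self._cache:
            self._cache[ids] = self._fetch(image_url(ids, self._endpoint))
        return self._cache[ids]
```

Injecting the `fetch` callable keeps the sketch independent of any particular HTTP client and makes the cache easy to exercise without network access.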

Status    Subtype  Assigned   Task
Open               None       T154044 Epic: Support custom Han characters on Chinese Wikisource
Resolved           Samwilson  T153796 Investigation: Create new Han characters with IDS extension for Wikisource

Event Timeline

DannyH created this task. Dec 20 2016, 9:41 PM

Restricted Application added a subscriber: Aklapper. Dec 20 2016, 9:41 PM

DannyH updated the task description. Dec 20 2016, 10:25 PM

kaldari set the point value for this task to 3. Dec 20 2016, 10:29 PM

kaldari moved this task from Needs Discussion to Up Next (May 6-17) on the Community-Tech board.

Samwilson claimed this task. Dec 21 2016, 6:16 AM

Samwilson updated the task description.

Samwilson edited projects, added Community-Tech-Sprint; removed Community-Tech.

Samwilson moved this task from Ready to In Development on the Community-Tech-Sprint board.

Samwilson added subscribers: Shoichi, awight.

Samwilson added a project: IDS-extension. Dec 21 2016, 6:30 AM

Samwilson moved this task from Backlog to In development on the IDS-extension board.

To explain a little more about what the extension is actually trying to do: Written Chinese involves "inventing" new characters all the time by combining a bunch of standard characters. For example this ideograph is a combination of 4 individual Chinese characters. Together they make sense as a new phrase or sentence. Since such new ideographs are created all the time (like how new phrases are created using standard words), it's not feasible to think about adding them all to Unicode. Hence the idea is to generate the ideograph images on the fly using an <ids>../..</ids> element which takes unicode-compatible characters as input.

Samwilson updated the task description. Dec 21 2016, 6:36 AM

Niharika updated the task description. Dec 21 2016, 6:41 AM

Regarding translating han3_ji7_tsoo1_kian3 into English for the security (etc.) review: I will set up a team this week. After discussion with the upstream author, my plan is to provide English translations as comments for each function name and variable name that uses Han characters. (What makes this a little difficult is that those Han characters include not only Mandarin but also Taiwanese, a little like English mixed with German.) Then the security review can proceed. Translation will start from the core of the web service, the servlet.

That sounds great @Shoichi! So it sounds like it isn't going to be necessary to maintain a separate fork of han3_ji7_tsoo1_kian3; is that correct? (That'd be best, of course!)

In T153796#2892171, @Niharika wrote:

To explain a little more about what the extension is actually trying to do: Written Chinese involves "inventing" new characters all the time by combining a bunch of standard characters. For example this ideograph is a combination of 4 individual Chinese characters.

Excuse me, this one is a combination of 10 individual characters. ^^!!

In T153796#2892227, @Samwilson wrote:

That sounds great @Shoichi! So it sounds like it isn't going to be necessary to maintain a separate fork of han3_ji7_tsoo1_kian3; is that correct? (That'd be best, of course!)

Yes, after discussing it with him, we found that would be too complicated. I have an agreement with him: he keeps using Han characters for function and variable names, and I set up a team to translate their meaning into English in comments (and also translate the other Mandarin or Taiwanese comments).

I will start by making a branch in upstream and then cloning it. After our work is done, we will merge it back to upstream.

Actually, one thing you may not know: originally all the source file names were in Chinese too. I translated them to English for compatibility with non-Chinese file systems (some Python 2 script problem). After this discussion and testing, he gave up on reverting the English filenames to Chinese names. XD

Shangkuanlc subscribed. Dec 21 2016, 7:17 AM

In T153796#2892228, @Shoichi wrote:

In T153796#2892171, @Niharika wrote:

To explain a little more about what the extension is actually trying to do: Written Chinese involves "inventing" new characters all the time by combining a bunch of standard characters. For example this ideograph is a combination of 4 individual Chinese characters.

Excuse me, this one is a combination of 10 individual characters. ^^!!

Apologies! I recalled differently from my conversation with Liang earlier but I'm probably confusing this with another character he showed me. :)

Samwilson updated the task description. Dec 22 2016, 2:50 AM

@Niharika I added your summary to the investigation text.

@Shoichi Translation of comments sounds good. (By the way, what's the title of the tool? 'han3_ji7_tsoo1_kian3' sounds like a transliteration or abbreviation or something?)

If the tool is to be maintained by you and other people upstream then I can't see it being too much of a problem if it's harder work for English-only developers. Getting it deployed to production perhaps will require more in-depth security review, but as it doesn't need to store any data or anything maybe it'll be easy. Anyway, certainly it can live on Tool Labs as it currently is (which is what we're doing, for example, with the Google OCR system for Indic proofreading).

Samwilson moved this task from In Development to Needs Review/Feedback on the Community-Tech-Sprint board. Dec 22 2016, 3:09 AM

In T153796#2892350, @Niharika wrote:

In T153796#2892228, @Shoichi wrote:

In T153796#2892171, @Niharika wrote:

To explain a little more about what the extension is actually trying to do: Written Chinese involves "inventing" new characters all the time by combining a bunch of standard characters. For example this ideograph is a combination of 4 individual Chinese characters.

Excuse me, this one is a combination of 10 individual characters. ^^!!

Apologies! I recalled differently from my conversation with Liang earlier but I'm probably confusing this with another character he showed me. :)

That's OK, @Niharika. What I showed you is a glyph composed of the four Unicode characters "招財進寶" (zhāo cái jìn bǎo). There is a detailed explanation of this glyph on the Adobe blog; please see the link below for your reference: https://blogs.adobe.com/CCJKType/2009/01/diy.html .

@Shoichi Translation of comments sounds good. (By the way, what's the title of the tool? 'han3_ji7_tsoo1_kian3' sounds like a transliteration or abbreviation or something?)

@Samwilson: "han3 ji7 tsoo1 kian3" is the Taiwanese pronunciation of the Han characters "漢字組件" (composites of Han characters): "han3 ji7" (漢字) means Han characters, and "tsoo1 kian3" (組件) means composites.

Still required:

How do screen-readers (etc.) handle IDS? The extension adds the input string (including the operator prefix character) as the alt attribute of the image.

@Samwilson: A bit confused by this last required item. Is there any action that needs to be taken or was this just a question that has been answered?

Sorry, yes, that was just a question and answer. All good as is.

kaldari closed this task as Resolved. Dec 22 2016, 11:47 PM

kaldari moved this task from Needs Review/Feedback to Q1 2018-19 on the Community-Tech-Sprint board.

kaldari updated the task description. Dec 23 2016, 12:08 AM

kaldari updated the task description. Dec 23 2016, 12:11 AM

Samwilson updated the task description. Dec 23 2016, 12:31 AM

Samwilson moved this task from In development to Done on the IDS-extension board. Dec 23 2016, 12:47 AM

kaldari updated the task description. Dec 23 2016, 7:28 PM

kaldari added a parent task: T154044: Epic: Support custom Han characters on Chinese Wikisource. Dec 23 2016, 7:57 PM

kaldari updated the task description.

In T153796#2892239, @Shoichi wrote:

In T153796#2892227, @Samwilson wrote:

That sounds great @Shoichi! So it sounds like it isn't going to be necessary to maintain a separate fork of han3_ji7_tsoo1_kian3; is that correct? (That'd be best, of course!)

Yes, after discussing it with him, we found that would be too complicated. I have an agreement with him: he keeps using Han characters for function and variable names, and I set up a team to translate their meaning into English in comments (and also translate the other Mandarin or Taiwanese comments).

I will start by making a branch in upstream and then cloning it. After our work is done, we will merge it back to upstream.

Actually, one thing you may not know: originally all the source file names were in Chinese too. I translated them to English for compatibility with non-Chinese file systems (some Python 2 script problem). After this discussion and testing, he gave up on reverting the English filenames to Chinese names. XD

I organized the translation team on 12/22. We have 5 people. We will start soon; the network service code of han3_ji7_tsoo1_kian3 has high priority.

In T153796#2897239, @kaldari wrote:

Still required:

How do screen-readers (etc.) handle IDS? The extension adds the input string (including the operator prefix character) as the alt attribute of the image.

@Samwilson: A bit confused by this last required item. Is there any action that needs to be taken or was this just a question that has been answered?

The input string just uses IDS as an "encoding"; you can think of an IDS as a new "Unicode code point". As for screen readers, if you know how a character is pronounced, in the future you can just assign a pronunciation to its IDS. But, honestly, Han characters are ideographic and very visual. Not every one can be pronounced. Even tens of thousands of Han characters already encoded in Unicode have no pronunciation. The Han characters that can be pronounced number perhaps 5,000 to 20,000 (depending on the language). I think deciding how they should be pronounced is another giant subject. (Counting Han characters both in and outside Unicode, there are now some 400,000+ in the world.)
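
To make the alt-text idea concrete, here is a small sketch. The thread only states that the extension uses the input string as the image's alt attribute; the optional pronunciation lookup illustrates the "assign a pronunciation to its IDS" suggestion and is an assumption, as are the function name and the `readings` mapping.

```python
from html import escape


def ids_img_tag(ids, src, readings=None):
    """Return an <img> tag whose alt text is a known reading, else the IDS itself.

    `readings` is a hypothetical mapping from IDS string to pronunciation.
    """
    readings = readings or {}
    alt = readings.get(ids, ids)
    # Escape both attributes so arbitrary input cannot break out of the tag.
    return '<img src="{}" alt="{}">'.format(
        escape(src, quote=True), escape(alt, quote=True)
    )
```

With no reading registered, the tag falls back to the raw IDS string, which matches the behaviour described for the extension.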

DannyH edited projects, added Community-Tech; removed Community-Tech-Sprint. Jan 3 2017, 7:41 PM

DannyH moved this task from Up Next (May 6-17) to Archive on the Community-Tech board. Jan 3 2017, 8:06 PM

Investigation: Create new Han characters with IDS extension for Wikisource
Closed, Resolved · Public · 3 Estimated Story Points