Epic: Support custom Han characters on Chinese Wikisource
Open, Medium, Public

Description

Also proposed in Community-Wishlist-Survey-2016, where it received 53 support votes and ranked #20 out of 265 proposals.

Han characters are widely used in East Asia (China, Taiwan, Singapore, Malaysia, Hong Kong, Japan, Korea and Vietnam). An enduring, unsolved problem in digital archiving is missing characters. This affects not only ancient books: even modern publications contain characters that are not encoded (for example, some authors have created 300-400 unique new characters in a single book). That makes such works difficult to archive on Wikisource. Unicode gradually adds new characters, but each new Unihan extension takes time to go live. In the past, Wikisource and even Wikipedia dealt with this problem by using image files to present those characters. But images cannot be indexed or searched, and are not interchangeable between computer systems.

Unicode's Ideographic Description Sequences (IDS) define how to compose a Han character from components. We have implemented an extension for Wikisource that dynamically renders a Han character from an IDS, like: <ids>⿺辶⿴宀⿱珤⿰隹⿰貝招</ids> It generates a Han character image file (currently rendered on a temporary server on wmflabs) with the IDS in its metadata. This is a solution to the missing Han character problem for all C/J/K/V books. The basis is that Han characters are not at the same level as the letters of European alphabets; they are closer to words. Han characters are an open set. They are composed in two dimensions from more basic components that carry their own meaning, like affixes in English (English words are composed in one dimension). In academia, component-based Han character composition technology has been developed and adapted to handle ancient Han books. The most famous examples are Academia Sinica's work and the CBETA sutra project. In recent years, open-source IDS renderers have become stable, so we can use the same technology to let Wikisource handle ancient Han books the same way those institutions do.
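
To make the structure of an IDS concrete, here is a minimal Python sketch that parses one into a tree. It is only an illustration and is not the parser used by the extension or by the renderer. It assumes the twelve Ideographic Description Characters U+2FF0 to U+2FFB, of which ⿲ and ⿳ take three components and the others take two; the name `parse_ids` is made up for this example.

```python
# Illustrative sketch only; not the code used by the IDS extension or renderer.

# Ideographic Description Characters U+2FF0..U+2FFB mapped to operand counts.
# U+2FF2 and U+2FF3 combine three components; the others combine two.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2FF2"] = 3
IDC_ARITY["\u2FF3"] = 3


def parse_ids(sequence):
    """Parse an IDS string into a nested tuple: (operator, child, child, ...)."""

    def parse_at(pos):
        if pos >= len(sequence):
            raise ValueError("IDS ended before all operands were supplied")
        ch = sequence[pos]
        arity = IDC_ARITY.get(ch)
        if arity is None:
            # Not an operator, so it is a leaf component.
            return ch, pos + 1
        pos += 1
        children = []
        for _ in range(arity):
            child, pos = parse_at(pos)
            children.append(child)
        return (ch, *children), pos

    tree, end = parse_at(0)
    if end != len(sequence):
        raise ValueError("trailing characters after a complete IDS")
    return tree


if __name__ == "__main__":
    # The example sequence from the task description.
    print(parse_ids("⿺辶⿴宀⿱珤⿰隹⿰貝招"))
```

For the example above, the outermost operator ⿺ takes 辶 and a nested description as its two operands, and every nested operator's second operand is the next description in the chain.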

Event Timeline

kaldari triaged this task as Medium priority. Dec 23 2016, 11:05 PM
kaldari added a subscriber: Shoichi.
Aklapper added a subscriber: Shizhao.

@Shizhao: Why did you add the Wikisource-Community-User-Group tag? If you think this is a non-coding task, please elaborate... Reverting.

@Shizhao: Did you talk to the Community-Tech team before adding that tag?

I'm afraid she didn't...

This task was proposed in the Community-Wishlist-Survey-2016 and in its current state needs an owner. Wikimedia is participating in Google Summer of Code 2017 and Outreachy Round 14. To the subscribers: would this task or a portion of it be a good fit for either of these programs? If so, would you be willing to help mentor this project? Remember, each outreach project requires a minimum of one primary mentor and a co-mentor.

I'm not sure how likely it is that the rendering engine will be security-reviewed any time soon, so is it an option to move ahead with deploying the IDS extension and for it to continue to use the existing Tool Labs rendering service?

This would require a caching layer to be added to the extension, so that not every page request results in a request to Labs. Is this something that we should work on? If the rendering engine is moved onto a production server (T148693) some time in the future, having an in-wiki caching system would still be worthwhile.
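
As a rough idea of what such a caching layer does, here is a minimal Python sketch. It is not the extension's code: the class name `IdsImageCache`, the in-memory dict, and the injected `render` callable are all assumptions made for illustration, and a real in-wiki cache would need a persistent store.

```python
import hashlib


class IdsImageCache:
    """Keeps rendered images keyed by a hash of the IDS string (sketch only)."""

    def __init__(self, render):
        # `render` is any callable that takes an IDS string and returns image
        # bytes, e.g. a function wrapping a call to the rendering service.
        self._render = render
        self._store = {}  # in-memory stand-in for a persistent cache

    @staticmethod
    def key_for(ids):
        # Hash the UTF-8 bytes so the key is short and ASCII-only.
        return hashlib.sha1(ids.encode("utf-8")).hexdigest()

    def get(self, ids):
        key = self.key_for(ids)
        if key not in self._store:
            # Cache miss: ask the renderer once and remember the result.
            self._store[key] = self._render(ids)
        return self._store[key]


if __name__ == "__main__":
    calls = []

    def fake_render(ids):
        # Stand-in for the remote renderer; records each call it receives.
        calls.append(ids)
        return b"image-bytes-for-" + ids.encode("utf-8")

    cache = IdsImageCache(fake_render)
    cache.get("⿰木木")
    cache.get("⿰木木")
    print(len(calls))  # the renderer was only contacted for the first request
```

Only the first request for a given IDS reaches the renderer; repeats are served from the store, which is the property the comment above asks for.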

I don't think we want to have a production extension dependent on a Tool Labs service. It would probably make sense to set the service up on the scaling cluster (similar to graphoid), i.e. sca1XXX in eqiad. It would need to be security reviewed first.

Makes sense.

In that case, it sounds like things might be waiting on the security review. @Shoichi has added some translations (comments only, I think) to the Java code, but perhaps there's more to do. The plan is currently not to translate the whole codebase, but just to add English comments throughout. Is this going to be sufficient for reviewing?

Hello everyone, regarding code review of the renderer, I have posted https://phabricator.wikimedia.org/T154044

After some research (including discussion with the upstream author), a good solution for caching is to put Squid in front of the IDS rendering server and use it as the cache server. Caching on the server side makes sense because requests from multiple wiki sites are likely to repeat: a given missing character may be heavily used on different sites. Caching on the server side should therefore work better than each wiki site caching by itself.

IDS_scheme.png (800×1 px, 74 KB)

This probably isn't a realistic option as there aren't any caching servers available for Tool Labs.

I think the idea would be to have Squid in front of a production IDS rendering server. Which would work, I think? If so, then really the big next step in getting this resolved is to finish translating the code in han3_ji7_tsoo1_kian3 and get it ready for security review. (Same for the extension, but it's so simple, especially if it doesn't need to incorporate caching, that it can be done after; the Java bit is the hard bit.)

I think I have a problem regarding T148693. My team has translated almost all of the web/network source code of han3_ji7_tsoo1_kian3. What is left is the graphics rendering code, so can someone do the security review first? Where should I apply?

See https://www.mediawiki.org/wiki/Review_queue#Preparing_for_deployment

Thank you. I am going to study the procedure and move on to the next step.