Korean Hanja to Hangul unidirectional conversion tool on Korean wikisource?
Closed, Invalid; Public

Description

Some previous discussion is at https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wikipedia_Hanja_2

As described in various related pages, it is difficult to build a conversion engine that converts Hangul to Hanja with high accuracy (one work-in-progress converter has been linked, but its accuracy is still limited). When a unidirectional conversion was proposed in the kowp community, many opposed it on grounds such as editability and the small number of kowp editors who can read Hanja, which would make the text hard to check and edit.

However, as described in the linked discussion, Korean Wikisource already stores numerous Korean texts in both a Hanja edition and a Hangul edition. The tool would help Korean Wikisource merge these duplicated pages.

Conversion data is available in the Unihan database, in the kHangul value of each ideographic character.
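As a rough sketch of how that data could be loaded, the Python below reads kHangul entries into a lookup table. It assumes the tab-separated layout used by the Unihan data files (code point, field name, value) and assumes that each reading may carry a ":"-separated source suffix, which is stripped. The function name `load_khangul` and the sample lines are hypothetical; the sample values are illustrative and not copied from the database.

```python
# Illustrative lines only; the real data would come from the Unihan readings file.
SAMPLE = """\
# comment lines start with '#'
U+97D3\tkHangul\t한:0E
U+570B\tkHangul\t국:0E
U+6A02\tkHangul\t낙:0E 락:0E 악:0E 요:0E
U+97D3\tkMandarin\thán
"""


def load_khangul(lines):
    """Map each ideograph to its list of Hangul readings (hypothetical helper)."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        parts = line.split("\t")
        if len(parts) != 3 or parts[1] != "kHangul":
            continue  # ignore other Unihan fields
        char = chr(int(parts[0][2:], 16))  # "U+97D3" -> "韓"
        # Drop the assumed ":source" suffix from every space-separated value.
        table[char] = [value.split(":")[0] for value in parts[2].split()]
    return table


table = load_khangul(SAMPLE.splitlines())
```

Keeping every reading in a list, instead of picking one, leaves the choice for characters with several readings to a later step.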

Note that a few ideographic characters have multiple readings. To my knowledge, Korean encodings, and Unicode for round-trip compatibility with them, encode those characters separately for each reading, so a direct one-to-one conversion should not be a problem in most cases. However, it seems that some of these characters may have been normalized to a single code point in MediaWiki. If that is the case, and the normalization is meant to stay in place, some additional rules will be needed.
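The folding concern can be checked with Python's standard `unicodedata` module. This is a minimal sketch, assuming the normalization in question is Unicode NFC: CJK compatibility ideographs have canonical decompositions to unified ideographs, so NFC replaces them and any per-reading distinction carried by the code point is lost. The helper name `folded_characters` is hypothetical.

```python
import unicodedata


def folded_characters(text):
    """List (original, normalized) pairs for characters that NFC changes."""
    pairs = []
    for ch in text:
        normalized = unicodedata.normalize("NFC", ch)
        if normalized != ch:
            pairs.append((ch, normalized))
    return pairs


# U+F900 is a CJK compatibility ideograph; U+8C48 is its unified counterpart.
compat = "\uF900"
unified = unicodedata.normalize("NFC", compat)
changed = folded_characters(compat + "\u8C48")
```

Running a check like this over stored page text would show whether any compatibility code points survive, and so whether the extra rules are needed.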

Note that the main goal of such a tool is to convert Hangul-Hanja mixed-script documents into Hangul-only text. Transliteration between Classical Chinese and Korean is out of scope for now. Texts written in the Idu system ([[:w:en:idu script]]) might be within the scope of the task.
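The mixed-script conversion itself could look roughly like the sketch below. It is not a proposed implementation: the names `is_hanja` and `to_hangul_only` and the small `READINGS` table are hypothetical, the code point ranges cover only some ideograph blocks, and characters with more than one listed reading are deliberately left unconverted so a human can resolve them.

```python
# Hypothetical reading table; a real one would be built from Unihan kHangul data.
READINGS = {
    "大": ["대"],
    "韓": ["한"],
    "民": ["민"],
    "國": ["국"],
    "音": ["음"],
    "樂": ["낙", "락", "악", "요"],  # several readings: left for manual review
}


def is_hanja(ch):
    """Rough test covering a few CJK ideograph blocks (an assumption, not exhaustive)."""
    cp = ord(ch)
    return (0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
            or 0x3400 <= cp <= 0x4DBF   # Extension A
            or 0xF900 <= cp <= 0xFAFF)  # CJK Compatibility Ideographs


def to_hangul_only(text, readings):
    """Replace Hanja that have exactly one known reading; keep everything else."""
    out = []
    for ch in text:
        options = readings.get(ch) if is_hanja(ch) else None
        if options and len(options) == 1:
            out.append(options[0])
        else:
            out.append(ch)  # Hangul, punctuation, unknown or ambiguous Hanja
    return "".join(out)


converted = to_hangul_only("大韓民國의 音樂", READINGS)
```

Leaving ambiguous characters in place keeps the output checkable, which matters given the editability concerns raised earlier.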

The Korean Wikisource community should also be asked about this.

Event Timeline

  1. The linked discussion was about Hanmun (Classical Chinese) documents, not documents written in Hanja-Hangul mixed script.
  2. The task is closed because the request, as currently worded, does not seem to be a good way to put it. I will probably post on the community discussion page once I have worded it better.