Identify all languages that don't have spaces
Closed, Resolved (Public)

Description

We've already identified: ja (Japanese), zh (Chinese), th (Thai), and km (Khmer)

We need to survey the rest of the languages we have projects in and identify any others that are likely to be affected by the query changes we made along with BM25.

Event Timeline

There are also the Chinese variants:

  • gan (Gan Chinese)
  • hak (Hakka Chinese)
  • wuu (Wu Chinese)
  • zh-classical (Classical Chinese)
  • zh-yue (Yue Chinese)

And also:

  • bo (Tibetan)
  • dz (Dzongkha)
  • lo (Laotian)

That should be all, but I might have missed a couple.

@jhsoby Thanks for these! I'll make sure to include them in the final list.

My full write-up is on MediaWiki.

These languages/projects use primarily spaceless writing systems:

  • Languages: Tibetan, Dzongkha, Gan, Japanese, Khmer, Lao, Burmese, Thai, Wu, Chinese, Classical Chinese, Cantonese
  • Codes: bo, dz, gan, ja, km, lo, my, th, wuu, zh, zh-classical, zh-yue

These languages/projects use a mix of writing systems that do and don't use spaces:

  • Languages: Buginese, Min Dong, Cree, Hakka, Javanese, Min Nan
  • Codes: bug, cdo, cr, hak, jv, zh-min-nan

More details about the mixed projects are available in my write-up. Briefly, the Chinese languages listed here (Min Dong, Hakka, Min Nan) primarily use a romanized version of the language on their Wikipedias, but have a number of pages in Chinese. Javanese Wikipedia primarily uses a Latin script, but the Javanese Wiktionary includes many items in the spaceless Javanese script.
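
For anyone who wants to use the two lists in tooling, here is a minimal Python sketch of a lookup by language code. The codes are copied from the lists above; the names `SPACELESS_CODES`, `MIXED_CODES`, and `spacing_category` are hypothetical and are not part of any existing search configuration. Anything not in either list is reported as "spaced-or-unknown", since the survey may have missed a few.

```python
# Language codes copied from the lists in this task.
# All names in this block are illustrative, not existing configuration.
SPACELESS_CODES = frozenset({
    "bo", "dz", "gan", "ja", "km", "lo", "my", "th",
    "wuu", "zh", "zh-classical", "zh-yue",
})
MIXED_CODES = frozenset({"bug", "cdo", "cr", "hak", "jv", "zh-min-nan"})


def spacing_category(code: str) -> str:
    """Return 'spaceless', 'mixed', or 'spaced-or-unknown' for a language code."""
    normalized = code.strip().lower()
    if normalized in SPACELESS_CODES:
        return "spaceless"
    if normalized in MIXED_CODES:
        return "mixed"
    # Not in either list: either the language uses spaces or it was not surveyed.
    return "spaced-or-unknown"


if __name__ == "__main__":
    for code in ("th", "hak", "en"):
        print(code, spacing_category(code))
```

Keeping the mixed projects in their own set makes it possible to treat them differently from the fully spaceless ones, for example by applying any spaceless-specific handling only to certain pages.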

While working on this, I realized that we have not addressed languages that are polysynthetic or agglutinative. Regardless of writing system (some use Latin script, for example), these languages can have very long and complex words that convey the meaning of a whole English sentence. Searching for individual units of meaning within them is very challenging, though probably beyond the scope of our current effort.