
Armenian has low sentence performance due to use of standard colon in Flores data
Open, Needs Triage, Public

Description

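The library correctly handles the Armenian full stop (։, U+0589; see https://www.unicode.org/charts/PDF/U0530.pdf), but the Flores data uses a standard colon (:, U+003A) in its place most of the time. The two characters look similar, but the library does not treat a standard colon as a sentence terminator, so sentences that end with one are not split and the measured performance drops. Options: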

  • Document: Just caveat the "low" performance in a README somewhere so folks are aware.
  • Cover-up: Convert the colons to the official character in our dataset. I'd hesitate about this though because this usage of standard colons might appear in other language data.
  • Fix: Add language-specific exceptions, such as treating a standard colon as a sentence terminator for Armenian (presumably there are other languages where this sort of thing happens). Not sure how complicated this would be though.
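
To make the "Cover-up" and "Fix" options more concrete, here is a rough Python sketch of each. It does not use the library's actual API: `BASE_TERMINATORS`, `EXTRA_TERMINATORS`, `split_sentences` and `normalize_armenian_colons` are all hypothetical names, and the splitter is a deliberately naive regex, just enough to show the shape of a per-language exception. It also assumes a colon only counts when it is followed by whitespace or the end of the text, so that things like "10:30" are left alone.

```python
import re

# Hypothetical defaults; a real segmenter would have a richer rule set.
BASE_TERMINATORS = ".!?"

# Hypothetical per-language additions. For Armenian ("hy") we accept both the
# proper full stop (U+0589) and the ASCII colon that shows up in its place.
EXTRA_TERMINATORS = {"hy": "\u0589:"}


def split_sentences(text: str, lang: str) -> list[str]:
    """'Fix' option: split after a terminator that is followed by whitespace."""
    terminators = BASE_TERMINATORS + EXTRA_TERMINATORS.get(lang, "")
    # Lookbehind keeps the terminator attached to the sentence it ends.
    pattern = re.compile(r"(?<=[" + re.escape(terminators) + r"])\s+")
    return [part for part in pattern.split(text) if part]


def normalize_armenian_colons(text: str) -> str:
    """'Cover-up' option: rewrite sentence-final ASCII colons to U+0589."""
    # Only touch colons followed by whitespace or the end of the string.
    return re.sub(r":(?=\s|$)", "\u0589", text)


if __name__ == "__main__":
    sample = "Բարև ձեզ: Ինչպես եք:"
    print(split_sentences(sample, "hy"))
    print(split_sentences(sample, "en"))
    print(normalize_armenian_colons(sample))
```

With the `hy` entry in place the sample splits into two sentences, while any language without an entry keeps the colon as ordinary punctuation. A real rule would still need to decide what to do with colons that are followed by a space but are not sentence-final.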