
Fix Bengali tokenization in deltas
Closed, Resolved · Public

Description

+    cache = {r_text: "দেখার পর তিনি চ্চিত্র worngly."}
+    eq_(solve(bengali.dictionary.revision.datasources.dict_words,
+              cache=cache),
+       ['দ', 'খ', 'র', 'পর', 'ত', 'ন', 'চ', 'চ', 'ত', 'র'])

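That looks wrong. The expected list splits each Bengali word into isolated consonants and drops the vowel signs, instead of keeping whole words such as দেখার.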

Event Timeline

Halfak created this task. May 8 2017, 4:58 PM
Restricted Application added a project: User-Ladsgroup. May 8 2017, 4:58 PM
Halfak claimed this task. May 8 2017, 4:58 PM
Halfak removed a project: User-Ladsgroup.
Halfak added a subscriber: Ladsgroup.
Aftabuzzaman added a comment. Edited · May 8 2017, 6:35 PM

I don't know what the problem is. In case it helps, here is how the text breaks down: দেখার পর তিনি চ্চিত্র => দ+ে+খ+া+র+ space +প+র+ space +ত+ি+ন+ি+ space +চ+্+চ+ি+ত+্+র
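To make that breakdown easy to reproduce, here is a small sketch that lists each code point with its Unicode general category. The helper name `describe` is made up for illustration; only the standard `unicodedata` module is used. The vowel signs and the virama come out as combining marks (categories starting with `M`), not letters (`L`).

```python
import unicodedata


def describe(text):
    """Return (character, code point, general category) for each character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in text]


# Print one line per code point of the first word in the test string.
for ch, code_point, category in describe("দেখার"):
    print(code_point, category, unicodedata.name(ch, "<unnamed>"))
```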

Halfak added a comment. May 8 2017, 6:44 PM

Right. I think I need to account for those sign characters for Bengali when doing word tokenization. We had a similar problem with Hindi and Arabic/Persian.
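A minimal sketch of what accounting for those sign characters could look like, assuming a word is any run of letters (Unicode categories `L*`) and combining marks (`M*`). This is not the revscoring tokenizer; `is_word_char` and `tokenize_words` are hypothetical names used only for illustration.

```python
import unicodedata


def is_word_char(ch):
    """Treat letters (L*) and combining marks (M*) as part of a word."""
    return unicodedata.category(ch)[0] in ("L", "M")


def tokenize_words(text):
    """Split text into runs of letters and marks, dropping everything else."""
    words, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            # A non-word character ends the current word.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


print(tokenize_words("দেখার পর তিনি চ্চিত্র worngly."))
```

Under this assumption the vowel signs and viramas stay attached to their consonants, so the example string yields the five whole words rather than single consonants. Digits and underscores are deliberately left out of the word definition to keep the sketch short.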

Qse24h closed this task as a duplicate of T164723: New git repository: <repo name>.
Peachey88 reopened this task as Open. May 9 2017, 10:12 AM
Halfak closed this task as Resolved. Jun 5 2017, 5:07 PM