Scribunto does not understand Thai number as input in some aspects
Closed, Resolved | Public | BUG REPORT

Description

I am trying to write a Scribunto module to convert Thai numbers to Arabic numbers, but it always fails, and the debug module does not recognize the numbers at all.

Scribunto still passes Thai numbers from the input straight through to the output (without doing any work on them), so the problem might lie in matching each input character against the same character in the lookup table.
Further edit: string.gsub() converts the numbers successfully, but I am keeping this task open because the original method should have worked too.
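For illustration, a string.gsub()-based conversion might look like the minimal sketch below. It is not the code from the linked module: the names THAI_TO_ARABIC and thaiToArabic are hypothetical, and the sketch assumes the source is saved as UTF-8. The pattern is written in bytes on purpose, since the Thai digits U+0E50 to U+0E59 are encoded in UTF-8 as the bytes 224, 185, and then 144 to 153.

```lua
-- Hypothetical lookup table (not from the original module): Thai digit -> Arabic digit.
local THAI_TO_ARABIC = {
  ["๐"] = "0", ["๑"] = "1", ["๒"] = "2", ["๓"] = "3", ["๔"] = "4",
  ["๕"] = "5", ["๖"] = "6", ["๗"] = "7", ["๘"] = "8", ["๙"] = "9",
}

-- Hypothetical helper: replaces every Thai digit in `text` with its Arabic digit.
local function thaiToArabic(text)
  -- "\224\185[\144-\153]" matches exactly one three-byte Thai digit.
  -- With a table as the replacement, gsub looks up each whole match as a key.
  -- The outer parentheses drop gsub's second return value (the match count).
  return (string.gsub(text, "\224\185[\144-\153]", THAI_TO_ARABIC))
end

print(thaiToArabic("๑๒๓"))
```

Because the pattern only matches complete three-byte sequences, other Thai text in the input is left unchanged.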

See test_2 and test_3 at https://th.wikisource.org/wiki/Module_talk:ThaiToArabicNum/testcases for how it fails; the code is at https://th.wikisource.org/wiki/Module:ThaiToArabicNum/sandbox.

List of steps to reproduce (step by step, including full links if applicable): go to https://th.wikisource.org/wiki/Module_talk:ThaiToArabicNum/testcases test_2 and test_3
What happens?: Thai to Arabic number conversion fails completely because Scribunto does not understand Thai numbers as input
What should have happened instead?: Thai to Arabic number conversion passes (as it does with the non-sandbox version of the script)
Software version (if not a Wikimedia wiki), browser information, screenshots, other information, etc.: Thai Wikisource

Event Timeline

Bebiezaza renamed this task from Scribunto does not understand Thai number as input to Scribunto does not understand Thai number as input in some aspects. Apr 25 2022, 5:39 PM
Nullzero claimed this task.
Nullzero subscribed.

This works as intended. Lua doesn't have good support for Unicode compared to other programming languages. As indicated in http://lua-users.org/wiki/LuaUnicode:

Lua's pattern matching facilities work byte by byte. In general, this will not work for Unicode pattern matching
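A short sketch of what "byte by byte" means in practice, assuming the source is saved as UTF-8 (all variable names here are illustrative only):

```lua
local one = "๑"                              -- U+0E51, stored as the UTF-8 bytes E0 B9 91
local length = #one                          -- 3: the length operator counts bytes
local b1, b2, b3 = string.byte(one, 1, -1)   -- 224, 185, 145

local text = "๑๒๓"
local firstByte = string.sub(text, 1, 1)     -- "\224": only a fragment of a character
local firstChar = string.sub(text, 1, 3)     -- "๑": all three bytes of the first digit

-- "." matches a single byte, so three Thai digits yield nine matches.
local _, dotMatches = string.gsub(text, ".", "")

print(length, firstByte == one, firstChar == one, dotMatches)
```

A lookup such as digits[string.sub(text, i, i)] therefore never finds a Thai digit key, because the index selects a single byte rather than a whole character. Within Scribunto, the mw.ustring library provides Unicode-aware counterparts of the string functions and may be worth checking as an alternative.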

The page suggests a workaround: a string can be iterated over one UTF-8 character at a time with the following:

-- Each match is one UTF-8 character: a lead byte plus any continuation bytes.
for uchar in string.gmatch("๑๒๓", "([%z\1-\127\194-\244][\128-\191]*)") do
  print(uchar)
end

which produces:

๑
๒
๓
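Building on that loop, a character-by-character conversion could be sketched as follows. This is only an illustration under stated assumptions: the names UTF8_CHAR, DIGITS, and convertByChar are hypothetical and do not come from the module discussed in this task, and the pattern is the one quoted above, which relies on %z as provided by Lua 5.1.

```lua
-- Pattern from the loop above: one lead byte followed by any continuation bytes.
local UTF8_CHAR = "[%z\1-\127\194-\244][\128-\191]*"

-- Hypothetical lookup table: Thai digit -> Arabic digit.
local DIGITS = {
  ["๐"] = "0", ["๑"] = "1", ["๒"] = "2", ["๓"] = "3", ["๔"] = "4",
  ["๕"] = "5", ["๖"] = "6", ["๗"] = "7", ["๘"] = "8", ["๙"] = "9",
}

-- Hypothetical helper: walks `text` one UTF-8 character at a time and
-- swaps Thai digits for Arabic ones, leaving every other character as it is.
local function convertByChar(text)
  local out = {}
  for uchar in string.gmatch(text, UTF8_CHAR) do
    out[#out + 1] = DIGITS[uchar] or uchar
  end
  return table.concat(out)
end

print(convertByChar("๑๒๓"))
```

Here each table key is compared against a whole multi-byte character, not a single byte, which is what a per-character table lookup needs.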

Anyway, closing this as resolved, as we already have functioning code.