Page MenuHomePhabricator

CHAR_LENGTH does not return the character count
Closed, ResolvedPublic

Description

In MySQL, CHAR_LENGTH returns the character count and LENGTH returns the byte count. For a Chinese character (UTF-8 encoding), the former is supposed to return 1 while the latter returns 3. But in queries like this, CHAR_LENGTH acts just like LENGTH and returns the byte count.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptAug 30 2017, 11:25 AM
jcrespo closed this task as Resolved.Sep 5 2017, 5:22 PM
jcrespo claimed this task.
jcrespo added a subscriber: jcrespo.

See:
https://quarry.wmflabs.org/query/21367 which counts by characters, not by bytes.

Mediawiki on WMF-hosted wikis uses Binary collation. There are many reasons for this, from legacy, to better support of newer unicode standards. Right now there is not easy way to count characters on the mediawiki database. CHAR_LENGTH would be the right way, but because of this, it is as you say, completely equivalent to LENGTH (it counts the number of bytes, not the unicode characters).

There are 2 ways to solve this:

  • Do not use mysql functions from that, use programming language functions and assume the characters are utf8. On PHP that is somthing like iconv_strlen($str,'UTF-8'), I think. That is what mediawiki does on WMF, more or less.
  • Convert to utf8mb4 on the fly, and use char_length there: SELECT page_title, page_is_redirect, CHAR_LENGTH(CONVERT(page_title using utf8mb4)), LENGTH(page_title) FROM page WHERE page_namespace = 0 ORDER BY CHAR_LENGTH(CONVERT(page_title using utf8mb4)) DESC LIMIT 50;

See:
https://quarry.wmflabs.org/query/21367 which counts by characters, not by bytes.

Note that CHAR_lenght will most likely invalidate the usage of indexes, no matter the encoding.

So in general this is a "won't fix" (but workarounds given above, but I marked it as resolved because technically the above second option does what you want, more or less, but it is not possible to change it on the labsdbs- it comes like that from production.

Base added a subscriber: Base.Sep 5 2017, 5:56 PM

@jcrespo , is this why I also fail to get normal results while attempting to match title against a regex? https://quarry.wmflabs.org/query/21026

is this why I also fail to get normal results while attempting to match title against a regex?

I cannot say, I would tell you to try if it helps :-) Some of the functions are utf-8 aware, while other are just the low level C-equivalent that only handles bytes. Check the manual to know which one is the right answer.