Support 'utf8mb4' character set in MySQL 5.5 and above
Closed, Declined, Public

Description

The 'utf8' character set in MySQL does not support characters above
U+FFFF, which take up four bytes. (An example of such a character is
U+1D49E MATHEMATICAL SCRIPT CAPITAL C ("𝒞"), encoded as F0 9D 92 9E.)
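
The byte sequence quoted above can be checked with a short Python sketch. The helper name `needs_utf8mb4` is hypothetical and not part of MediaWiki or MySQL; it only tests whether a string contains a code point above U+FFFF, which is the case the three-byte 'utf8' character set cannot store.

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if any character lies above U+FFFF (outside the BMP).

    Hypothetical helper for illustration only: such characters need four
    bytes in UTF-8, which MySQL's three-byte 'utf8' charset cannot hold.
    """
    return any(ord(ch) > 0xFFFF for ch in text)


script_c = "\U0001D49E"  # U+1D49E MATHEMATICAL SCRIPT CAPITAL C
encoded = script_c.encode("utf-8")

print(encoded.hex(" ").upper())   # F0 9D 92 9E
print(len(encoded))               # 4
print(needs_utf8mb4(script_c))    # True
print(needs_utf8mb4("café"))      # False ("é" is U+00E9, two bytes in UTF-8)
```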

Yet the web installer so prominently offers the "UTF-8" option, despite
this serious limitation. Perhaps MediaWiki should support the 'utf8mb4'
character set in MySQL 5.5 and above, in which that option is available.

Mailing list discussion that prompted me to file this bug:
http://lists.wikimedia.org/pipermail/wikitech-l/2013-May/069552.html

Details

Reference
bz48767

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 1:27 AM
bzimport added a project: Wikimedia-Rdbms.
bzimport set Reference to bz48767.
bzimport added a subscriber: Unknown Object (MLST).

Jdforrester-WMF set Security to None.
TK-999 added a subscriber: TK-999. Edited Mar 19 2018, 8:51 PM

What collation would MediaWiki use for the utf8mb4 character set? I assume it'd have to be the binary collation utf8mb4_bin, but it'd be good to clarify this. Sadly, case- and accent-sensitive collations for Unicode character sets will only be available with MySQL 8.0 😞

I started a conversation on T194125 about the general question of how to move forward. I think there are two possibilities: embrace utf8mb4, or consolidate on binary only. Most developers seem to prefer the second option for now (and I can understand that unifying on the most complete option would be preferred), but that also opens other questions, such as how to support or upgrade existing installations. Please add your thoughts there.

Note that I do not have a horse in this race, but obviously I am interested in the outcome as a DBA.

jcrespo closed this task as Declined. May 31 2018, 9:40 PM

It seems that the decision, as of now, is going to be to stop supporting utf8 (the MySQL option) for new installs, which, as the original reporter noticed, is buggy because it does not support the full range of UTF-8 (4-byte) encodings. However, the way to fix the issue is to use the already existing binary collation, which already supports all encodings without being constrained by a particular configuration or collation.

Technically, this will solve the proposed issue (allowing UTF-8), but as it was worded as "Support utf8mb4", I will close it as declined, based on the discussion at T194125. I hope this decision will make most people happy and will result in better compatibility with newer MySQL versions (as binary is equally well supported on old and newer versions).
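
As a rough illustration of why binary storage sidesteps the problem, here is a minimal Python sketch. It models byte-wise storage and comparison in plain Python and does not talk to MySQL; the function name `binary_sort` and the sample titles are assumptions made for the example. Raw bytes round-trip four-byte sequences unchanged, and ordering by bytes is case- and accent-sensitive.

```python
def binary_sort(values):
    """Sort strings by their UTF-8 bytes, as a byte-wise comparison would.

    Hypothetical helper: a Python model of binary ordering, not MySQL code.
    """
    return sorted(values, key=lambda s: s.encode("utf-8"))


# Sample values, including a four-byte character (U+1D49E).
titles = ["b", "a", "B", "é", "e", "\U0001D49E"]

# Storing the raw bytes and decoding them again loses nothing.
stored = [t.encode("utf-8") for t in titles]
roundtrip = [b.decode("utf-8") for b in stored]
print(roundtrip == titles)  # True

# Byte-wise order: uppercase before lowercase, accented after unaccented.
print(binary_sort(titles))  # ['B', 'a', 'b', 'e', 'é', '𝒞']
```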