Support 'utf8mb4' character set in MySQL 5.5 and above
Closed, Declined (Public)

Description

The 'utf8' character set in MySQL does not support characters above
U+FFFF, which take up four bytes. (An example of such a character is
U+1D49E MATHEMATICAL SCRIPT CAPITAL C ("𝒞"), encoded as F0 9D 92 9E.)
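
The byte counts above can be checked with a short Python sketch. It only uses the standard `str.encode` and `bytes.hex` calls (the separator argument needs Python 3.8 or later); the variable names are illustrative.

```python
# Encode the example character from the description and inspect its bytes.
ch = "\U0001D49E"  # MATHEMATICAL SCRIPT CAPITAL C

encoded = ch.encode("utf-8")
hex_bytes = encoded.hex(" ").upper()
print(f"U+{ord(ch):04X} -> {hex_bytes} ({len(encoded)} bytes)")

# The highest code point that still fits in three UTF-8 bytes is U+FFFF.
bmp_max = "\uFFFF".encode("utf-8")
print(f"U+FFFF -> {bmp_max.hex(' ').upper()} ({len(bmp_max)} bytes)")
```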

Yet the web installer so prominently offers the "UTF-8" option, despite
this serious limitation. Perhaps MediaWiki should support the 'utf8mb4'
character set in MySQL 5.5 and above, in which that option is available.

Mailing list discussion that prompted me to file this bug:
http://lists.wikimedia.org/pipermail/wikitech-l/2013-May/069552.html

Details

Reference
bz48767

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 1:27 AM
bzimport set Reference to bz48767.
bzimport added a subscriber: Unknown Object (MLST).

(In reply to comment #0)

> The 'utf8' character set in MySQL does not support characters above
> U+FFFF, which take up four bytes. (An example of such a character is
> U+1D49E MATHEMATICAL SCRIPT CAPITAL C (

Well, that's broken. Remainder of description:

), encoded as F0 9D 92 9E.)

Yet the web installer so prominently offers the "UTF-8" option, despite
this serious limitation. Perhaps MediaWiki should support the 'utf8mb4'
character set in MySQL 5.5 and above, in which that option is available.

Mailing list discussion that prompted me to file this bug:
http://lists.wikimedia.org/pipermail/wikitech-l/2013-May/069552.html

What collation would MediaWiki use for the utf8mb4 character set? I assume it would have to be the binary collation utf8mb4_bin, but it would be good to clarify this. Sadly, case- and accent-sensitive collations for Unicode character sets will only be available with MySQL 8.0 😞
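
As a rough illustration of what a binary collation implies, the Python sketch below compares and sorts strings by their raw UTF-8 bytes. This is an analogy only, not MySQL's implementation; it ignores details such as trailing-space handling, and the helper names are hypothetical.

```python
def binary_equal(a, b):
    """Compare two strings byte for byte, as a binary comparison would."""
    return a.encode("utf-8") == b.encode("utf-8")


def binary_sorted(words):
    """Sort by raw UTF-8 bytes, which is the same as code point order."""
    return sorted(words, key=lambda w: w.encode("utf-8"))


# Case and accents both matter under a byte-wise comparison.
print(binary_equal("Page", "page"))
print(binary_equal("cafe", "caf\u00e9"))

# Uppercase ASCII sorts before lowercase; non-ASCII characters sort last.
print(binary_sorted(["b", "a", "B", "\u00e9", "\U0001D49E"]))
```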

I started a conversation on T194125 about how to move forward in general. I think there are two possibilities: embrace utf8mb4, or consolidate on binary only. Most developers seem to prefer the second option for now (and I can understand that unifying on the most complete option would be preferred), but that also opens other questions, such as how to support and upgrade existing installations. Please add your thoughts there.

Note that I do not have a horse in this race, but obviously I am interested in the outcome as a DBA.

It seems that the decision, as of now, is to stop supporting utf8 (the MySQL option) for new installs, which, as the original reporter noticed, is buggy because it does not support the full range of UTF-8 (4-byte) encodings. However, the way to fix the issue is to use the already existing binary collation, which already supports all encodings without being constrained by a particular configuration or collation.
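
To see which text is affected by the 3-byte limit described in this task, a small Python sketch can list the characters above U+FFFF in a string. The function names are hypothetical and this is not part of MediaWiki; it only illustrates the range check.

```python
def chars_outside_bmp(text):
    """Return the characters whose UTF-8 encoding needs four bytes."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


def fits_three_byte_utf8(text):
    """True if every character encodes to at most three UTF-8 bytes."""
    return not chars_outside_bmp(text)


sample = "Script C: \U0001D49E and emoji \U0001F61E"
for ch in chars_outside_bmp(sample):
    print(f"U+{ord(ch):04X} needs {len(ch.encode('utf-8'))} bytes")

print(fits_three_byte_utf8("plain ASCII and \u00e9"))
```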

Technically, this will solve the proposed issue (allowing UTF-8), but as it was worded as "Support utf8mb4", I will close it as declined, based on the discussion at T194125. I hope this decision will make most people happy and will result in better compatibility with newer MySQL versions (as binary is equally well supported on old and newer versions).

I stumbled upon this report when trying to solve my MediaWiki 1.31 utf8 issues with emojis. My MariaDB tables mostly had the utf8 (utf8mb3) encoding active, which caused the problem. I realized that the decision was made to recommend and support only binary encodings of text columns for MariaDB/MySQL (containing raw UTF-8 data). Although I would have preferred to switch to utf8mb4, I will stick to that recommendation and have changed my columns to match the up-to-date MediaWiki SQL schema.

Using ALTER TABLE table CONVERT TO CHARACTER SET binary; on all tables does not lead to a schema identical to the one shipped with MediaWiki (e.g. varchar(15) binary vs. varbinary(15)).
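
One way to spot this kind of mismatch is to diff the column types of the live database against those of the shipped schema. The Python sketch below assumes both have already been collected into nested dictionaries (table, then column, then type string); the table and column names are hypothetical, and this is not the tool linked in this comment.

```python
def diff_column_types(actual, expected):
    """Return (table, column, actual_type, expected_type) for each mismatch."""
    mismatches = []
    for table, columns in expected.items():
        for column, want in columns.items():
            # A missing table or column shows up with an actual type of None.
            have = actual.get(table, {}).get(column)
            if have is None or have.strip().lower() != want.strip().lower():
                mismatches.append((table, column, have, want))
    return mismatches


# Hypothetical data: the type after CONVERT TO CHARACTER SET binary
# versus the type in the shipped schema.
actual = {"example_table": {"example_col": "varchar(15) binary"}}
expected = {"example_table": {"example_col": "varbinary(15)"}}

for table, column, have, want in diff_column_types(actual, expected):
    print(f"{table}.{column}: have {have!r}, expected {want!r}")
```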

Instead of changing the columns manually for every table, I successfully automated the process and shared the description and tool at https://www.mediawiki.org/wiki/Manual_talk:$wgDBmysql5, hoping that other people who have to solve the same problem can save some time (tested on 1.31, but the solution does not depend on a specific MediaWiki version).