
Get rid of UTF-8 encoded as latin-1
Open, Needs Triage, Public

Description

Problem

(Copied and pasted from https://www.mediawiki.org/wiki/Toolserver:Code_snippets#Fix_UTF-8_encoded_as_latin-1 )
Versions of MySQL before 5 had neither binary nor UTF-8 support (or at least nobody cared). A common workaround was to encode the text as a UTF-8 byte sequence and give that to MySQL, which read the bytes in as if they were latin-1 characters. This worked alright until MySQL and VARCHAR became more Unicode aware. Now, depending on settings, it may convert the latin-1 bytes to Unicode code points and then encode to your specified encoding. So è (U+00E8, UTF-8: c3 a8) becomes è (U+00C3 U+00A8, UTF-8: c3 83 c2 a8); the equivalent Python transformation is u'è'.encode('utf-8').decode('latin-1').encode('utf-8'). This double encoding is the reason why JOINs will fail with the newer VARBINARY fields used everywhere by MediaWiki now.
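
For clarity, here is that corruption reproduced step by step in Python 3 (a sketch; the expression above uses Python 2's u'' literal):

 # How 'è' becomes 'è': UTF-8 bytes misread as latin-1, then re-encoded.
 original = 'è'                            # U+00E8
 utf8_bytes = original.encode('utf-8')     # b'\xc3\xa8'
 misread = utf8_bytes.decode('latin-1')    # 'Ã¨' (U+00C3 U+00A8)
 double_encoded = misread.encode('utf-8')  # b'\xc3\x83\xc2\xa8'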

This is quite complicated and annoying to work with, and a source of bugs.

Who would benefit

Anyone who runs database queries on Tool Labs.

Proposed solution

Convert the remaining legacy fields and tables so that tricks like CONVERT( CONVERT( CONVERT( your_column USING latin1) USING binary) USING utf8) are no longer needed.
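
For reference, undoing the double encoding is the same trip in reverse; a minimal Python 3 sketch (fix_double_encoding is a hypothetical helper, not part of MediaWiki):

 def fix_double_encoding(text):
     # 'Ã¨' -> 'è': encoding back to latin-1 recovers the original
     # bytes, which were valid UTF-8 all along.
     return text.encode('latin-1').decode('utf-8')

 assert fix_double_encoding('Ã¨') == 'è'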

Event Timeline

Not sure. I'm looking at the database; that seems to be more about the software on top of it. I think that one is blocked by this one? Once everything has been converted, the wgLegacyEncoding thing can be dropped?

Yeah, but the sub-tasks of that are:

Is that list supposed to be complete? What about Commons? Quite sure I ran into it there.

I'm confused. Which fields are we talking about? wgLegacyEncoding only applies to the text table/external storage, which isn't even accessible from tool labs.

@Multichill asked me to show the CREATE TABLE statement for categorylinks in commons, and here it is:

 CREATE TABLE `categorylinks` (
  `cl_from` int(8) unsigned NOT NULL DEFAULT '0',
  `cl_to` varbinary(255) NOT NULL DEFAULT '',
  `cl_sortkey` varbinary(230) NOT NULL DEFAULT '',
  `cl_timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `cl_sortkey_prefix` varbinary(255) NOT NULL DEFAULT '',
  `cl_collation` varbinary(32) NOT NULL DEFAULT '',
  `cl_type` enum('page','subcat','file') NOT NULL DEFAULT 'page',
  UNIQUE KEY `cl_from` (`cl_from`,`cl_to`),
  KEY `cl_timestamp` (`cl_to`,`cl_timestamp`),
  KEY `cl_sortkey` (`cl_to`,`cl_type`,`cl_sortkey`,`cl_from`),
  KEY `cl_collation_ext` (`cl_collation`,`cl_to`,`cl_type`,`cl_from`)
) ENGINE=TokuDB DEFAULT CHARSET=binary `compression`='tokudb_zlib'

This was on labsdb1001. Not sure what this is for or if it is useful :)

Thanks @yuvipanda. In this case it's cl_sortkey_prefix varbinary(255) NOT NULL DEFAULT ''; if you look at https://www.mediawiki.org/wiki/Manual:Categorylinks_table you'll see it's supposed to be human readable and a varchar(255) binary. So it never got converted...

In a random table (zuwiki.categorylinks) in the new labsdb:

CREATE TABLE `categorylinks` (
  `cl_from` int(8) unsigned NOT NULL DEFAULT '0',
  `cl_to` varbinary(255) NOT NULL DEFAULT '',
  `cl_sortkey` varbinary(230) NOT NULL DEFAULT '',
  `cl_timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `cl_sortkey_prefix` varbinary(255) NOT NULL DEFAULT '',
  `cl_collation` varbinary(32) NOT NULL DEFAULT '',
  `cl_type` enum('page','subcat','file') NOT NULL DEFAULT 'page',
  UNIQUE KEY `cl_from` (`cl_from`,`cl_to`),
  KEY `cl_timestamp` (`cl_to`,`cl_timestamp`),
  KEY `cl_sortkey` (`cl_to`,`cl_type`,`cl_sortkey`,`cl_from`),
  KEY `cl_collation_ext` (`cl_collation`,`cl_to`,`cl_type`,`cl_from`)
) ENGINE=InnoDB DEFAULT CHARSET=binary

Hope that's useful!

you'll see it's supposed to be human readable and a varchar(255) binary. So it never got converted...

That is untrue, at least from the schema design perspective (I cannot say anything about the content itself).

This is the binary sortkey, which, depending on $wgCategoryCollation, may or may not be readable by a human (but it should sort in the correct order when compared as a byte string).

varchar(255) binary == varbinary(255) for MySQL. If you create a table with varchar(255) binary, it gets automatically translated to varbinary. We use varchar when possible (and we should continue doing that), because MediaWiki as a product allows selecting the encoding, but we use the binary charset for maximum compatibility on WMF servers:

mysql> CREATE TABLE test (col varchar(255) binary) CHARSET=binary;
Query OK, 0 rows affected (0.02 sec)

mysql> SHOW CREATE TABLE test;
+-------+--------------------------------------------------------------------------------------------------+
| Table | Create Table                                                                                     |
+-------+--------------------------------------------------------------------------------------------------+
| test  | CREATE TABLE `test` (
  `col` varbinary(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=binary |
+-------+--------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

In any case, we should never use utf8mb3 (utf8), but utf8mb4, and even then it is lagging behind in compatibility (especially on older versions of MySQL, which we aim to be compatible with). I am sure the schema is OK.
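
To make the utf8mb3/utf8mb4 distinction concrete: MySQL's utf8 (utf8mb3) stores at most three bytes per character, so anything outside the Basic Multilingual Plane needs utf8mb4. A quick Python 3 illustration of the byte counts involved (a sketch, not from this task):

 # UTF-8 needs 1-4 bytes per character; utf8mb3 only covers the 1-3 byte range.
 for ch in ('e', 'è', 'ไ', '😀'):
     print('U+%04X -> %d bytes' % (ord(ch), len(ch.encode('utf-8'))))
 # U+0065 -> 1 bytes, U+00E8 -> 2 bytes, U+0E44 -> 3 bytes, U+1F600 -> 4 bytes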

IF there is a content problem (which should be raised with the Language team, who are in charge of maintaining that column), that is a MediaWiki matter, not an operations one. IF there is a need to convert content, a maintenance script has to be written, etc. You only need to talk to operations by signaling on Deployments that it is ongoing, to avoid concurrent maintenance.

Independently, Brion agrees with me: https://phabricator.wikimedia.org/T128150#2074605

@jcrespo so you're saying "CONVERT( CONVERT( CONVERT(cl_sortkey_prefix USING latin1) USING binary) USING utf8)" is here to stay?

Thanks @yuvipanda. In this case it's cl_sortkey_prefix varbinary(255) NOT NULL DEFAULT ''; if you look at https://www.mediawiki.org/wiki/Manual:Categorylinks_table you'll see it's supposed to be human readable and a varchar(255) binary. So it never got converted...

That doc page is not canonical and probably wrong. Maintenance/tables.sql is the canonical source for what things should be.

Are you saying that there is data in the cl_sortkey_prefix field encoded as latin-1 instead of UTF-8 NFC? If so, that is wrong - can you give an example?


is here to stay?

I do not know; I have the same question as Bawolff:

If so, that is wrong - can you give an example?

I do not know, I have not checked; that is why I stressed the IF. Please tell us more about it.

I did a bit of digging. In the past I had to use "CONVERT( CONVERT( CONVERT(cl_sortkey_prefix USING latin1) USING binary) USING utf8)", otherwise I would get junk. Something seems to have changed in the meantime. I created https://test.wikipedia.org/wiki/Test_category_encoding and did a simple query to see if cl_sortkey_prefix and "CONVERT( CONVERT( CONVERT(cl_sortkey_prefix USING latin1) USING binary) USING utf8)" were the same. That was the case. I did the same on Commons, with the same result. My assumption is that someone fixed an encoding problem in cl_sortkey_prefix. Every time a page gets saved the field gets overwritten, so over time most of the encoding problems would disappear.

To test this, I ran the following query:

CONNECT commonswiki_p commonswiki.labsdb;
SELECT page_namespace, page_title, cl_sortkey_prefix,
  CONVERT( CONVERT( CONVERT(cl_sortkey_prefix USING latin1) USING binary) USING utf8)
FROM categorylinks
JOIN page ON cl_from = page_id
WHERE NOT cl_sortkey_prefix = CONVERT( CONVERT( CONVERT(cl_sortkey_prefix USING latin1) USING binary) USING utf8)
LIMIT 1000;

This yielded a couple of results:

page_namespace	page_title	cl_sortkey_prefix	CONVERT( CONVERT( CONVERT(cl_sortkey_prefix USING latin1) USING binary) USING utf8)
14	User:Ans/รัฐธรรมนูญแห่งราชอาณาจักรไทย_พุทธศักราช_๒๕๕๐	User:Ans/รัฐธรรมนูญแห่งราชอาณาจักร�	User:Ans/รัฐธรรมนูญแห่งราชอาณาจักร
14	User:Ans/รัฐธรรมนูญแห่งราชอาณาจักรไทย_พุทธศักราช_๒๕๕๐_(with_sort_key)	User:Ans/รัฐธรรมนูญแห่งราชอาณาจักร�	User:Ans/รัฐธรรมนูญแห่งราชอาณาจักร
14	User:Ans/รัฐธรรมนูญแห่งราชอาณาจักรไทย_พุทธศักราช_๒๕๕๐_(with_arabic_sort_key)	User:Ans/รัฐธรรมนูญแห่งราชอาณาจักร�	User:Ans/รัฐธรรมนูญแห่งราชอาณาจักร

(pages like https://test.wikipedia.org/wiki/User:Ans/%E0%B9%91, looks like someone was debugging this in 2009)

I did a null edit on two of these pages, reran the query and got:

page_namespace  page_title      cl_sortkey_prefix       CONVERT( CONVERT( CONVERT(cl_sortkey_prefix USING latin1) USING binary) USING utf8)
14      User:Ans/รัฐธรรมนูญแห่งราชอาณาจักรไทย_พุทธศักราช_๒๕๕๐_(with_arabic_sort_key)  User:Ans/รัฐธรรมนูญแห่งราชอาณาจักร<E0><B9>  User:Ans/รัฐธรรมนูญแห่งราชอาณาจักร

I ran the same query on Commons and that gave only 721 results. So the problem was solved, but the database tables never fully caught up. Does this make more sense to you?
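
The WHERE clause above is effectively a UTF-8 round-trip check. Roughly the same test in Python 3, assuming the column is fetched as raw bytes (is_broken is a hypothetical name, not from this task):

 # A value is flagged when its bytes do not survive a UTF-8 round trip;
 # errors='ignore' drops invalid byte sequences, much as the CONVERT
 # chain appears to do in the output above.
 def is_broken(raw):
     return raw.decode('utf-8', errors='ignore').encode('utf-8') != raw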

looks like someone was debugging this in 2009

It should be noted that cl_sortkey_prefix was only introduced about halfway through 2010, so that person was probably debugging a separate issue.

Anyways, there's nothing wrong with the data, except that when you convert ไ U+0E44 THAI CHARACTER SARA AI MAIMALAI (in hex that's 0xE0B984) from binary to latin1 to binary to utf-8 in MySQL, things explode. However, it's perfectly valid Unicode; the only problem is that converting between latin-1, utf-8, and binary is not a lossless operation. [Ignore this, I made a typo when checking things]

Sorry, I made a mistake in my last comment. It does, however, appear to be a problem not with latin-1 encoding but with dangling Unicode characters. (It ends with E0B9. Based on the page title, that's 2/3rds of a ไ in UTF-8, which is much more likely than the windows-1252 equivalent of ๠.)
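
That failure mode is easy to reproduce in Python 3 (a sketch):

 # b'\xe0\xb9' is the first two of the three UTF-8 bytes of ไ (U+0E44).
 assert 'ไ'.encode('utf-8') == b'\xe0\xb9\x84'
 dangling = b'\xe0\xb9'
 print(dangling.decode('utf-8', errors='replace'))  # '�', as in the query output above
 dangling.decode('utf-8')                           # raises UnicodeDecodeError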

In older versions of MW, the maximum field size for cl_sortkey (which eventually became cl_sortkey_prefix) was 70 bytes. This value gets cut off at 86 bytes, so perhaps that was the field size on the Wikimedia cluster at one point. Modern MW should truncate on a character boundary so there is no dangling Unicode, and it also allows a maximum sortkey prefix size of 255 bytes.
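
A minimal Python 3 sketch of such boundary-aware truncation (truncate_utf8 is a hypothetical helper, not MediaWiki's actual implementation):

 def truncate_utf8(text, max_bytes):
     # Cut to a byte budget without leaving a dangling partial character:
     # errors='ignore' drops any partial UTF-8 sequence at the cut point.
     raw = text.encode('utf-8')
     if len(raw) <= max_bytes:
         return text
     return raw[:max_bytes].decode('utf-8', errors='ignore')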

This double encoding is the reason why JOINs will fail with the newer VARBINARY fields used everywhere by MediaWiki now.

This is quite complicated and annoying to work with, and a source of bugs.

In the interest of figuring out how important this is, could you be more specific, with actual examples of where this causes problems? Historically, most of the issues with dangling Unicode have happened with the various *_comment fields (edit summaries) and, I guess, the cl_sortkey_prefix thing we are talking about now. None of these are usual targets of joins, since they normally don't really match up with anything else in the DB; they are basically free text entered by the user.

What is the scope of this task? DBs/tables on Wikimedia sites? Changing some table definitions in the MediaWiki codebase? DBs/tables in Tool Labs / Labs?

Aklapper: I am going to assume the above based on the feedback (I may be wrong), adding DBA at a later point if something has to be done manually (?).

This proposal is selected for the Developer-Wishlist voting round and will be added to a MediaWiki page very soon. To the subscribers or proposer of this task: please help modify the task description: add a brief summary (10-12 lines) of the problem that this proposal raises, the topics discussed in the comments, and a proposed solution (if there is one yet). Remember to add a header titled "Description" to your content. Please do so before February 5th, 12:00 pm UTC.

Tgr subscribed.

I'm moving this to On Hold for the dev wishlist as it's unclear whether the issue still exists.