
DBs provided on Tool Labs should have utf8 charset and collation by default
Closed, ResolvedPublic

Description

The DB I got was latin1 and latin1_swedish_ci. (?!)

Event Timeline

Ijon raised the priority of this task to Needs Triage.
Ijon updated the task description.
Ijon added a project: Toolforge.
Ijon subscribed.
Restricted Application added subscribers: StudiesWorld, Josve05a, Aklapper.

Not sure if it should be utf8 or utf8mb4.

valhallasw set Security to None.
valhallasw removed a subscriber: jcrespo.
valhallasw added subscribers: valhallasw, jcrespo.

Edit conflict :(

Probably utf8mb4, with either the utf8mb4_unicode_ci or utf8mb4_bin collation.

I'm not entirely sure what the impact of this is on databases created from data from replica databases. If I remember correctly, they store utf-8 binary data in a column marked as latin1 -- and copying that data to a database with utf8mb4 encoding might mangle the data.
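To make that concern concrete, here is a rough sketch. It assumes a hypothetical table `pages` with a `title VARCHAR(255)` column that is declared latin1 but actually holds UTF-8 bytes (both names are made up for illustration). A plain conversion to utf8mb4 would treat every stored byte as a latin1 character and double-encode the text. Going through a binary type relabels the bytes instead of transcoding them, which is the approach the MySQL manual describes for this kind of column. Try it on a copy first.

```sql
-- Preview only: reinterpret the stored bytes as utf8mb4 without changing the table.
SELECT title,
       CONVERT(CAST(title AS BINARY) USING utf8mb4) AS reinterpreted
FROM pages
LIMIT 10;

-- Relabel the column in two steps so the bytes are kept as-is:
-- 1) drop the (wrong) latin1 label by switching to a binary type,
-- 2) attach the correct character set.
-- MODIFY replaces the whole column definition, so restate NOT NULL,
-- DEFAULT, etc. if the real column has them.
ALTER TABLE pages MODIFY title VARBINARY(255);
ALTER TABLE pages MODIFY title VARCHAR(255)
  CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
```

If the column really contains latin1 text rather than mislabelled UTF-8, this would be the wrong tool, and a normal character set conversion is what is needed.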

Replica databases have to stay in binary collation. I am talking here about toolsdb, which certainly is misconfigured.

Change 260558 had a related patch set uploaded (by Jcrespo):
Setting default character set as utf8mb4 for toolsdb

https://gerrit.wikimedia.org/r/260558

Change 260558 merged by Jcrespo:
Setting default character set as utf8mb4 for toolsdb

https://gerrit.wikimedia.org/r/260558

Change 260559 had a related patch set uploaded (by Jcrespo):
Correcting typo on tooldb client configuration

https://gerrit.wikimedia.org/r/260559

Change 260559 merged by Jcrespo:
Correcting typo on tooldb client configuration

https://gerrit.wikimedia.org/r/260559

$ mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 201372010
Server version: 5.5.39-MariaDB-log Source distribution

Copyright (c) 2000, 2014, Oracle, Monty Program Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB TOOLSDB master localhost (none) > SHOW GLOBAL VARIABLES like 'character%';
+--------------------------+----------------------------------+
| Variable_name            | Value                            |
+--------------------------+----------------------------------+
| character_set_client     | utf8mb4                          |
| character_set_connection | utf8mb4                          |
| character_set_database   | utf8mb4                          |
| character_set_filesystem | binary                           |
| character_set_results    | utf8mb4                          |
| character_set_server     | utf8mb4                          |
| character_set_system     | utf8                             |
| character_sets_dir       | /opt/wmf-mariadb/share/charsets/ |
+--------------------------+----------------------------------+
8 rows in set (0.00 sec)

This is fixed. However, the character set is negotiated between the client and the server, and the client (connection) can change it at any time.

The local client should have good defaults, which can be configured in the client's /etc/my.cnf or $HOME/.my.cnf, or by executing "SET NAMES".

Some services, like some versions of PHP, ignore the server's suggestion and set their own default.
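As a minimal sketch of the connection-side setting, a client that does not trust its library's defaults can set the character set explicitly right after connecting and then check what the session ended up with. The choice of utf8mb4_unicode_ci here simply mirrors the server defaults; another utf8mb4 collation may suit a given tool better.

```sql
-- Ask the server to use utf8mb4 for this connection
-- (sets character_set_client, character_set_connection and character_set_results).
SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Check what this session is actually using, as opposed to the global defaults.
SHOW SESSION VARIABLES LIKE 'character_set_%';
SHOW SESSION VARIABLES LIKE 'collation_connection';
```

SHOW GLOBAL VARIABLES only reports the server defaults, so the SESSION form is the one to look at when debugging a specific client.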

I've fixed everything I can on the server side, but I strongly suggest setting the right character set on connection and on database creation. Also, existing data is not converted automatically: databases and tables created with latin1 will remain latin1 unless specifically converted. For more info, see: http://dev.mysql.com/doc/refman/5.7/en/charset-connection.html
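For illustration, setting the character set at creation time and converting something that already exists could look roughly like this. The database names `s12345__newtool` and `s12345__oldtool` and the table `pages` are hypothetical placeholders, not real objects on toolsdb.

```sql
-- New database: state the character set and collation explicitly
-- instead of relying on the server default.
CREATE DATABASE s12345__newtool
  CHARACTER SET utf8mb4
  COLLATE utf8mb4_unicode_ci;

-- Existing database: see what it was created with.
SELECT SCHEMA_NAME, DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
FROM information_schema.SCHEMATA
WHERE SCHEMA_NAME = 's12345__oldtool';

-- Change the default used for tables created from now on...
ALTER DATABASE s12345__oldtool
  CHARACTER SET utf8mb4
  COLLATE utf8mb4_unicode_ci;

-- ...and convert an existing table, since changing the database default
-- does not touch tables that already exist.
ALTER TABLE s12345__oldtool.pages
  CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
```

CONVERT TO transcodes the stored text, so it assumes the data really is in the character set the column declares. With utf8mb4, long indexed VARCHAR columns can also run into index key length limits on older InnoDB setups, so it is worth testing on a copy first.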

jcrespo closed this task as Resolved. Edited · Dec 22 2015, 2:28 PM
jcrespo claimed this task.
MariaDB TOOLSDB master localhost (none) > SHOW GLOBAL VARIABLES like 'collat%';
+----------------------+--------------------+
| Variable_name        | Value              |
+----------------------+--------------------+
| collation_connection | utf8mb4_unicode_ci |
| collation_database   | utf8mb4_unicode_ci |
| collation_server     | utf8mb4_unicode_ci |
+----------------------+--------------------+
3 rows in set (0.00 sec)

Of course, not all databases will need the UTF-8 (utf8mb4) character set; configuring those is the responsibility of the user or application.