
Broken unicode characters / invalid UTF-8 on Tool Labs index
Closed, Resolved · Public

Description

Event Timeline

Restricted Application added a subscriber: Aklapper.

The beta is actually https://tools.wmflabs.org/admin-beta/tools, but it shows the same problem.

This doesn't look right either:

$ ldap 'uid=jeanfred' cn
dn: uid=jeanfred,ou=people,dc=wikimedia,dc=org
cn:: SmVhbi1GcsOpZMOpcmlj

Did something change with our LDAP servers recently?
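
For what it's worth, the double colon in cn:: just means the LDIF value is base64-encoded because it contains non-ASCII bytes. A quick sanity check in a Python 2 shell decodes it back to the expected UTF-8 name, so the LDAP entry itself looks intact:

>>> import base64
>>> base64.b64decode('SmVhbi1GcsOpZMOpcmlj')
'Jean-Fr\xc3\xa9d\xc3\xa9ric'
>>> print base64.b64decode('SmVhbi1GcsOpZMOpcmlj').decode('utf-8')
Jean-Frédéric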

Direct mysql query shows the encoding problem in the backing database:

(u3518@tools-db) [toollabs_p]> select * from users where name = 'jeanfred'\G
*************************** 1. row ***************************
    name: jeanfred
      id: 3076
wikitech: Jean-Frédéric
    home: /home/jeanfred

That table is maintained by the updatetools script which runs on tools-services-01.tools.eqiad.wmflabs.

My guess at this point is either that something changed in our LDAP server config or in the config for that database that is now causing the unicode data to be stored in the db incorrectly.

This looks ok:

$ getent passwd jeanfred
jeanfred:x:3076:500:Jean-Frédéric:/home/jeanfred:/bin/bash

The python script is doing something similar via pwd.getpwnam(project_member).

$ python
Python 2.7.6 (default, Oct 26 2016, 20:30:19)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pwd
>>> pwd.getpwnam('jeanfred').pw_gecos
'Jean-Fr\xc3\xa9d\xc3\xa9ric'

The string Jean-Fr\xc3\xa9d\xc3\xa9ric is UTF-8 encoded. Python doesn't know that, however; it's just a pile of bytes until the decode('utf-8') step turns it into a unicode object.

>>> pwd.getpwnam('jeanfred').pw_gecos.decode('utf-8')
u'Jean-Fr\xe9d\xe9ric'

Both the encoded and decoded versions print ok in my terminal though, since the terminal is UTF-8: the raw byte string passes straight through, and print encodes the unicode string back to UTF-8.

>>> print pwd.getpwnam('jeanfred').pw_gecos
Jean-Frédéric
>>> print pwd.getpwnam('jeanfred').pw_gecos.decode('utf-8')
Jean-Frédéric

Let's go look at the database side. We saw before that sql local gave the same UTF-8 treated as latin1 mess as we are seeing on the admin tool: Jean-Frédéric

(u3518@tools-db) [toollabs_p]> SHOW FULL COLUMNS FROM users;
+----------+-------------+-----------------+------+-----+---------+-------+------------+---------+
| Field    | Type        | Collation       | Null | Key | Default | Extra | Privileges | Comment |
+----------+-------------+-----------------+------+-----+---------+-------+------------+---------+
| name     | varchar(64) | utf8_general_ci | NO   | PRI | NULL    |       | select     |         |
| id       | int(11)     | NULL            | NO   | UNI | NULL    |       | select     |         |
| wikitech | varchar(64) | utf8_general_ci | NO   | UNI | NULL    |       | select     |         |
| home     | text        | utf8_general_ci | NO   |     | NULL    |       | select     |         |
+----------+-------------+-----------------+------+-----+---------+-------+------------+---------+
4 rows in set (0.00 sec)

(u3518@tools-db) [toollabs_p]> SELECT @@character_set_database;
+--------------------------+
| @@character_set_database |
+--------------------------+
| utf8                     |
+--------------------------+
1 row in set (0.01 sec)

So mysql thinks this should be UTF-8 data.

(u3518@tools-db) [toollabs_p]> SET NAMES 'utf8';
Query OK, 0 rows affected (0.00 sec)

(u3518@tools-db) [toollabs_p]> select * from users where name = 'jeanfred'\G
*************************** 1. row ***************************
    name: jeanfred
      id: 3076
wikitech: Jean-Frédéric
    home: /home/jeanfred
1 row in set (0.01 sec)

(u3518@tools-db) [toollabs_p]> SET NAMES 'latin1';
Query OK, 0 rows affected (0.00 sec)

(u3518@tools-db) [toollabs_p]> select * from users where name = 'jeanfred'\G
*************************** 1. row ***************************
    name: jeanfred
      id: 3076
wikitech: Jean-Frédéric
    home: /home/jeanfred
1 row in set (0.00 sec)

Ummm... so if I tell the mysql shell I'm using to treat the connection as latin1, then I see the correct output in my terminal.
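
That pattern looks like classic double encoding: UTF-8 bytes were written over a connection that declared itself latin1, so the server transcoded each individual byte into the utf8 column. A rough sketch of the suspected round trip in Python 2 (illustrative only, not the actual updatetools code path):

>>> utf8_bytes = 'Jean-Fr\xc3\xa9d\xc3\xa9ric'     # the raw GECOS bytes from pwd.getpwnam()
>>> stored = utf8_bytes.decode('latin1')           # what lands in the utf8 column if the writer claims latin1
>>> print stored
Jean-FrÃ©dÃ©ric
>>> print stored.encode('latin1').decode('utf-8')  # a latin1 read connection reverses the damage
Jean-Frédéric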

I live hacked the admin tool and found that it is connecting to mysql with a character set of utf8mb4. If I change that to utf8, things are still messed up in the same way, but if I change it with $db->set_charset( 'latin1' ); the page output looks correct. I have live hacked the two pages that connect to the database and added that setting. I'll prepare a proper patch later.

Mentioned in SAL (#wikimedia-labs) [2017-05-10T22:50:58Z] <bd808> Live hacked for utf-8 character encoding problem (T164971)

I have a workaround for this now via PHP changes, but something must have changed to cause this encoding error. I know the PHP application code has not changed for some time. updatetools hasn't changed since it was added to puppet. Something could have changed in the system PHP configuration, but it is affecting a tool running on a Trusty host and a Jessie container equally, so that seems unlikely. We changed backend LDAP technology at one point, but that was quite some time ago, and I find it unlikely that nobody would have reported this kind of encoding error for months. Right now that leaves me wondering if something changed about the mysql server.

@jcrespo, would the tools-db host have had configuration or software changes in the last month or two that might have altered how python stores strings or php reads them?

The only thing that changed in the "last month or two" was the upgrade from 5.5 to MariaDB 10 (T157358). Some defaults changed there, but it is normally the responsibility of the developer to send and set the default charset he or she wants to use (mainly because the server is used by many applications, each with separate needs). In particular, once a database is created, changing the default connection charset doesn't change the underlying database, table or column format, so you may have been using latin1 from the start? Setting utf8 as the default character set on that server, however, took place a long time before that, at least a year or two ago (T122148), so I am going to guess maybe a connector update that uses utf8 by default rather than latin1 (e.g. maybe a jessie upgrade)? Or maybe the column types were at some point changed without a proper byte conversion? I know people keep trying to store utf8 into latin1 fields, and there is some legacy of such horrible applications from the past.

With this I am not trying to say "not my fault, bye"; it doesn't matter who or when. The point is that we should avoid the latin1 charset (especially for production databases), and that the current contents can be converted normally, with no data loss, to either utf8mb4 or binary. I can convert that for you, but we need to coordinate so that we stop hardcoding latin1 as the charset at the same time the conversion happens.

This comment kind of summarizes the whole thing, and it is probably very relevant since I will assume this is PHP: T122148#1898323

bd808 triaged this task as Medium priority.

@jcrespo I agree that this is the result of sloppy handling of encoding on the part of the applications that read/write to this database. The proper fix will be to convert the tables to use utf8mb4 and ensure that both the python and php scripts that are accessing the tables negotiate the proper encoding when connecting. This whole database is repopulated on a very regular basis so that shouldn't be too hard once I do some tests to make sure I understand how to configure both clients.
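
For the python side, that mostly means asking the connector for utf8mb4 explicitly at connect time. A minimal sketch, assuming updatetools keeps using MySQLdb (the credentials below are placeholders, not the real config):

import MySQLdb

# Placeholder credentials; the real values come from the updatetools config.
conn = MySQLdb.connect(
    host='tools-db',
    user='someuser',
    passwd='secret',
    db='toollabs_p',
    charset='utf8mb4',   # negotiate the connection character set explicitly
    use_unicode=True,    # get unicode objects back instead of raw bytes
)
cur = conn.cursor()
cur.execute("SELECT wikitech FROM users WHERE name = %s", ('jeanfred',))
print cur.fetchone()[0]  # Jean-Frédéric, once the stored data has been converted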

I am not asking you to handle it on your own, I can just convert it and we can change the config to use utf8mb4. Should we test on a sample table first?


I have things hacked so that they are working at the moment. My plan for fixing this is to create a new "utf8mb4_toollabs_p" database with the proper utf8mb4 encoding on its tables, then write test code in python2 to populate it and PHP code to read from it. When I can get that working such that python, php, and the mysql cli client all agree on how to handle the unicode data, I'll put up a Puppet patch to change the Python script that populates the tables and another to change the PHP webservice that renders the pages. It will probably take me a few days to find time to work on this with upcoming travel, but that's ok. We can live with the ugly utf-8-in-latin1 storage workarounds for a while.
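
A minimal sketch of that kind of round-trip test on the python side (the utf8mb4_toollabs_p database is the hypothetical one described above, the credentials are placeholders, and MySQLdb is assumed):

import MySQLdb

conn = MySQLdb.connect(host='tools-db', user='someuser', passwd='secret',
                       db='utf8mb4_toollabs_p', charset='utf8mb4', use_unicode=True)
cur = conn.cursor()
name = u'Jean-Fr\xe9d\xe9ric'
cur.execute("INSERT INTO users (name, id, wikitech, home) VALUES (%s, %s, %s, %s)",
            ('jeanfred', 3076, name, '/home/jeanfred'))
conn.commit()
cur.execute("SELECT wikitech FROM users WHERE name = %s", ('jeanfred',))
# Only passes when the column, the writer, and the reader all agree on utf8mb4.
assert cur.fetchone()[0] == name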

bd808 lowered the priority of this task from Medium to Low. · Jan 10 2018, 5:07 PM

Still needs doing properly 9 months later... :/

Mentioned in SAL (#wikimedia-cloud) [2019-07-30T04:19:03Z] <bd808> Update to 13cade0 "Use utf8mb4 for database encoding" (T164971)

MariaDB [toollabs_p]> alter table users convert to character set utf8mb4;
Query OK, 1810 rows affected (0.57 sec)
Records: 1810  Duplicates: 0  Warnings: 0

MariaDB [toollabs_p]> alter table tools convert to character set utf8mb4;
Query OK, 2360 rows affected (0.62 sec)
Records: 2360  Duplicates: 0  Warnings: 0

Mentioned in SAL (#wikimedia-cloud) [2019-07-30T04:21:51Z] <bd808> Update to 13cade0 "Use utf8mb4 for database encoding" (T164971)

Change 526309 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge: modernize updatetools script

https://gerrit.wikimedia.org/r/526309

Change 526309 merged by Bstorm:
[operations/puppet@production] toolforge: modernize updatetools script

https://gerrit.wikimedia.org/r/526309

bd808 removed bd808 as the assignee of this task. · May 20 2020, 7:10 PM
bd808 lowered the priority of this task from Low to Lowest.
bd808 moved this task from Doing to Soon! on the cloud-services-team (Kanban) board.

I honestly don't even remember what this is stuck on at this point. :/

Krinkle claimed this task.

Seems to work fine now. I don't see any mangled names anymore.