Page MenuHomePhabricator

Deal with utf8 issues for new install
Closed, ResolvedPublic

Description

We are having ongoing hiccups related to the charset on our install being 'less preferred' and more specifically no longer supported for new installs for CiviCRM. For most sites this new-install limit is not an issue as their site is installed & not changing.

However, for us the issue is that we are constantly creating new installs - both in CI and in our dev environments. It seems that our default charset in our docker (and on live) is DEFAULT CHARSET=latin1 - this means that if we don't explictly specify the charset for new tables it uses latin1 - which results in the test fail @Cstone experienced and which I was able to replicate. Since those tests are NOT failing on jenkins I think jenkins must have a default of utf8 since testDiscriticConflicts works on jenkins.

The 'what has changed' in all this is that following new civicrm recommendations we are no longer specifying the charset when we create the civicrm_contact_name_pair table. This is 'correct' as the db charset should be either utf8 or utfmb48 and by not specifying we can pick up the right one.

However, when it is not right it goes to custard....

I think there are a couple of things we can do

  1. I think we should update live to use utf8 as the default charset - this won't change existing tables, just new ones created with no override which is a likely future gotcha if we don't
  2. an easy short-term fix might be to let that same upgrade script set the db default for our local dbs. However, it would probably be better to have that as part of the docker settings in the medium term

@AndyRussG @jgleeson pinging you in also because of the note about docker although I think I'll do the quick fix first & potentially create a follow up ticket
@Cstone this is the issue you were hitting & I think it's also the problem I've been seeing on my upgrade prep (the failing test in gerrit)

Event Timeline

@Dwisehaupt @Jgreen I just tried to run this on staging civicrm db

ALTER DATABASE dbname CHARACTER SET utf8 COLLATE utf8_general_ci;

But it's above my paygrade. I suspect the same is true on live too. Making this change would mean all CREATE TABLES created that do not specifically state the character set & collation will use the above. It won't change exisitng tables but they would already be using those charsets.

I think it's probably best if you make that change on live & I'll focus on how best to do it for our dev scenarios

Just noting we'll probably want utf8mb4 to get full 4 byte utf8 characters instead of the 3 byte default.

ALTER DATABASE dbname CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;

https://mariadb.com/kb/en/unicode/
https://mariadb.com/kb/en/supported-character-sets-and-collations/

I could totally be convinced to go with utf8 and then just certain tables/columns as utf8mb4, but it seems to be more future proof to just go 4 byte from the get go.

@Dwisehaupt I think utf8 is better and then change to utfmb4 when we alter the tables. Currently all our tables are utf8 - which we want to change to utf8mb4. However, we need to avoid creating some tables using utf8mb4 accidentally when the rest are on utf8 because the joins won't work between utf8 & utf8mb4 fields

(roll on db convert!)

Complete on frdev1001. Some dbs listed are still latin1/etc as I will be changing them on the primary.

Complete on frdb1002 and verified that replication did the right thing. You should be good to proceed.

Cool - I still have to deal with the docker images charset but at least I know prod is done

Change 662809 had a related patch set uploaded (by Eileen; owner: Eileen):
[wikimedia/fundraising/crm@master] Set mysql global row format to dynamic

https://gerrit.wikimedia.org/r/662809

Change 662809 merged by jenkins-bot:
[wikimedia/fundraising/crm@master] Set mysql global row format to dynamic

https://gerrit.wikimedia.org/r/662809

Change 662819 had a related patch set uploaded (by Eileen; owner: Eileen):
[wikimedia/fundraising/crm@master] Update geocoder module

https://gerrit.wikimedia.org/r/662819

Change 663044 had a related patch set uploaded (by Eileen; owner: Eileen):
[wikimedia/fundraising/crm/civicrm@master] Replace utf8mb4 with utf8 for CI to work

https://gerrit.wikimedia.org/r/663044

Change 663045 had a related patch set uploaded (by Eileen; owner: Eileen):
[wikimedia/fundraising/crm/civicrm@master] Remove charset declarations & reply on db default

https://gerrit.wikimedia.org/r/663045

Change 663044 abandoned by Eileen:
[wikimedia/fundraising/crm/civicrm@master] Replace utf8mb4 with utf8 for CI to work

Reason:

https://gerrit.wikimedia.org/r/663044

Change 663047 had a related patch set uploaded (by Eileen; owner: Eileen):
[wikimedia/fundraising/crm/civicrm@master] Try removing utf8 declaration from pre-upgrade code

https://gerrit.wikimedia.org/r/663047

Change 662819 merged by jenkins-bot:
[wikimedia/fundraising/crm@master] Update geocoder module

https://gerrit.wikimedia.org/r/662819

Change 663089 had a related patch set uploaded (by Eileen; owner: Eileen):
[wikimedia/fundraising/crm@master] Update data_strings to not specify engine

https://gerrit.wikimedia.org/r/663089

Change 663047 abandoned by Eileen:
[wikimedia/fundraising/crm/civicrm@master] Try removing utf8 declaration from pre-upgrade code

Reason:

https://gerrit.wikimedia.org/r/663047

Change 663045 abandoned by Eileen:
[wikimedia/fundraising/crm/civicrm@master] Remove charset declarations & reply on db default

Reason:

https://gerrit.wikimedia.org/r/663045

Change 663098 had a related patch set uploaded (by Eileen; owner: Eileen):
[wikimedia/fundraising/crm@master] Aha - it's unicode not general in the collate

https://gerrit.wikimedia.org/r/663098

@Dwisehaupt working through this on our CI enviroment we learnt 2 things

  1. the following global settings are required to support utf8mb4 (some of which are already set) SET GLOBAL innodb_file_format='Barracuda'; SET GLOBAL innodb_default_row_format='dynamic'; SET GLOBAL innodb_file_per_table = 1; SET GLOBAL innodb_large_prefix = ON;
  1. The collation for the drupal db & civi db differs

civicrm - utf8_unicode_ci / utf8mb4_unicode_ci
drupal - utf8_general_ci / utf8mb4_general_ci

At the moment both are set to general on live

(note we won't be creating any new tables tomorrow so resolving this is not on the critical part but it is housekeeping)

Change 663089 merged by Eileen:
[wikimedia/fundraising/crm@master] Update data_strings to not specify engine

https://gerrit.wikimedia.org/r/663089

Change 663098 merged by jenkins-bot:
[wikimedia/fundraising/crm@master] Aha - it's unicode not general in the collate

https://gerrit.wikimedia.org/r/663098

The collation setting has been adjusted on the civicrm and dev_civicrm databases. Innodb global setting changes have been tasked in T274438.

Change 663337 had a related patch set uploaded (by Eileen; owner: Eileen):
[wikimedia/fundraising/crm@master] Add deduper patch

https://gerrit.wikimedia.org/r/663337

@Dwisehaupt working through this on our CI enviroment we learnt 2 things

  1. the following global settings are required to support utf8mb4 (some of which are already set) SET GLOBAL innodb_file_format='Barracuda'; SET GLOBAL innodb_default_row_format='dynamic'; SET GLOBAL innodb_file_per_table = 1; SET GLOBAL innodb_large_prefix = ON;

These have also been applied and codified in puppet for my.cnf. Merged and pushed out as stated in T274438. All set from our side for those settings.

Change 663337 abandoned by Eileen:
[wikimedia/fundraising/crm@master] Add deduper patch

Reason:

https://gerrit.wikimedia.org/r/663337