Page MenuHomePhabricator

Clean up our five character locale data....
Closed, ResolvedPublic

Description

Following on from our clean up of 2 character locale data we should clean up our 5 character locale data.

This is mostly the same from a technical pov - ie we can add more locales to the job we have already scheduled. There is a small QA task in there however.

The data I'm referring to is the option values that are not real languages - in theory the languages should map to actual languages the users have accessed. However, historically our code just added together the language string 'en' and the country string 'DK' to get 'en_DK' - since that wasn't in the database it was just added.

Later we made it so that it would not add new languages but rather come up with a reaslistic fallback - ie 'en_US' since that is the language we actually send emails in.

However, when we look at our language variants there are two types - 'real ones' and 'made up ones' . Since we fall back to 'en_US' for all of them currently anyway I don't think we need to be too careful about the 'real ones' - but I think we should at least 'legitimise' obvious real ones - such as 'en_NZ' "English (New Zealand)" and 'en_IN' "English (India)" which we know to be official languages of the respective countries without too much research. I added an upstream gitlab for this too https://lab.civicrm.org/dev/core/-/issues/3928

image.png (1×1 px, 118 KB)

For the made up ones - we should fix the contact languages, using the process control method we used for the two letter ones & remove the option

Event Timeline

till Oct 26, 2022 civi en email only valid for prefer language with en_US/ en_AU / en_CA / en_GB / en_ZA / en_NZ / en_IN (should make last 2 official), while db got below. and they are also added as option for civi, needs to clean them and them remove the option.
MariaDB [civicrm]> select preferred_language, count(*) FROM civicrm_contact WHERE preferred_language like "en_%" and preferred_language not in ('en_US', 'en_AU', 'en_ZA', 'en_GB', 'en_CA', 'en_NZ'. 'en_IN') group by preferred_language;

preferred_languagecount(*)
en_AD83
en_AE2
en_AF67
en_AG65
en_AI19
en_AL187
en_AM192
en_AO45
en_AR19322
en_AS9
en_AT47242
en_AW65
en_AX13
en_AZ158
en_BA249
en_BB164
en_BD117
en_BE84506
en_BF19
en_BG2475
en_BH326
en_BI7
en_BL1
en_BM327
en_BN95
en_BO84
en_BQ23
en_BR84713
en_BS261
en_BT12
en_BW138
en_BZ80
en_CF3
en_CG6
en_CH4419
en_CK34
en_CL13813
en_CM34
en_CN15178
en_CO10201
en_CR923
en_CW41
en_CY889
en_CZ3845
en_DE22393
en_DJ10
en_DK40813
en_DM44
en_DO476
en_DZ54
en_EC460
en_EE3477
en_EG523
en_ES75582
en_ET50
en_FJ96
en_FK11
en_FM10
en_FO71
en_FR235595
en_GA15
en_GD49
en_GE322
en_GF25
en_GG55
en_GH93
en_GI128
en_GL32
en_GM12
en_GN9
en_GP51
en_GQ3
en_GR3283
en_GT352
en_GU61
en_GY39
en_HK18563
en_HN207
en_HR1542
en_HT41
en_HU23131
en_IE249184
en_IL76987
en_IM89
en_IN1605717
en_IS1001
en_IT26980
en_JE76
en_JM568
en_JO309
en_JP49541
en_KE76
en_KG39
en_KH252
en_KN43
en_KW805
en_KY228
en_KZ316
en_LA44
en_LC49
en_LI29
en_LK457
en_LR15
en_LS31
en_LT190
en_LU7088
en_LV8059
en_MA279
en_MC212
en_MD258
en_ME88
en_MF11
en_MG30
en_MH8
en_MK142
en_ML12
en_MN75
en_MO77
en_MP8
en_MQ52
en_MR5
en_MT793
en_MU182
en_MV47
en_MW38
en_MX56536
en_MY61319
en_MZ85
en_NA149
en_NC57
en_NG870
en_NI116
en_NL336839
en_NO21439
en_NP85
en_NZ255635
en_OM320
en_PA626
en_PE5863
en_PF42
en_PG83
en_PH2739
en_PK913
en_PL37656
en_PR584
en_PS28
en_PT37547
en_PW15
en_PY37
en_QA864
en_RE90
en_RO32879
en_RS926
en_RW37
en_SB11
en_SC24
en_SE158190
en_SG25644
en_SI1398
en_SK2464
en_SL25
en_SN34
en_SR16
en_SV187
en_SX25
en_SZ17
en_TC64
en_TD6
en_TH1962
en_TJ10
en_TL9
en_TM8
en_TN115
en_TO12
en_TT876
en_TW2138
en_TZ104
en_UA9928
en_UG105
en_UY3519
en_UZ22
en_VC23
en_VE363
en_VG35
en_VI74
en_VN1546
en_VU21
en_WS8
en_XX246
en_ZM84

190 rows in set (2.117 sec)

MariaDB [civicrm]> select is_deleted, count(*) FROM civicrm_contact WHERE preferred_language like "en_%" and preferred_language not in ('en_US', 'en_AU', 'en_ZA', 'en_GB', 'en_CA') group by is_deleted;

is_deletedcount(*)
03291583
1520646

2 rows in set (11.325 sec)
Start for this one, that we should clean 3291583 of contact and then clean the option table civicrm_option_value where value = 'en';
with the drush commend as sh -c "echo '{\"values\":{\"preferred_language\":\"en_US\"},\"where\":[[\"preferred_language\",\"NOT IN\",[\"en_US\", \"en_CA\",\"en_ZA\", \"en_AU\", \"en_GB\", \"en_NZ\", \"en_IN\"]], [\"preferred_language\",\"LIKE\",\"en_%\"]],\"limit\":5000, \"version\":4}' | drush @wmff cvapi Contact.update --in=json"

then run delete from civicrm_option_value where value = 'en' and name not in ('en_US', 'en_AU', 'en_ZA', 'en_GB', 'en_CA','en_NZ', 'en_IN'); to get rid of the invalid en option from civi dropdown

echo '{"where":[["name","NOT IN",["en_GB","en_US","en_CA","en_AU","en_ZA","en_NZ","en_IN"]],["value","=","en"]]}' | cv api4 OptionValue.delete --in=json;
actually use https://gerrit.wikimedia.org/r/c/wikimedia/fundraising/crm/+/856688/ this (drush @wmff cvapi WMFDataManagement.CleanInvalidLanguageOptions version=4) to clean the unused language options

still seeing some new contact been created with the invalid en_xx, id 30851434, 57584905 and 750707 from online donation. but switching to other language first

for spanish, civi supports es_ES, es_MX, and es_PR, while from civi db, we have
MariaDB [civicrm]> select preferred_language, count(*) FROM civicrm_contact WHERE preferred_language like "es_%" and preferred_language not in ('es_PR', 'es_MX', 'es_ES') group by preferred_language;

preferred_languagecount(*)
es_419263
es_AD107
es_AE20
es_AF8
es_AO5
es_AR176097
es_AT334
es_AU168
es_AW1
es_AZ1
es_BE756
es_BF1
es_BG13
es_BJ1
es_BM2
es_BO88
es_BR296
es_BT2
es_CA467
es_CH80
es_CK1
es_CL93076
es_CN47
es_CO72733
es_CR1642
es_CV3
es_CY17
es_CZ52
es_DE470
es_DK168
es_DO1168
es_DZ2
es_EC1014
es_EE10
es_EG2
es_ET1
es_FR446
es_GB854
es_GE2
es_GH4
es_GI2
es_GR26
es_GT392
es_GW1
es_GY1
es_HK37
es_HN217
es_HR5
es_HT3
es_HU47
es_IE117
es_IL119
es_IN8
es_IS3
es_IT713
es_JO2
es_JP94
es_KE3
es_KG1
es_KH5
es_KW6
es_KZ3
es_LA2
es_LT5
es_LU99
es_LV10
es_MA8
es_MD2
es_ME1
es_MT8
es_MV1
es_MW2
es_MY12
es_MZ2
es_NA1
es_NE1
es_NI250
es_NL359
es_NO170
es_NZ46
es_OM3
es_PA361
es_PE67078
es_PF1
es_PH8
es_PK1
es_PL81
es_PT266
es_PY191
es_QA15
es_RO46
es_RS3
es_RW1
es_SC1
es_SE177
es_SG20
es_SI2
es_SK17
es_SN1
es_SV245
es_TH23
es_TN5
es_TT4
es_TW9
es_UA15
es_US23033
es_UY17388
es_VE1985
es_VG1
es_VN7
es_XX144
es_ZA12

112 rows in set (0.298 sec)
MariaDB [civicrm]> select is_deleted, count(*) FROM civicrm_contact WHERE preferred_language like "es_%" and preferred_language not in ('es_PR', 'es_MX', 'es_ES') group by is_deleted;

is_deletedcount(*)
0462021
12329

2 rows in set (1.363 sec)

462021 got cleaned up

for french, civi supports fr_FR, fr_CA, while from civi db, we have
MariaDB [civicrm]> select preferred_language, count(*) FROM civicrm_contact WHERE preferred_language like "fr_%" and preferred_language not in ('fr_FR', 'fr_CA') group by preferred_language;

preferred_languagecount(*)
fr_AD40
fr_AE54
fr_AL19
fr_AM3
fr_AR27
fr_AT452
fr_AU224
fr_BE62979
fr_BG40
fr_BJ23
fr_BR163
fr_CG10
fr_CH1992
fr_CL33
fr_CM31
fr_CN138
fr_CO24
fr_CR14
fr_CV4
fr_CY22
fr_CZ98
fr_DE1138
fr_DK175
fr_DM2
fr_DO16
fr_DZ64
fr_EC7
fr_EE26
fr_EG9
fr_ES2773
fr_ET4
fr_GA16
fr_GB1452
fr_GF140
fr_GP273
fr_GR287
fr_GT2
fr_HK174
fr_HN1
fr_HT11
fr_HU9
fr_IE194
fr_IL172
fr_IN14
fr_IS25
fr_IT1300
fr_JP32
fr_KE4
fr_KH24
fr_KW10
fr_LA6
fr_LT19
fr_LU3821
fr_LV21
fr_MA336
fr_MC124
fr_MF10
fr_MG33
fr_ML24
fr_MN1
fr_MQ254
fr_MR14
fr_MU70
fr_MX111
fr_MY25
fr_NC198
fr_NG7
fr_NI2
fr_NL802
fr_NO193
fr_NZ79
fr_PA11
fr_PE27
fr_PF212
fr_PH29
fr_PL20
fr_PT668
fr_QA27
fr_RE524
fr_RO216
fr_RS23
fr_SE153
fr_SG147
fr_SI24
fr_SK50
fr_SN54
fr_SZ1
fr_TD7
fr_TH105
fr_TN127
fr_TT1
fr_TW30
fr_UA16
fr_US4095
fr_UY14
fr_VE11
fr_VN60
fr_XX26
fr_ZA52

99 rows in set (0.060 sec)

MariaDB [civicrm]> select is_deleted, count(*) FROM civicrm_contact WHERE preferred_language like "fr_%" and preferred_language not in ('fr_FR', 'fr_CA') group by is_deleted;

is_deletedcount(*)
082535
15089

2 rows in set (0.653 sec)

cleaned with option value cleaned

for chinese, civi supports zh_TW, zh_CN, while from civi db, we have
MariaDB [civicrm]> select preferred_language, count(*) FROM civicrm_contact WHERE preferred_language like "zh_%" and preferred_language not in ('zh_TW', 'zh_CN') group by preferred_language;

preferred_languagecount(*)
zh_AT10
zh_AU383
zh_BE13
zh_C28
zh_CA453
zh_CH1
zh_DE16
zh_DK13
zh_EE3
zh_ES126
zh_FR71
zh_GB198
zh_hans15860
zh_hant14530
zh_HK3812
zh_HU3
zh_IE13
zh_IT69
zh_JP444
zh_LU2
zh_MO9
zh_MX3
zh_MY268
zh_NL69
zh_NO12
zh_NZ549
zh_PH40
zh_PL5
zh_PT8
zh_SE30
zh_SG51
zh_TH10
zh_US11457
zh_ZA6

34 rows in set (0.036 sec)
MariaDB [civicrm]> select is_deleted, count(*) FROM civicrm_contact WHERE preferred_language like "zh_%" and preferred_language not in ('zh_TW', 'zh_CN') group by is_deleted;

is_deletedcount(*)
047439
11106

2 rows in set (0.327 sec)
cleaned up

greg triaged this task as Medium priority.Nov 15 2022, 9:02 PM

Change 857083 had a related patch set uploaded (by Wfan; author: Wfan):

[wikimedia/fundraising/crm@master] Filter out is_deleted = 1 for valid contact local options

https://gerrit.wikimedia.org/r/857083

Change 857083 merged by jenkins-bot:

[wikimedia/fundraising/crm@master] Filter out is_deleted = 1 for valid contact local options

https://gerrit.wikimedia.org/r/857083

based on this query: select label, value, count( * ) as count from civicrm_option_value where name like '%_%' and label like '%_%' and is_active = 1 and option_group_id=86 group by value having count > 1 order by count desc;

labelvaluecountvalid language codeCleaned?
Italianit121it_IT (Italy), it_CH (Switzerland), it_SM (San Marino), it_VA (Vatican City)
Germande95de_DE (Germany), de_CH (Switzerland), de_AT (Austria), de_BE(Belgium), de_LI (Liechtenstein) , de_ LU(Luxembourg)
Japaneseja81ja_JP
Swedishsv77sv_SE (Sweden), sv_FI (Finland), sv_AX (Åland Islands)
Russianru76ru_RU, ru_BY (Belarus), ru_KZ(Kazakhstan), ru_KG(Kyrgyzstan)
Dutch (Netherlands)nl67nl_NL, nl_BE(Belgium), nl_SR(Suriname), nl_ZA(South Africa)
Hebrew (modern)he62he_IL
Portuguese (Portugal)pt57pt_PT (Portugal), pt_BR(Brazil) , pt_AO(Angola), pt_GW(Guinea-Bissau), pt_CV(Cabo Verde), pt_GQ (Equatorial Guinea), pt_ST(São Tomé and Príncipe), pt_MZ(Mozambique), pt_TL(Timor-Leste)
Polishpl49pl_PL(Poland), pl_CZ(Czech Republic), pl_HU(Hungary), pl_LT(Lithuania), pl_RO(Romania) , pl_SK(Slovakia), pl_ UA (Ukraine)
Norwegian Bokmålnb39nb_NO
Romanian, Moldavian, Moldovanro37ro_RO
Danishda36da_DK, da_GL (Greenland)
Hungarianhu34hu_HU
Ukrainianuk32uk_UA
Norwegianno26no_NO
Slovaksk25sk_SK
Turkishtr24tr_TR, tr_CY(Cyprus)
Arabic (ar)ar23ar_EG
Latvianlv22lv_LV
Catalan; Valencianca22ca_ES
Czechcs21cs_CZ
Persian (Iran)fa16fa_IR
Sardiniansc14sc_IT (not active)
qq_USqq13???????? (invalid languange, update to en_US)
Vietnamesevi13vi_VN
Bulgarianbg12bg_BG
Koreanko12ko_KR
Thaith10th_TH
Croatianhr8hr_HR
Serbiansr8sr_RS
Esperantoeo8eo_XX
Finnishfi8fi_FI
Greek, Modernel7el_GR
Lithuanianlt7lt_LT
English (United States)en7en_US, en_AU, en_CA, en_GB, en_ZA, en_IN, en_SG
Indonesianid7id_ID
yue_CNyu6???????? (update to zh, and label as zh_HK which is for Cantonese)
Georgianka5ka_GE
Latinla5la_VA
Interlinguaia5ia_XX
Albanian (sq)sq4sq_AL
Basque (eu)eu4eu_ES
Urduur4ur_PK
Bengali (bn)bn4bn_BD
Hindihi3hi_IN
Macedonianmk3mk_MK
Burmesemy3my_MM
Bosnianbs3bs_BA
Spanish; Spaines3es_ES, es_PR, es_MX
Galiciangl3gl_ES
Kurdishku2ku_IQ
Panjabi, Punjabipa2pa_IN
Chinese (China)zh2zh_CN, zh_TW, zh_HK
Telugute2te_IN
Luxembourgish, Letzeburgeschlb2lb_LU
ba_JPba2ba_RU (not active) & ba_JP and ba_PT same as label
Armenian (hy)hy2hy_AM
Tagalogtl2tl_PH
Belarusian (be)be2be_BY
Maltesemt2mt_MT
Scottish Gaelic; Gaelicgd2gd_GB
Marathimr2mr_IN
Sinhala, Sinhalesesi2si_LK
Estonianet2et_EE
Irishga2ga_IE
Welshcy2cy_GB
Nepaline2ne_NP
ve_USve2ve_ZA (not active) & ve_IS and ve_IT same as label
tg_ITtg2tg_TJ (not active) & tg_IT and tg_CA same as label
French (France)fr2fr_FR, fr_CA
Occitan (after 1500)oc2oc_FR, oc_ES(Spain)
Oriyaor2or_IN
Twitw2tw_GH (not active) & tw_TW and tw_CA same as label

73 rows in set (0.008 sec)

XenoRyet set Final Story Points to 8.