Clean up our five character locale data....
Closed, Resolved (Public)

Description

Following on from our clean up of 2 character locale data we should clean up our 5 character locale data.

This is mostly the same from a technical point of view, i.e. we can add more locales to the job we have already scheduled. There is a small QA task in there, however.

The data I'm referring to is the option values that are not real languages - in theory the languages should map to actual languages the users have accessed. However, historically our code just added together the language string 'en' and the country string 'DK' to get 'en_DK' - since that wasn't in the database it was just added.

Later we made it so that it would not add new languages but rather come up with a realistic fallback, i.e. 'en_US', since that is the language we actually send emails in.
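The two behaviours described above can be sketched roughly as follows. This is an illustrative sketch only, not the actual CRM code: the function names, the `VALID_LOCALES` subset and the `FALLBACK_BY_LANGUAGE` mapping are all hypothetical, and only the 'en' + 'DK' example and the 'en_US' fallback come from the description.

```python
# Hypothetical names throughout; this is a sketch, not the real CRM code.
# Illustrative subset of locales that exist as real options.
VALID_LOCALES = {"en_US", "en_GB"}

# Assumed per-language fallback; the description only mentions 'en_US'.
FALLBACK_BY_LANGUAGE = {"en": "en_US"}


def naive_locale(language, country):
    """Historical behaviour: glue language and country together."""
    return f"{language}_{country}"


def resolve_locale(language, country,
                   valid=VALID_LOCALES, fallbacks=FALLBACK_BY_LANGUAGE):
    """Later behaviour: keep a known locale, otherwise use a fallback.

    Returns None when the language has no configured fallback.
    """
    candidate = naive_locale(language, country)
    if candidate in valid:
        return candidate
    return fallbacks.get(language)


if __name__ == "__main__":
    print(naive_locale("en", "DK"))    # the made-up value that used to be stored
    print(resolve_locale("en", "DK"))  # falls back instead of inventing a locale
```

The point of the second function is that a made-up combination never reaches the option list; it is replaced before it is saved.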

However, when we look at our language variants there are two types: 'real ones' and 'made up ones'. Since we currently fall back to 'en_US' for all of them anyway, I don't think we need to be too careful about the 'real ones', but I think we should at least 'legitimise' obvious real ones, such as 'en_NZ' "English (New Zealand)" and 'en_IN' "English (India)", which we know to be official languages of the respective countries without too much research. I added an upstream gitlab issue for this too: https://lab.civicrm.org/dev/core/-/issues/3928

For the made up ones, we should fix the contact languages, using the process control method we used for the two-letter ones, and remove the option.

Details

	Subject	Repo	Branch	Lines +/-
	Filter out is_deleted = 1 for valid contact local options	wikimedia/fundraising/crm	master	+2 -1


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		AnnWF	T320829 Clean up our two-character locale data
		Resolved		AnnWF	T321251 Clean up our five character locale data....

Event Timeline

Eileenmcnaughton created this task. Oct 20 2022, 12:24 AM

Eileenmcnaughton updated the task description.

XenoRyet moved this task from Triage to DRI Backlog on the Fundraising-Backlog board. Oct 24 2022, 7:42 PM

AnnWF added a project: Fundraising Tech - Chaos Crew. Oct 25 2022, 4:29 PM

AnnWF moved this task from Backlog to Documentation and Maintenence on the Fundraising Tech - Chaos Crew board.

AnnWF moved this task from Documentation and Maintenence to In Progress on the Fundraising Tech - Chaos Crew board.

AnnWF edited projects, added Fundraising Sprint Turtles that are robotic that destroy the whole world with their foot theory; removed Fundraising-Backlog. Oct 26 2022, 5:43 AM

AnnWF moved this task from Backlog to Doing on the Fundraising Sprint Turtles that are robotic that destroy the whole world with their foot theory board.

As of Oct 26, 2022, Civi English emails are only valid for contacts whose preferred language is en_US / en_AU / en_CA / en_GB / en_ZA / en_NZ / en_IN (the last two should be made official), while the db holds the values below. These were also added as options in Civi, so we need to clean up the contacts and then remove the options.
MariaDB [civicrm]> select preferred_language, count(*) FROM civicrm_contact WHERE preferred_language like "en_%" and preferred_language not in ('en_US', 'en_AU', 'en_ZA', 'en_GB', 'en_CA', 'en_NZ', 'en_IN') group by preferred_language;

preferred_language	count(*)
en_AD	83
en_AE	2
en_AF	67
en_AG	65
en_AI	19
en_AL	187
en_AM	192
en_AO	45
en_AR	19322
en_AS	9
en_AT	47242
en_AW	65
en_AX	13
en_AZ	158
en_BA	249
en_BB	164
en_BD	117
en_BE	84506
en_BF	19
en_BG	2475
en_BH	326
en_BI	7
en_BL	1
en_BM	327
en_BN	95
en_BO	84
en_BQ	23
en_BR	84713
en_BS	261
en_BT	12
en_BW	138
en_BZ	80
en_CF	3
en_CG	6
en_CH	4419
en_CK	34
en_CL	13813
en_CM	34
en_CN	15178
en_CO	10201
en_CR	923
en_CW	41
en_CY	889
en_CZ	3845
en_DE	22393
en_DJ	10
en_DK	40813
en_DM	44
en_DO	476
en_DZ	54
en_EC	460
en_EE	3477
en_EG	523
en_ES	75582
en_ET	50
en_FJ	96
en_FK	11
en_FM	10
en_FO	71
en_FR	235595
en_GA	15
en_GD	49
en_GE	322
en_GF	25
en_GG	55
en_GH	93
en_GI	128
en_GL	32
en_GM	12
en_GN	9
en_GP	51
en_GQ	3
en_GR	3283
en_GT	352
en_GU	61
en_GY	39
en_HK	18563
en_HN	207
en_HR	1542
en_HT	41
en_HU	23131
en_IE	249184
en_IL	76987
en_IM	89
en_IN	1605717
en_IS	1001
en_IT	26980
en_JE	76
en_JM	568
en_JO	309
en_JP	49541
en_KE	76
en_KG	39
en_KH	252
en_KN	43
en_KW	805
en_KY	228
en_KZ	316
en_LA	44
en_LC	49
en_LI	29
en_LK	457
en_LR	15
en_LS	31
en_LT	190
en_LU	7088
en_LV	8059
en_MA	279
en_MC	212
en_MD	258
en_ME	88
en_MF	11
en_MG	30
en_MH	8
en_MK	142
en_ML	12
en_MN	75
en_MO	77
en_MP	8
en_MQ	52
en_MR	5
en_MT	793
en_MU	182
en_MV	47
en_MW	38
en_MX	56536
en_MY	61319
en_MZ	85
en_NA	149
en_NC	57
en_NG	870
en_NI	116
en_NL	336839
en_NO	21439
en_NP	85
en_NZ	255635
en_OM	320
en_PA	626
en_PE	5863
en_PF	42
en_PG	83
en_PH	2739
en_PK	913
en_PL	37656
en_PR	584
en_PS	28
en_PT	37547
en_PW	15
en_PY	37
en_QA	864
en_RE	90
en_RO	32879
en_RS	926
en_RW	37
en_SB	11
en_SC	24
en_SE	158190
en_SG	25644
en_SI	1398
en_SK	2464
en_SL	25
en_SN	34
en_SR	16
en_SV	187
en_SX	25
en_SZ	17
en_TC	64
en_TD	6
en_TH	1962
en_TJ	10
en_TL	9
en_TM	8
en_TN	115
en_TO	12
en_TT	876
en_TW	2138
en_TZ	104
en_UA	9928
en_UG	105
en_UY	3519
en_UZ	22
en_VC	23
en_VE	363
en_VG	35
en_VI	74
en_VN	1546
en_VU	21
en_WS	8
en_XX	246
en_ZM	84

190 rows in set (2.117 sec)

MariaDB [civicrm]> select is_deleted, count(*) FROM civicrm_contact WHERE preferred_language like "en_%" and preferred_language not in ('en_US', 'en_AU', 'en_ZA', 'en_GB', 'en_CA') group by is_deleted;

is_deleted	count(*)
0	3291583
1	520646

2 rows in set (11.325 sec)
To start with this one, we should clean up the 3291583 contacts and then clean the option table civicrm_option_value where value = 'en',
using the drush command: sh -c "echo '{\"values\":{\"preferred_language\":\"en_US\"},\"where\":[[\"preferred_language\",\"NOT IN\",[\"en_US\", \"en_CA\",\"en_ZA\", \"en_AU\", \"en_GB\", \"en_NZ\", \"en_IN\"]], [\"preferred_language\",\"LIKE\",\"en_%\"]],\"limit\":5000, \"version\":4}' | drush @wmff cvapi Contact.update --in=json"

Then run delete from civicrm_option_value where value = 'en' and name not in ('en_US', 'en_AU', 'en_ZA', 'en_GB', 'en_CA','en_NZ', 'en_IN'); to get rid of the invalid en options from the Civi dropdown:

echo '{"where":[["name","NOT IN",["en_GB","en_US","en_CA","en_AU","en_ZA","en_NZ","en_IN"]],["value","=","en"]]}' | cv api4 OptionValue.delete --in=json;
Actually, use https://gerrit.wikimedia.org/r/c/wikimedia/fundraising/crm/+/856688/ (drush @wmff cvapi WMFDataManagement.CleanInvalidLanguageOptions version=4) to clean the unused language options.
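The hand-escaped JSON passed to Contact.update above is easy to get wrong, so one option is to generate it. The following is a minimal sketch under stated assumptions: `build_contact_update` and `runs_needed` are hypothetical helper names, the payload shape is copied from the command in this comment, and the run count assumes each invocation updates at most `limit` contacts.

```python
import json
import math


def build_contact_update(prefix, valid_locales, fallback, limit=5000):
    """Build the JSON body for the Contact.update call quoted above.

    Hypothetical helper: the keys mirror the payload in the comment.
    """
    payload = {
        "values": {"preferred_language": fallback},
        "where": [
            ["preferred_language", "NOT IN", list(valid_locales)],
            ["preferred_language", "LIKE", f"{prefix}_%"],
        ],
        "limit": limit,
        "version": 4,
    }
    return json.dumps(payload)


def runs_needed(total_contacts, limit=5000):
    """Number of batched runs, assuming each run handles at most `limit`."""
    return math.ceil(total_contacts / limit)


if __name__ == "__main__":
    valid = ["en_US", "en_CA", "en_ZA", "en_AU", "en_GB", "en_NZ", "en_IN"]
    # Pipe this output into: drush @wmff cvapi Contact.update --in=json
    print(build_contact_update("en", valid, "en_US"))
    print(runs_needed(3291583))
```

With the 3291583 figure quoted in this comment and a limit of 5000, that works out to 659 runs, which is why this belongs in a scheduled job rather than a one-off command.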

Still seeing some new contacts being created with the invalid en_xx (ids 30851434, 57584905 and 750707) from online donations, but switching to the other languages first.

For Spanish, Civi supports es_ES, es_MX, and es_PR, while in the Civi db we have:
MariaDB [civicrm]> select preferred_language, count(*) FROM civicrm_contact WHERE preferred_language like "es_%" and preferred_language not in ('es_PR', 'es_MX', 'es_ES') group by preferred_language;

preferred_language	count(*)
es_419	263
es_AD	107
es_AE	20
es_AF	8
es_AO	5
es_AR	176097
es_AT	334
es_AU	168
es_AW	1
es_AZ	1
es_BE	756
es_BF	1
es_BG	13
es_BJ	1
es_BM	2
es_BO	88
es_BR	296
es_BT	2
es_CA	467
es_CH	80
es_CK	1
es_CL	93076
es_CN	47
es_CO	72733
es_CR	1642
es_CV	3
es_CY	17
es_CZ	52
es_DE	470
es_DK	168
es_DO	1168
es_DZ	2
es_EC	1014
es_EE	10
es_EG	2
es_ET	1
es_FR	446
es_GB	854
es_GE	2
es_GH	4
es_GI	2
es_GR	26
es_GT	392
es_GW	1
es_GY	1
es_HK	37
es_HN	217
es_HR	5
es_HT	3
es_HU	47
es_IE	117
es_IL	119
es_IN	8
es_IS	3
es_IT	713
es_JO	2
es_JP	94
es_KE	3
es_KG	1
es_KH	5
es_KW	6
es_KZ	3
es_LA	2
es_LT	5
es_LU	99
es_LV	10
es_MA	8
es_MD	2
es_ME	1
es_MT	8
es_MV	1
es_MW	2
es_MY	12
es_MZ	2
es_NA	1
es_NE	1
es_NI	250
es_NL	359
es_NO	170
es_NZ	46
es_OM	3
es_PA	361
es_PE	67078
es_PF	1
es_PH	8
es_PK	1
es_PL	81
es_PT	266
es_PY	191
es_QA	15
es_RO	46
es_RS	3
es_RW	1
es_SC	1
es_SE	177
es_SG	20
es_SI	2
es_SK	17
es_SN	1
es_SV	245
es_TH	23
es_TN	5
es_TT	4
es_TW	9
es_UA	15
es_US	23033
es_UY	17388
es_VE	1985
es_VG	1
es_VN	7
es_XX	144
es_ZA	12

112 rows in set (0.298 sec)
MariaDB [civicrm]> select is_deleted, count(*) FROM civicrm_contact WHERE preferred_language like "es_%" and preferred_language not in ('es_PR', 'es_MX', 'es_ES') group by is_deleted;

is_deleted	count(*)
0	462021
1	2329

2 rows in set (1.363 sec)

462021 got cleaned up

For French, Civi supports fr_FR and fr_CA, while in the Civi db we have:
MariaDB [civicrm]> select preferred_language, count(*) FROM civicrm_contact WHERE preferred_language like "fr_%" and preferred_language not in ('fr_FR', 'fr_CA') group by preferred_language;

preferred_language	count(*)
fr_AD	40
fr_AE	54
fr_AL	19
fr_AM	3
fr_AR	27
fr_AT	452
fr_AU	224
fr_BE	62979
fr_BG	40
fr_BJ	23
fr_BR	163
fr_CG	10
fr_CH	1992
fr_CL	33
fr_CM	31
fr_CN	138
fr_CO	24
fr_CR	14
fr_CV	4
fr_CY	22
fr_CZ	98
fr_DE	1138
fr_DK	175
fr_DM	2
fr_DO	16
fr_DZ	64
fr_EC	7
fr_EE	26
fr_EG	9
fr_ES	2773
fr_ET	4
fr_GA	16
fr_GB	1452
fr_GF	140
fr_GP	273
fr_GR	287
fr_GT	2
fr_HK	174
fr_HN	1
fr_HT	11
fr_HU	9
fr_IE	194
fr_IL	172
fr_IN	14
fr_IS	25
fr_IT	1300
fr_JP	32
fr_KE	4
fr_KH	24
fr_KW	10
fr_LA	6
fr_LT	19
fr_LU	3821
fr_LV	21
fr_MA	336
fr_MC	124
fr_MF	10
fr_MG	33
fr_ML	24
fr_MN	1
fr_MQ	254
fr_MR	14
fr_MU	70
fr_MX	111
fr_MY	25
fr_NC	198
fr_NG	7
fr_NI	2
fr_NL	802
fr_NO	193
fr_NZ	79
fr_PA	11
fr_PE	27
fr_PF	212
fr_PH	29
fr_PL	20
fr_PT	668
fr_QA	27
fr_RE	524
fr_RO	216
fr_RS	23
fr_SE	153
fr_SG	147
fr_SI	24
fr_SK	50
fr_SN	54
fr_SZ	1
fr_TD	7
fr_TH	105
fr_TN	127
fr_TT	1
fr_TW	30
fr_UA	16
fr_US	4095
fr_UY	14
fr_VE	11
fr_VN	60
fr_XX	26
fr_ZA	52

99 rows in set (0.060 sec)

MariaDB [civicrm]> select is_deleted, count(*) FROM civicrm_contact WHERE preferred_language like "fr_%" and preferred_language not in ('fr_FR', 'fr_CA') group by is_deleted;

is_deleted	count(*)
0	82535
1	5089

2 rows in set (0.653 sec)

Cleaned up, with the option values cleaned as well.

For Chinese, Civi supports zh_TW and zh_CN, while in the Civi db we have:
MariaDB [civicrm]> select preferred_language, count(*) FROM civicrm_contact WHERE preferred_language like "zh_%" and preferred_language not in ('zh_TW', 'zh_CN') group by preferred_language;

preferred_language	count(*)
zh_AT	10
zh_AU	383
zh_BE	13
zh_C2	8
zh_CA	453
zh_CH	1
zh_DE	16
zh_DK	13
zh_EE	3
zh_ES	126
zh_FR	71
zh_GB	198
zh_hans	15860
zh_hant	14530
zh_HK	3812
zh_HU	3
zh_IE	13
zh_IT	69
zh_JP	444
zh_LU	2
zh_MO	9
zh_MX	3
zh_MY	268
zh_NL	69
zh_NO	12
zh_NZ	549
zh_PH	40
zh_PL	5
zh_PT	8
zh_SE	30
zh_SG	51
zh_TH	10
zh_US	11457
zh_ZA	6

34 rows in set (0.036 sec)
MariaDB [civicrm]> select is_deleted, count(*) FROM civicrm_contact WHERE preferred_language like "zh_%" and preferred_language not in ('zh_TW', 'zh_CN') group by is_deleted;

is_deleted	count(*)
0	47439
1	1106

2 rows in set (0.327 sec)
cleaned up

greg added a project: Fundraising-Backlog. Nov 9 2022, 8:49 PM

greg moved this task from DRI Backlog to Current Sprint on the Fundraising-Backlog board.

greg added a project: Fundraising Sprint Undefined.

greg moved this task from Backlog to Doing on the Fundraising Sprint Undefined board.

jgleeson moved this task from In Progress to Backlog on the Fundraising Tech - Chaos Crew board. Nov 10 2022, 5:55 PM

AnnWF mentioned this in T323067: Still seeing new invalid made up locale saved to contact preferred_language. Nov 14 2022, 8:10 PM

AnnWF moved this task from Doing to Blocked in sprint (not fr-tech) on the Fundraising Sprint Undefined board.

AnnWF moved this task from Blocked in sprint (not fr-tech) to Doing on the Fundraising Sprint Undefined board.

greg triaged this task as Medium priority. Nov 15 2022, 9:02 PM

Change 857083 had a related patch set uploaded (by Wfan; author: Wfan):

[wikimedia/fundraising/crm@master] Filter out is_deleted = 1 for valid contact local options

https://gerrit.wikimedia.org/r/857083

gerritbot added a project: Patch-For-Review. Nov 15 2022, 10:51 PM

Change 857083 merged by jenkins-bot:

[wikimedia/fundraising/crm@master] Filter out is_deleted = 1 for valid contact local options

https://gerrit.wikimedia.org/r/857083

Maintenance_bot removed a project: Patch-For-Review. Nov 15 2022, 11:30 PM

greg removed a project: Fundraising Tech - Chaos Crew. Nov 18 2022, 12:35 AM

Based on this query: select label, value, count( * ) as count from civicrm_option_value where name like '%_%' and label like '%_%' and is_active = 1 and option_group_id=86 group by value having count > 1 order by count desc;

label	value	count	valid language code	Cleaned?
Italian	it	121	it_IT (Italy), it_CH (Switzerland), it_SM (San Marino), it_VA (Vatican City)	✓
German	de	95	de_DE (Germany), de_CH (Switzerland), de_AT (Austria), de_BE (Belgium), de_LI (Liechtenstein), de_LU (Luxembourg)	✓
Japanese	ja	81	ja_JP	✓
Swedish	sv	77	sv_SE (Sweden), sv_FI (Finland), sv_AX (Åland Islands)	✓
Russian	ru	76	ru_RU, ru_BY (Belarus), ru_KZ(Kazakhstan), ru_KG(Kyrgyzstan)	✓
Dutch (Netherlands)	nl	67	nl_NL, nl_BE(Belgium), nl_SR(Suriname), nl_ZA(South Africa)	✓
Hebrew (modern)	he	62	he_IL	✓
Portuguese (Portugal)	pt	57	pt_PT (Portugal), pt_BR(Brazil) , pt_AO(Angola), pt_GW(Guinea-Bissau), pt_CV(Cabo Verde), pt_GQ (Equatorial Guinea), pt_ST(São Tomé and Príncipe), pt_MZ(Mozambique), pt_TL(Timor-Leste)	✓
Polish	pl	49	pl_PL (Poland), pl_CZ (Czech Republic), pl_HU (Hungary), pl_LT (Lithuania), pl_RO (Romania), pl_SK (Slovakia), pl_UA (Ukraine)	✓
Norwegian Bokmål	nb	39	nb_NO	✓
Romanian, Moldavian, Moldovan	ro	37	ro_RO	✓
Danish	da	36	da_DK, da_GL (Greenland)	✓
Hungarian	hu	34	hu_HU	✓
Ukrainian	uk	32	uk_UA	✓
Norwegian	no	26	no_NO	✓
Slovak	sk	25	sk_SK	✓
Turkish	tr	24	tr_TR, tr_CY(Cyprus)	✓
Arabic (ar)	ar	23	ar_EG	✓
Latvian	lv	22	lv_LV	✓
Catalan; Valencian	ca	22	ca_ES	✓
Czech	cs	21	cs_CZ	✓
Persian (Iran)	fa	16	fa_IR	✓
Sardinian	sc	14	sc_IT (not active)	✓
qq_US	qq	13	???????? (invalid language, update to en_US)	✓
Vietnamese	vi	13	vi_VN	✓
Bulgarian	bg	12	bg_BG	✓
Korean	ko	12	ko_KR	✓
Thai	th	10	th_TH	✓
Croatian	hr	8	hr_HR	✓
Serbian	sr	8	sr_RS	✓
Esperanto	eo	8	eo_XX	✓
Finnish	fi	8	fi_FI	✓
Greek, Modern	el	7	el_GR	✓
Lithuanian	lt	7	lt_LT	✓
English (United States)	en	7	en_US, en_AU, en_CA, en_GB, en_ZA, en_IN, en_SG	✓
Indonesian	id	7	id_ID	✓
yue_CN	yu	6	???????? (update to zh, and label as zh_HK which is for Cantonese)	✓
Georgian	ka	5	ka_GE	✓
Latin	la	5	la_VA	✓
Interlingua	ia	5	ia_XX	✓
Albanian (sq)	sq	4	sq_AL	✓
Basque (eu)	eu	4	eu_ES	✓
Urdu	ur	4	ur_PK	✓
Bengali (bn)	bn	4	bn_BD	✓
Hindi	hi	3	hi_IN	✓
Macedonian	mk	3	mk_MK	✓
Burmese	my	3	my_MM	✓
Bosnian	bs	3	bs_BA	✓
Spanish; Spain	es	3	es_ES, es_PR, es_MX	✓
Galician	gl	3	gl_ES	✓
Kurdish	ku	2	ku_IQ	✓
Panjabi, Punjabi	pa	2	pa_IN	✓
Chinese (China)	zh	2	zh_CN, zh_TW, zh_HK	✓
Telugu	te	2	te_IN	✓
Luxembourgish, Letzeburgesch	lb	2	lb_LU	✓
ba_JP	ba	2	ba_RU (not active) & ba_JP and ba_PT same as label	✓
Armenian (hy)	hy	2	hy_AM	✓
Tagalog	tl	2	tl_PH	✓
Belarusian (be)	be	2	be_BY	✓
Maltese	mt	2	mt_MT	✓
Scottish Gaelic; Gaelic	gd	2	gd_GB	✓
Marathi	mr	2	mr_IN	✓
Sinhala, Sinhalese	si	2	si_LK	✓
Estonian	et	2	et_EE	✓
Irish	ga	2	ga_IE	✓
Welsh	cy	2	cy_GB	✓
Nepali	ne	2	ne_NP	✓
ve_US	ve	2	ve_ZA (not active) & ve_IS and ve_IT same as label	✓
tg_IT	tg	2	tg_TJ (not active) & tg_IT and tg_CA same as label	✓
French (France)	fr	2	fr_FR, fr_CA	✓
Occitan (after 1500)	oc	2	oc_FR, oc_ES(Spain)	✓
Oriya	or	2	or_IN	✓
Twi	tw	2	tw_GH (not active) & tw_TW and tw_CA same as label	✓

73 rows in set (0.008 sec)
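One thing worth keeping in mind when reusing the query above: in SQL LIKE patterns `_` matches any single character, so '%_%' matches any non-empty string rather than only names containing an underscore. The sketch below illustrates this with an in-memory SQLite table as a stand-in for MariaDB; the table name, helper name and sample rows are hypothetical, and the ESCAPE clause shown is the SQLite spelling.

```python
import sqlite3


def names_matching(where_sql, names):
    """Return the sample names that satisfy the given WHERE fragment.

    Hypothetical helper using SQLite in memory, not the real civicrm db.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE option_value (name TEXT)")
    conn.executemany(
        "INSERT INTO option_value (name) VALUES (?)",
        [(n,) for n in names],
    )
    rows = conn.execute(
        f"SELECT name FROM option_value WHERE {where_sql} ORDER BY name"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


SAMPLE = ["en", "en_US", "fr_CA"]

# Unescaped: '_' is a single-character wildcard.
UNESCAPED = "name LIKE '%_%'"

# Escaped: only names containing a literal underscore match.
ESCAPED = "name LIKE '%\\_%' ESCAPE '\\'"

if __name__ == "__main__":
    print(names_matching(UNESCAPED, SAMPLE))
    print(names_matching(ESCAPED, SAMPLE))
```

In this sample the unescaped pattern also returns the two-letter 'en', while the escaped one returns only the five-character names.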

XenoRyet added a project: Fundraising Sprint Vwl Cnsrvtn. Nov 22 2022, 8:58 PM

AnnWF moved this task from Backlog to Doing on the Fundraising Sprint Vwl Cnsrvtn board. Nov 22 2022, 9:58 PM

AnnWF moved this task from Doing to Done on the Fundraising Sprint Vwl Cnsrvtn board. Nov 24 2022, 2:02 AM

XenoRyet closed this task as Resolved. Dec 6 2022, 8:06 PM

XenoRyet set Final Story Points to 8.
