Page MenuHomePhabricator

Clean up our two-character locale data
Closed, ResolvedPublic

Description

When digging into T320821 we realised that over 500k of records have the invalid 'en' locale set. These records will throw an exception if they hit this code:

public static function negotiate(string $preferred): Locale {
  // Create a locale for the requested language
  if (!preg_match(';^[a-z][a-z]_[A-Z][A-Z]$;', $preferred)) {
    throw new \RuntimeException("Cannot instantiate malformed locale: $preferred");
  }

located here https://github.com/wikimedia/wikimedia-fundraising-crm/blob/3af6776ca72012cde04becf8085b97c409b58329/drupal/sites/all/modules/civicrm/Civi/Core/Locale.php#L175

Event Timeline

We should probably also check for any other locales which don't meet the regex requirement in the fn conditional

select preferred_language, count(*) FROM civicrm_contact WHERE LENGTH(preferred_language) < 4 group by preferred_language;

preferred_languagecount(*)
en597032
fr501
nl273

today the data updated to

preferred_languagecount(*)
en597014
fr501
nl273

select preferred_language, count(*) FROM civicrm_contact WHERE LENGTH(preferred_language) < 4 group by preferred_language;

preferred_languagecount(*)
en597032
fr501
nl273

Change 844047 had a related patch set uploaded (by Wfan; author: Wfan):

[wikimedia/fundraising/crm@master] Clean up historical two-character locale data to default one

https://gerrit.wikimedia.org/r/844047

@Wfan I looked at the patch & the query looks correct but there are two reasons I think we should schedule an api call instead

  1. the en query updates 500k rows at once - we don't know how long that will take / lock up the db for. We could test on staging or batch it but there is a second reason
  1. There is a phab https://phabricator.wikimedia.org/T308181 dealing with blank greeting fields. In testing I have determined that ANY update using the api will result in the greeting field being populated - so if we update these using the api we get a head start on that backfill too.

I have been testing this directly on the command line (5k rows at once seems not-insane but can be tweaked)

echo '{"values":{"preferred_language":"en_US"},"where":[["preferred_language","=","en"]],"limit":5000, "version":4}' | drush @wmff cvapi Contact.update --in=json

It is a bit painful to schedule the above in the yml but here is how major_gifts_addresses.yaml looks:

sh -c "echo '{\"group\":268,\"addressee\":{\"IS NULL\":1},\"contact_type\":\"Individual\",\"options\":{\"limit\":300},\"api.contact.create\":{\"addressee_id\":1,\"email_greeting_id\":1,\"postal_greeting_id\":1}}'| /usr/local/bin/drush @wmff -v --in=json cvapi Contact.get"

(side note that job that I just found in writing this ^^ is exactly what I was looking for).

What do you think?

Once it is complete we would need to unschedule the job (or start cleaning up a different wrong string) and delete the option value for 'en'

UPDATE - I just ran

echo '{"values":{"preferred_language":"fr_FR"},"where":[["preferred_language","=","fr"]],"limit":10000, "version":4}' | drush @wmff cvapi Contact.update --in=json

& they all cleaned up - so I will delete that option value

OK - I cleaned out nl too - so it is just en

Sounds reasonable and yes I agree that deadlock is a concern that I forget to worry and the second reason is also fair enough, we can wait until that job done, what schedule do you feel right for 5000 of those contacts to update without deadlock, and once we delete the en from option value, I will then abandon this patch, and close this ticket thanks @Eileenmcnaughton again for cleaning both fr and nl already~

wfan@frpm1001:~/localsettings$ cat process-control/civi1001/update_language.yaml

name: Update the language to be legit - eg en_US
description: API update to invalid languages, note only en & en_IT but syntax updatable to more
#schedule: "*/3 * * * *"
command:

sh -c "echo '{\"values\":{\"preferred_language\":\"en_US\"},\"where\":[[\"preferred_language\",\"IN\",[\"en\", \"en_IT\"]]],\"limit\":5000, \"version\":4}' | /usr/local/bin/drush @wmff cvapi Contact.update --in=json"

timeout: 5
allow_overtime: true

could be the job similar with the major_gifts_address.yaml

@Wfan I looked at the patch & the query looks correct but there are two reasons I think we should schedule an api call instead

  1. the en query updates 500k rows at once - we don't know how long that will take / lock up the db for. We could test on staging or batch it but there is a second reason
  1. There is a phab https://phabricator.wikimedia.org/T308181 dealing with blank greeting fields. In testing I have determined that ANY update using the api will result in the greeting field being populated - so if we update these using the api we get a head start on that backfill too.

I have been testing this directly on the command line (5k rows at once seems not-insane but can be tweaked)

echo '{"values":{"preferred_language":"en_US"},"where":[["preferred_language","=","en"]],"limit":5000, "version":4}' | drush @wmff cvapi Contact.update --in=json

It is a bit painful to schedule the above in the yml but here is how major_gifts_addresses.yaml looks:

sh -c "echo '{\"group\":268,\"addressee\":{\"IS NULL\":1},\"contact_type\":\"Individual\",\"options\":{\"limit\":300},\"api.contact.create\":{\"addressee_id\":1,\"email_greeting_id\":1,\"postal_greeting_id\":1}}'| /usr/local/bin/drush @wmff -v --in=json cvapi Contact.get"

(side note that job that I just found in writing this ^^ is exactly what I was looking for).

What do you think?

Once it is complete we would need to unschedule the job (or start cleaning up a different wrong string) and delete the option value for 'en'

OK - I cleaned out nl too - so it is just en

Change 844047 abandoned by Wfan:

[wikimedia/fundraising/crm@master] Clean up historical two-character locale data to default one

Reason:

done the same thing with cron job

https://gerrit.wikimedia.org/r/844047

after job run, we have below remaining while they all belongs deleted job
select preferred_language, count(*) FROM civicrm_contact WHERE LENGTH(preferred_language) < 4 group by preferred_language;

preferred_languagecount(*)
en10822
fr46
nl12

SELECT is_deleted, count(*) FROM civicrm_contact where LENGTH(preferred_language) < 4;

is_deletedcount(*)
110880

So it's ok to close the ticket and the remaining will be removed once we turn back the delete_deleted_job on.

XenoRyet set Final Story Points to 2.