Following on from our cleanup of 2-character locale data, we should clean up our 5-character locale data.
This is mostly the same from a technical point of view, i.e. we can add more locales to the job we have already scheduled. There is a small QA task in there, however.
The data I'm referring to is the option values that are not real languages. In theory the language options should map to actual languages the users have accessed. However, historically our code just joined the language string 'en' and the country string 'DK' to get 'en_DK', and since that wasn't in the database it was simply added.
Later we made it so that it would not add new languages but rather come up with a realistic fallback, i.e. 'en_US', since that is the language we actually send emails in.
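To make the difference between the two behaviours concrete, here is a minimal sketch in Python. It is not our actual code: the function names `legacy_locale` and `resolve_locale` and the sample set of known locales are made up for illustration, and it assumes the fallback is always 'en_US' as described above.

```python
# Illustrative only: names and sample data are hypothetical, not taken from our codebase.
FALLBACK_LOCALE = "en_US"


def legacy_locale(language: str, country: str, known_locales: set[str]) -> str:
    """Old behaviour: glue language and country together and register the result."""
    locale = f"{language}_{country}"
    # If the combination was not already an option it was simply added,
    # which is how values like 'en_DK' ended up in the option list.
    known_locales.add(locale)
    return locale


def resolve_locale(language: str, country: str, known_locales: set[str],
                   fallback: str = FALLBACK_LOCALE) -> str:
    """Later behaviour: never add a new option, fall back instead."""
    candidate = f"{language}_{country}"
    return candidate if candidate in known_locales else fallback
```

With a known set of `{"en_US", "da_DK"}`, the legacy function turns ('en', 'DK') into a new 'en_DK' option, while the later one returns 'en_US' and leaves the set untouched.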
However, when we look at our language variants there are two types: 'real ones' and 'made up ones'. Since we currently fall back to 'en_US' for all of them anyway, I don't think we need to be too careful about the 'real ones', but I think we should at least 'legitimise' the obvious real ones, such as 'en_NZ' "English (New Zealand)" and 'en_IN' "English (India)", which we know to be official languages of the respective countries without too much research. I added an upstream GitLab issue for this too: https://lab.civicrm.org/dev/core/-/issues/3928
For the made up ones, we should fix the contact languages, using the process control method we used for the two letter ones, and then remove the option.
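As a rough sketch of what that cleanup amounts to (in Python, not the scheduled job itself): split the option values into real and made up, work out which contacts need their language changed, and list the options that survive. The function name `plan_cleanup`, the data shapes, and the choice of 'en_US' as the replacement language are all assumptions for illustration; the real job would go through the process control method mentioned above.

```python
# Illustrative only: plan_cleanup and its data shapes are hypothetical.
def plan_cleanup(contact_languages: dict[int, str],
                 option_values: list[str],
                 real_locales: set[str],
                 fallback: str = "en_US") -> tuple[dict[int, str], list[str]]:
    """Return (contact updates, option values to keep).

    contact_languages maps contact id -> current preferred language.
    real_locales is the reviewed list of locales we consider legitimate.
    """
    # Anything in the option list that is not on the reviewed list is 'made up'.
    made_up = {loc for loc in option_values if loc not in real_locales}

    # Contacts on a made-up locale get moved to the fallback (assumed en_US here).
    updates = {cid: fallback
               for cid, loc in contact_languages.items()
               if loc in made_up}

    # Options are only dropped once no contact would be left pointing at them.
    remaining = [loc for loc in option_values if loc not in made_up]
    return updates, remaining
```

The QA part is then mostly about agreeing on the contents of `real_locales` before anything is removed.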
