
Investigation: Update country field in event registration configuration to be a structured list
Closed, Resolved · Public

Description

User stories:

As an organizer of an in person or hybrid event, I want to be able to indicate the country where it will be taking place, so people who live in or plan to visit the country may more easily find the event and potentially join it.

As a contributor, I want to know which in person or hybrid events are occurring in my country, so that I can find events that I am more likely to be able to attend in person.

As a member of the Campaigns team, I want to know the country of in person and hybrid events, so that we can determine which events are taking place in countries with higher sensitivities under the Country & Territory Protection List.

Background:

We have learned from survey findings that many people would like to have a search filter in the Collaboration List to search for events by country. This is even more important now that we allow transclusion of the Collaboration List, since the inclusion of a country filter could allow people to create event calendars for a specific country.

Meanwhile, for our upcoming project to track collaborative contributions (T378035), we plan to offer special protective measures for in person or hybrid events that take place in countries that are Higher Risk and Not Published in the Country & Territory Protection List. For these reasons, we would like to give organizers of in person & hybrid events a way to indicate the country through a structured list (rather than the current plaintext field). We will use the full canonical country list.

Ordering:

How do we structure and sort the country list? For example, should it be completely alphabetized? Should we have headers by continent?

Limitations:
  • Organizers should only be able to pick 1 country. We may allow more than 1 later, but that is beyond the scope of the MVP

Acceptance Criteria:
  • Determine how we can update the country field in event registration configuration to be a structured list, and share findings and recommendations/next steps with the team
  • The countries/regions below should be removed from the list of countries returned by CLDR:
['CP' => 'Clipperton Island'],
['CQ' => 'Sark'],
['DG' => 'Diego Garcia'],
['EA' => 'Ceuta & Melilla'],
['EU' => 'European Union'],
['EZ' => 'Eurozone'],
['IC' => 'Canary Islands'],
['QO' => 'Outlying Oceania'],
['TA' => 'Tristan da Cunha'],
['UN' => 'UN'],
['XA' => 'Pseudo-Accents'],
['XB' => 'Pseudo-Bidi']

Event Timeline

We can use the CLDR extension (https://www.mediawiki.org/wiki/Extension:CLDR#Country_names) to get the list of countries. It supports internationalization: we can pass a language code to get the names in any language. Its output looks like this:

Array
(
    [aa] => Afar
    [aae] => Arbëresh
    [ab] => Abkhazian
    [abs] => Ambonese Malay
    [ace] => Acehnese
    [acf] => Saint Lucian Creole
    [acm] => Iraqi Arabic
     ......

It does not return the region or continent of each country; the output is a flat list of code/name pairs, as in the example above.

The list has 263 items, 12 more than the canonical list. All 251 entries in the canonical list are also in the CLDR list; the 12 extra entries are:

  • ['CP' => 'Clipperton Island'],
  • ['CQ' => 'Sark'],
  • ['DG' => 'Diego Garcia'],
  • ['EA' => 'Ceuta & Melilla'],
  • ['EU' => 'European Union'],
  • ['EZ' => 'Eurozone'],
  • ['IC' => 'Canary Islands'],
  • ['QO' => 'Outlying Oceania'],
  • ['TA' => 'Tristan da Cunha'],
  • ['UN' => 'UN'],
  • ['XA' => 'Pseudo-Accents'],
  • ['XB' => 'Pseudo-Bidi']

cc: @ifried

About the tech details on how to implement it, we will need to:

1 - Store the country abbreviation, e.g. (aae for Arbëresh). The DB column is already a string, so we can keep it as it is or make it shorter. I would leave it as it is, since we have events with country data stored as strings.
2 - Use the CLDR extension to get the list of countries.
3 - Create a class to remove the 12 unneeded countries from the list.
4 - Change the front end to use a dropdown list for choosing the country.
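
As a rough illustration of step 3, here is a minimal sketch in Python (the real code would live in the extension, presumably in PHP). The name `filter_countries` and the sample input are hypothetical; only the 12 excluded codes come from this task.

```python
# The 12 codes listed in this task as present in CLDR but absent from the
# canonical country list.
EXCLUDED_CODES = frozenset(
    {"CP", "CQ", "DG", "EA", "EU", "EZ", "IC", "QO", "TA", "UN", "XA", "XB"}
)


def filter_countries(cldr_names: dict[str, str]) -> dict[str, str]:
    """Return a copy of a code-to-name map without the excluded codes."""
    return {
        code: name
        for code, name in cldr_names.items()
        if code not in EXCLUDED_CODES
    }


# Hand-written sample standing in for the real CLDR output (assumption).
sample = {
    "BR": "Brazil",
    "EU": "European Union",
    "FR": "France",
    "XA": "Pseudo-Accents",
}
filtered = filter_countries(sample)
```

Keeping the exclusions in one constant means the same filter can be applied to the list in any language, since the codes do not change with the language.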

Note: What should we do with events whose country was filled in as free text? @ifried

Some notes from team meeting on June 2 2025:

  • We can use the CLDR extension, which has a list of countries that are not mapped to continent
  • We can probably exclude the countries that are not in the canonical list, but Ilana will take an extra pass to investigate
  • As for how we handle older countries stored as a free text:
    • Can we map some of the older countries to countries in the CLDR extension?
    • We need some way of handling default/blank countries for a) online/hybrid events that did not have country in the past, and b) online/hybrid events that cannot have their old country translated to the new country - this can be a part of the investigation to look into this
    • VPM: I recommend that we clean the data, so that it is more useful for data analysts

I agree that using CLDR seems the best way to do this.

1 - Store the country abbreviation, e.g. (aae for Arbëresh). The DB column is already a string, so we can keep it as it is or make it shorter. I would leave it as it is, since we have events with country data stored as strings.

Reminder that we have a cea_full_address column that is basically a huge hack to store address and country together. This will likely also need to be cleaned up at the same time.

  • As for how we handle older countries stored as a free text:
    • Can we map some of the older countries to countries in the CLDR extension?

"Some", yes. But surely there are going to be values that we can't automatically replace. It's not clear what we'd do about those.

  • We need some way of handling default/blank countries for a) online/hybrid events that did not have country in the past, and b) online/hybrid events that cannot have their old country translated to the new country - this can be a part of the investigation to look into this

Note that the country field is not required, even for in-person events. The rationale was that organizers might create an event before finalizing the venue. We can lift this restriction, maybe just for the country field (and leave it for the address field), if we want to.

  • VPM: I recommend that we clean the data, so that it is more useful for data analysts

I agree, although going from free text to structured is never pleasant...

Also, a couple more things:

  • Being able to translate the country is a good thing on its own, but watch out when displaying it together with the address. The address is free text and not translated, so the two might be in different languages and scripts (and possibly directionality). I'm not sure how bad this would be, and what to do about it.
  • When we do geocoding (T316126), this will go back to being a freetext field (user would enter the full address, including country, as a single string). Not entirely sure if this has any implications on the current work.

Just a note, these are the 12 "Countries" that exist in CLDR but not in https://gitlab.wikimedia.org/repos/movement-insights/canonical-data/-/blob/main/country/countries.tsv?ref_type=heads
All the entries in countries.tsv exist in CLDR; the only difference is that CLDR has these 12 extra entries:

['CP' => 'Clipperton Island'],
['CQ' => 'Sark'],
['DG' => 'Diego Garcia'],
['EA' => 'Ceuta & Melilla'],
['EU' => 'European Union'],
['EZ' => 'Eurozone'],
['IC' => 'Canary Islands'],
['QO' => 'Outlying Oceania'],
['TA' => 'Tristan da Cunha'],
['UN' => 'UN'],
['XA' => 'Pseudo-Accents'],
['XB' => 'Pseudo-Bidi']

My recommendation for when we implement the new country list would be to follow one of the options below, or possibly a combination of options 1 and 2 during the migration period (if we decide to migrate the old data to the new 'Country Abbreviation' field):

  1. For countries entered as free text (i.e., the ones we currently have), if a value exists, we show it in the dropdown list as the selected option exactly as it appears in the database. This means these values will not be translated.
    • This approach allows both the new and old versions to work in parallel.
  2. Create a script to update the old values to the new ones
    • This would require us to create a hardcoded list mapping each value we currently have to its new value, so we can update them in the database
  3. Alternatively, we could choose not to update the old values. This would mean those events wouldn’t be filterable by country once the new country field is implemented. It also means we won’t be able to generate country-based reports for events that still use the old version.

Also, this can be done in 2 parts, doing 1 first and 2 whenever we want/need to.

This way we would not be blocked by the old values stored as free text, and there is still room to create the script that updates them to the new values later.
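
To make option 1 concrete, here is a small Python sketch of how the dropdown options could be built so that a legacy free-text value stays selectable. `build_country_options` and its inputs are hypothetical names for illustration, not existing extension code, and the sort is a plain case-insensitive one rather than proper locale-aware collation.

```python
def build_country_options(countries, stored_value):
    """Return (options, selected) for a country dropdown.

    countries: code-to-name map in the user's language.
    stored_value: what the database currently holds, either a known code,
    a legacy free-text string, or None/empty.
    """
    # Alphabetize by displayed name (simple case-insensitive sort).
    options = sorted(countries.items(), key=lambda item: item[1].casefold())
    if stored_value and stored_value not in countries:
        # Legacy free-text value: show it verbatim (untranslated) as an
        # extra option so that saving the form does not silently drop it.
        options.insert(0, (stored_value, stored_value))
    return options, stored_value or None


countries = {"FR": "France", "BR": "Brazil"}
legacy_options, legacy_selected = build_country_options(countries, "Deutschland")
new_options, new_selected = build_country_options(countries, "BR")
```

One caveat with this sketch: a legacy free-text value that happens to equal a valid code would be treated as that code, which may or may not be what we want.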

As for cea_full_address, as @Daimona said, it is a hack to store address and country together, and we also need to clean it up at the same time.

  1. For countries entered as free text (i.e., the ones we currently have), if a value exists, we show it in the dropdown list as the selected option exactly as it appears in the database.

A more common approach in MW would be to use a "selectorother" field that combines a dropdown and a text field, similar to the "Revision ID or difference" field on Special:Diff. But that's just a UI difference, the general approach proposed remains valid.

  2. Create a script to update the old values to the new ones

This approach alone is not viable. As mentioned above and in a previous meeting, there are values that are simply not countries (or not a single country) and there'd be no way to map them.

  • This would require us to create a hardcoded list with all the current ones we have and its new value, so we can update them in the database

I thought we could do it without hardcoding? For each available language, iterate over all localised country names, and if there is a match, do the mapping. For a one-off maintenance script, I think that's more than OK. But of course, it might expose us to subtle mistakes: say, if there are multiple localised matches for a single source value; or if a source value happens to match a translation for a different country. But I suppose these would be rare and could be reviewed manually if needed.
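
A rough Python sketch of that idea, to show how the two failure modes could be surfaced for manual review. The function names and the sample data are made up for illustration ("Testland" and the codes X1/X2 are deliberately fake, to force a collision); a real script would read the localised names from CLDR.

```python
def build_reverse_index(names_by_language):
    """Map each localised name (case-folded) to the set of codes it names."""
    index = {}
    for names in names_by_language.values():
        for code, name in names.items():
            index.setdefault(name.casefold(), set()).add(code)
    return index


def map_free_text(value, index):
    """Classify a stored free-text value.

    Returns ("matched", code), ("ambiguous", sorted list of codes),
    or ("unmatched", None).
    """
    codes = index.get(value.strip().casefold(), set())
    if len(codes) == 1:
        return "matched", next(iter(codes))
    if codes:
        return "ambiguous", sorted(codes)
    return "unmatched", None


# Hand-written sample data (assumption), not real CLDR output.
names_by_language = {
    "en": {"FR": "France", "DE": "Germany", "X1": "Testland"},
    "de": {"FR": "Frankreich", "DE": "Deutschland", "X2": "Testland"},
}
index = build_reverse_index(names_by_language)
```

Only the "matched" rows would be updated automatically; "ambiguous" and "unmatched" rows would be left alone and listed for review.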

  3. Alternatively, we could choose not to update the old values. This would mean those events wouldn’t be filterable by country once the new country field is implemented. It also means we won’t be able to generate country-based reports for events that still use the old version.

I don't fully understand how this is different from option 1. I think they are complementary, as they deal with different aspects (selector UI vs filterability and generating reports).

One thing to keep in mind though, is how we represent free-text countries in the database. Currently we have a single column; mixing free-text values and country codes in the same column feels wrong to me. How we address this would also probably reflect in the ability to filter the data and generate reports.

And in general, I agree that we should probably look for a combination of these options, that can also be done at different points in time.

I thought we could do it without hardcoding? For each available language, iterate over all localised country names, and if there is a match, do the mapping.

For this we will also need some degree of normalization on the inputs. We can force lowercase, and we should also normalize accents / special characters. I think we can use iconv, and if not, there's https://www.mediawiki.org/wiki/Equivset.
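
For illustration only, here is what that normalization could look like in Python, using the standard `unicodedata` module in place of iconv or Equivset (so this is a stand-in technique, not what the maintenance script would actually call). The helper name `normalize` is hypothetical.

```python
import unicodedata


def normalize(value: str) -> str:
    """Case-fold, strip accents and collapse whitespace for comparison."""
    # NFKD splits accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", value.strip().casefold())
    # Drop the combining marks, keeping the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Collapse runs of whitespace into single spaces.
    return " ".join(stripped.split())
```

Both the stored free-text value and each localised country name would go through the same function before being compared. Note that this only removes diacritics; it does not treat look-alike characters from different scripts as equal, which is the kind of thing Equivset is for.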

Thanks for these comments, it sounds OK to me. As next steps I will create subtasks for the Epic; feel free to add more if I missed any, and also to improve the AC if needed. @Daimona, @ifried, @MHorsey-WMF, @VPuffetMichel

  • Create a new column in the DB to store the country code
  • Create a script to update the current values to the new ones, although we still have to make a final decision on what to do with invalid ones, since the country field will be a required field
  • Front end: add the new country drop-down field
  • Back end: store the country code in the new field
  • Show country code on event details
  • Show country code on event details modal
  • Add the ability to filter by country on collaboration list
  • Update APIs to handle the new field
  • Show country on collaboration list

T397269
T397270
T397271
T397273
T397274
T397275
T397276
T397277
T397278

Note: we will also need to look for (in-person and hybrid) events that have no corresponding row in ce_address.

This investigation is complete