Implement a volunteer-written wordlist to base the initial Engvar Edit Suggestion on
Closed, Resolved · Public

Description

Every Edit Check and Suggestion implements a policy and/or guideline that volunteers have already reached alignment on.

In line with the above, this task involves identifying which community-written wordlist the Editing Team can use as the initial basis for the Engvar Edit Suggestion (T413420).

For context, the current iteration of the Engvar Edit Suggestion (T413420) uses a wordlist the Editing Team compiled for demonstration purposes.

Word list

To be determined.
Feedback requested at https://en.wikipedia.org/wiki/Wikipedia_talk:Manual_of_Style/Spelling#Suggested_Edits_feature_-_wordlist

Done

  • The "Word list" section of this task contains a list of British-American English variations volunteers want to see substituted
  • The contents of the "Word list" section of this task are implemented within the Engvar Edit Suggestion (T413420)

Timeline

  • Clarify with team how quickly we expect to be able to get feedback from volunteers
  • Can we pick a short list that can go to production as a reasonable demo, and that volunteers can then begin editing or expanding?

Event Timeline

ppelberg renamed this task from Consult with volunteers at en.wiki about what list to base initial Engvar Edit Suggestion on to Implement a volunteer-written wordlist to base the initial Engvar Edit Suggestion on. Jan 16 2026, 5:55 PM
ppelberg updated the task description.

The current mapping was derived as follows:

  1. Download en-US and en-GB wordlists from http://wordlist.aspell.net/
  2. Use a script to compare these lists, and build us_only.txt and gb_only.txt.
  3. Heuristically match words from the two lists with similar spellings
  4. Using this match list as a shortlist of likely candidates, manually create a one-way substitution mapping (en-US->en-GB)
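
Step 2 could look roughly like the sketch below. It is not the script that was actually used. The function names and the input paths are hypothetical, and it assumes each downloaded wordlist is a plain-text file with one word per line. Only the output names us_only.txt and gb_only.txt come from the steps above.

```python
from pathlib import Path


def load_words(path):
    """Read one word per line, ignoring blank lines (assumed file format)."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip() for line in text.splitlines() if line.strip()}


def exclusive_words(us_words, gb_words):
    """Return (us_only, gb_only): words that appear in just one of the lists."""
    us, gb = set(us_words), set(gb_words)
    return sorted(us - gb), sorted(gb - us)


def write_exclusive_lists(us_path, gb_path, out_dir="."):
    """Build us_only.txt and gb_only.txt from two wordlist files."""
    us_only, gb_only = exclusive_words(load_words(us_path), load_words(gb_path))
    out = Path(out_dir)
    (out / "us_only.txt").write_text("\n".join(us_only) + "\n", encoding="utf-8")
    (out / "gb_only.txt").write_text("\n".join(gb_only) + "\n", encoding="utf-8")
    return us_only, gb_only
```

Words shared by both lists (the vast majority) drop out here, which leaves only the spellings specific to one variety as input for the matching in step 3.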

The heuristic matching was done by trying the following spelling substitutions on US-only words and seeing if they match any GB-only word:

re.sub(r'ol', 'oul', us_word)
re.sub(r'or', 'our', us_word)
re.sub(r'er', 're', us_word)
re.sub(r'([lprst])(ing|ed|ers?)$', r'\1\1\2', us_word)
re.sub(r'se', 'ce', us_word)
re.sub(r'yz', 'ys', us_word)
re.sub(r'sk', 'sc', us_word)
re.sub(r'e(?=[acdgmos])', 'ae', us_word)
re.sub(r'e(?=[abs])', 'oe', us_word)
re.sub(r'll', 'l', us_word)
re.sub(r'og', 'ogue', us_word)
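
As a rough sketch of how these substitutions could be applied (the original script is not shown in this task, so the function name and overall structure are assumptions), each rule is tried on every US-only word, and the result is kept when it appears in the GB-only list:

```python
import re

# The substitution rules quoted above, as (pattern, replacement) pairs.
RULES = [
    (r'ol', 'oul'),
    (r'or', 'our'),
    (r'er', 're'),
    (r'([lprst])(ing|ed|ers?)$', r'\1\1\2'),
    (r'se', 'ce'),
    (r'yz', 'ys'),
    (r'sk', 'sc'),
    (r'e(?=[acdgmos])', 'ae'),
    (r'e(?=[abs])', 'oe'),
    (r'll', 'l'),
    (r'og', 'ogue'),
]


def heuristic_matches(us_only, gb_only):
    """Return sorted (us_word, gb_word) candidate pairs (hypothetical helper)."""
    gb_set = set(gb_only)
    pairs = set()
    for us_word in us_only:
        for pattern, replacement in RULES:
            candidate = re.sub(pattern, replacement, us_word)
            # Keep the pair only if the rule changed the word and the
            # result is a known GB-only spelling.
            if candidate != us_word and candidate in gb_set:
                pairs.add((us_word, candidate))
    return sorted(pairs)
```

Because each rule replaces every occurrence of its pattern and rules are tried one at a time, this produces only a shortlist of likely candidates. Words that need two rules at once, or a rule applied to just one of several occurrences, are missed, which is consistent with the mapping being created manually in step 4.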

We might benefit from a small initial set of words that won't be too controversial (or difficult to review) and that we can take to production, while using the full list to gather feedback.

I have been considering whether it's worth sorting our proposed list by frequency to identify a smaller "starter" list, but it's unlikely to be worth the effort. I had a quick look around and found the dataset on this GitHub repo, which is licensed CC BY-SA 4.0. I believe this would be a licence-compatible dataset to use. However, we don't have a mechanism for attribution in the configuration files, and I'm not sure whether attributing in the commit would be sufficient.

Arguably a quick once-over for words I recognise, or a quick sort by length, might be a more efficient use of time. I'll collect feedback.
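
For comparison, both shortlisting ideas (by frequency and by length) are small operations once the data is in hand. The sketch below is illustrative only. It assumes the mapping is a dict of en-US to en-GB spellings and that word frequencies are available as a dict of counts. The function names are hypothetical and the format of the frequency dataset mentioned above has not been checked.

```python
def by_frequency(mapping, frequencies, size):
    """Keep the `size` pairs whose US spelling is most frequent.

    `frequencies` maps a word to a count; words missing from it count as 0.
    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(
        mapping.items(),
        key=lambda pair: (-frequencies.get(pair[0], 0), pair[0]),
    )
    return dict(ranked[:size])


def by_length(mapping, size):
    """Keep the `size` pairs with the shortest US spelling."""
    ranked = sorted(mapping.items(), key=lambda pair: (len(pair[0]), pair[0]))
    return dict(ranked[:size])
```

Either function returns a smaller mapping in the same shape as the full one, so a starter list produced this way could be reviewed and expanded in the same way as the complete list.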

For future consideration, Wikidata has lexemes (e.g. color/colour) that note varieties of English. It might be possible (but perhaps difficult) to generate a list of conversions based on those.

We've implemented an initial wordlist and have equipped volunteers [i] with what we think is the know-how they need to edit it.


i. https://en.wikipedia.org/wiki/Wikipedia_talk:Manual_of_Style/Spelling#c-Quiddity_(WMF)-20260206021600-Quiddity_(WMF)-20260128212900