
Convert between English language variants in display of pages
Open, Low, Public

Description

Author: jkinz

Description:
Submitted by Jeff Kinz

This is an idea to settle the US vs UK spelling issue on WP.

Overview: Support both, have browser show the one desired by the viewer.

How it would work: For authoring pages:

AS = alternate spelling(s)

A notation like {UK:colour|US:color} for page input and editing. 

It can support as many alternate spellings (AS) per word as needed.
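
A minimal sketch of how the notation above could be rendered for one viewer. The names `AS_PATTERN` and `render` are hypothetical, and the fallback to the first listed spelling is an assumption, not part of the proposal:

```python
import re

# Matches the proposed notation, e.g. {UK:colour|US:color}
AS_PATTERN = re.compile(r"\{([A-Za-z]+:[^{}|]+(?:\|[A-Za-z]+:[^{}|]+)*)\}")


def render(text: str, variant: str) -> str:
    """Replace each AS group with the spelling listed for `variant`.

    Assumption: if the variant is not listed, use the first spelling.
    """
    def pick(match: re.Match) -> str:
        options = dict(part.split(":", 1) for part in match.group(1).split("|"))
        first = next(iter(options.values()))
        return options.get(variant, first)

    return AS_PATTERN.sub(pick, text)
```

For example, `render("The {UK:colour|US:color} red", "US")` yields the US spelling, and the same call with `"UK"` yields the UK one.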

AS words that are not in AS notation could be detected when changes
are submitted, and converted to AS notation automatically.

It's possible no AS notation is needed at all.  If servers can detect
all AS words upon submission, and set a flag on the page or set AS
notation on the AS words, then the server can present the preferred
spelling of the word on the page at output time**. (see below)
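
A rough sketch of the detect-on-submission step, assuming the server holds a lookup table of known AS words. `AS_TABLE` and `auto_tag` are hypothetical names, and the table here is only a tiny illustration:

```python
import re

# Hypothetical lookup table; a real one would be far larger.
AS_TABLE = {
    "colour": "{UK:colour|US:color}",
    "color": "{UK:colour|US:color}",
    "centre": "{UK:centre|US:center}",
    "center": "{UK:centre|US:center}",
}

# Either an existing {...} group (left alone) or a bare word.
TOKEN = re.compile(r"\{[^{}]*\}|[A-Za-z]+")


def auto_tag(text: str) -> str:
    """Wrap known AS words in AS notation; skip words already tagged."""
    def tag(match: re.Match) -> str:
        token = match.group(0)
        if token.startswith("{"):
            return token  # already in AS notation
        # Exact-case lookup only, so capitalised words are left untouched.
        return AS_TABLE.get(token, token)

    return TOKEN.sub(tag, text)
```

This sketch tags every instance blindly, so it is only safe if no AS word has a meaning whose spelling does not vary.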

How it might work: For Displaying:

 Two possibilities.

 First method:  The WP server determines which AS the viewer wants
 and generates the page with that version of the spelling.

 Second method: JavaScript in the page looks up the viewer's
 preference and alters the document to have the matching spelling.

Third method: Unless the viewer has a preference that overrides this, use the IP address to geolocate the reader and display the likely preferred version.
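
The order of precedence in the third method could look roughly like this. It assumes the IP address has already been resolved to a country code by some other means; `COUNTRY_TO_VARIANT`, `DEFAULT_VARIANT` and `choose_variant` are hypothetical, and the default is an arbitrary choice for the sketch:

```python
from typing import Optional

# Hypothetical mapping from ISO country code to likely spelling style.
COUNTRY_TO_VARIANT = {"US": "US", "GB": "UK", "AU": "UK", "IE": "UK", "NZ": "UK"}
DEFAULT_VARIANT = "UK"  # arbitrary default for this sketch


def choose_variant(preference: Optional[str], country_code: Optional[str]) -> str:
    """An explicit preference wins; otherwise guess from the country."""
    if preference:
        return preference
    if country_code:
        return COUNTRY_TO_VARIANT.get(country_code.upper(), DEFAULT_VARIANT)
    return DEFAULT_VARIANT
```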


 Summary:  Using both may be the most cost effective.

 1. JavaScript in the page looks for a preference cookie and displays the
    page using the selected spelling style.

 2. If there is no cookie yet, the JavaScript displays the AS words
    as clickable.  If the viewer clicks on a word, a "select spelling
    style" dialog is shown.

 3.  Determine the preference.
 4.  Set a cookie to last until the end of the viewer's current session(s).
 5.  Page content is based on the cookie preference.
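
The cookie handling in these steps can be sketched with Python's standard `http.cookies` module. The cookie name `as_variant` and both helper functions are assumptions made for illustration:

```python
from http.cookies import SimpleCookie
from typing import Optional

COOKIE_NAME = "as_variant"  # hypothetical cookie name


def read_preference(cookie_header: str) -> Optional[str]:
    """Return the stored spelling style, or None if no cookie is set."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get(COOKIE_NAME)
    return morsel.value if morsel is not None else None


def make_session_cookie(variant: str) -> str:
    """Build a Set-Cookie value that lasts for the browser session."""
    cookie = SimpleCookie()
    cookie[COOKIE_NAME] = variant
    cookie[COOKIE_NAME]["path"] = "/"
    # No "expires" or "max-age" is set, so browsers treat it as a session cookie.
    return cookie.output(header="").strip()
```

When `read_preference` returns None, the page would fall back to showing the AS words as clickable, as in step 2.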

RISK DUE to IGNORANCE-

Are there any English words which have two meanings, but only one
of those meanings has UK/US alternate spelling?

By this I mean the following: Assume a word 'A' which has two
different meanings: A-1 and A-2. 

For meaning A-1, A is spelled 'A' in both US and UK spelling.  
For meaning A-2 the UK spelling is still 'A'
For meaning A-2 the US spelling is "A#".

Bonnet and Hood are types of headgear across the English speaking world. But they are also a part of the car, the bit you open and lift up to see the engine. To complicate matters Hood in US English has a third meaning, an abbreviation of neighbourhood with a derogatory implication. Hood is also a rare surname, but there was an Admiral Hood who had a British Battlecruiser named after him, and that ship is very famous.

Fag has two different meanings in Britain, and a very different meaning in the US. Ditto Faggot, though the two British meanings are not obviously related to the word Fag.


If word A is in a page, meaning A-1, and that page is processed for
display with US alternate spelling, the A is changed to A# thereby
changing the meaning from A-1 to A-2.

I don't know if any such words exist, but they may.  If any such words
exist then this idea cannot be used.  The problem of determining semantic
word meanings from context is only partially solved by Bayesian analysis
or hidden Markov models. Both are expensive to calculate, and neither
produces human-level quality answers.


If no such words exist, then this solution is a viable one. 

A word with one spelling but multiple meanings: "read" . It can mean
"I will read the manual." or it can mean "I have read the manual."
The first is pronounced like "reed".  The second is pronounced like
"red".

One word, one spelling, two different meanings, two different
pronunciations.

Worse, the phrase "I read the book" can use either meaning of the
word. So programmatically deciding which meaning of a word the
sentence is using is not workable.  In this case it doesn't matter
which meaning is used because both are spelled the same.  But that
may not be true for all AS words.


Here is a contrived example of a word with 2 meanings, whose spelling
changes in one UK/US spelling style:

Assume the word "blew" has two meanings, both are past tense.

#1 - To strike, hit.  "I blew him down."       (I struck him down)
#2 - blowing air.     "I blew across the cup." (I breathed across the cup)

Assume that in US spelling meaning #2 is spelled bloo while meaning
#1 is spelled the same way, "blew", in both UK and US.

In converting #1 above from UK to US spelling the meaning would change:

"I blew him down"     ( I struck him down )
"I bloo him down."    ( I breathed him down )   {have a mint, fella!}
 
Because the server cannot determine which meaning the UK version has,
it cannot accurately determine which word to display for a US page.
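
The failure can be reproduced with a naive word-for-word converter. The table below uses the contrived "bloo" spelling from the example, which is not a real US spelling; `UK_TO_US` and `naive_convert` are hypothetical names:

```python
import re

# Contrived table from the example above: "bloo" is not a real US spelling.
UK_TO_US = {"blew": "bloo"}


def naive_convert(text: str) -> str:
    """Replace every listed word, with no knowledge of its meaning."""
    return re.sub(
        r"[A-Za-z]+",
        lambda m: UK_TO_US.get(m.group(0), m.group(0)),
        text,
    )
```

The converter handles "I blew across the cup." as intended, but it also turns "I blew him down." into "I bloo him down.", because nothing in the table says which meaning is in use.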
 

Conclusion: 

If it can be determined that there are no English words that 
fit the scenario above, this idea can be used. Otherwise, it cannot.
  • Or a bot can scan the WP database and set the flag on any pages with any AS words on them.

This email partially created with "Dragon Naturally Speaking" speech
recognition system. A tool I'm proud to have worked on.
Note: the email may have incorrectly transcribed content.

Jeff Kinz, Emergent Design. "Carpe Diem!"

"Piscis Carpe" ->"Fish the Seize"

Details

Reference
bz31015

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 11:53 PM
bzimport set Reference to bz31015.
bzimport added a subscriber: Unknown Object (MLST).

This could probably be done by making a LanguageConverter for English (I had actually made this some time ago, but there might be issues with the fallback system).
In that case JavaScript is not needed for this.

thor.malmjursson wrote:

Seems like an exceptionally bright idea to settle one of the longest running
age old arguments on Wiki - "Whose spelling is right?" - Personally, I don't
mind since English is English, and I regularly mix the two anyhow. I'm voting
for this.

jkinz wrote:

One additional note:

If editors enter the UK/US alternate spellings for each instance of an AS word, then this idea can be used even if there are English words which produce the scenario described under RISK DUE to IGNORANCE.

jkinz wrote:

And another:

Another way to defeat the RDI (Risk Due to Ignorance) scenario is to scan each change submission for new instances of AS words and generate a dialog box for the editor to select the correct alternate spellings.

If the meaning of the word being used has no alternate spelling, then the editor selects "no alternate" and that instance is not tagged as an AS word so the RDI scenario never happens.

If the meaning of the word being used does have an alternate spelling across the pond, then the editor selects that alternate from the list and that alternate is kept in the page so the AS selection code has a human expert decision to rely on. Once again - no RDI scenario.
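
A sketch of that scan-then-ask flow, with the dialog box replaced by a plain dictionary of editor decisions. `CANDIDATES`, `find_candidates` and `apply_decisions` are hypothetical, and the table reuses the contrived "blew"/"bloo" pair:

```python
import re
from typing import Dict, List, Tuple

# Hypothetical table of words with an alternate spelling for at least one meaning.
CANDIDATES = {"blew": "{UK:blew|US:bloo}"}

# Either an existing {...} group (never a candidate) or a bare word.
TOKEN = re.compile(r"\{[^{}]*\}|[A-Za-z]+")


def find_candidates(text: str) -> List[Tuple[int, str]]:
    """Return (offset, word) for each untagged word the editor must review."""
    return [
        (m.start(), m.group(0))
        for m in TOKEN.finditer(text)
        if m.group(0) in CANDIDATES
    ]


def apply_decisions(text: str, decisions: Dict[int, bool]) -> str:
    """Tag only the instances the editor approved (offset -> True).

    False or a missing offset means "no alternate": the word stays untagged.
    """
    def tag(m: re.Match) -> str:
        word = m.group(0)
        if word in CANDIDATES and decisions.get(m.start(), False):
            return CANDIDATES[word]
        return word

    return TOKEN.sub(tag, text)
```

Because the editor's choice is stored per instance, the display code never has to guess the meaning.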

You may also want to take a look at the following proposal for using the LanguageConverter on Wikisources for modernization of old texts:
http://wikisource.org/wiki/Wikisource:Scriptorium/Archives/Jan_2010_-_Dec_2010#Using_LanguageConverter_syntax_at_Wikisources

There is also a JavaScript script being used on some Wikisources for modernization of old texts. See

  • [[MediaWiki talk:Modernisation.js]]
  • [[fr:s:Wikisource:Scriptorium/Janvier_2011#New_version_of_script_for_modernization]]

It was also adapted so that it could be used as a "Language Converter" on Portuguese Wikipedia (since T28121 is still open).

There are instructions for using it as a user script to deal with English variants on English Wikipedia. See:
[[Wikipedia:WikiProject_User_scripts/Scripts/Language_Converter]]

(In reply to comment #0)

Two possibilities.

First method:  The WP server determines which AS the viewer wants
and generates the page with that version of the spelling.

I don't think this can be done because of caching.

(In reply to comment #6)

(In reply to comment #0)

Two possibilities.

First method:  The WP server determines which AS the viewer wants
and generates the page with that version of the spelling.

I don't think this can be done because of caching.

This is already done on zhwiki.
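
One way caching and per-viewer variants can coexist is to make the variant part of the cache key. A minimal sketch under that assumption follows; it is not a description of how zhwiki or MediaWiki actually does it, and `VariantCache` is a hypothetical class:

```python
from typing import Callable, Dict, Tuple


class VariantCache:
    """Caches rendered pages separately for each spelling variant."""

    def __init__(self, render: Callable[[str, str], str]) -> None:
        self._render = render
        self._store: Dict[Tuple[str, str], str] = {}
        self.misses = 0  # counts how often a page had to be rendered

    def get(self, title: str, variant: str) -> str:
        key = (title, variant)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._render(title, variant)
        return self._store[key]
```

The cost is one cached copy per variant, so two English variants would at most double the cached output for affected pages.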

I strongly support this idea, which is controversial, I know. I believe that the only way that LanguageConverter can be properly supported is if it is more universally enabled. We need to make the variant system work well enough that "even English" uses it -- and then it will also be working well enough for our Wikipedias which have no choice but to enable it, like zhwiki, Serbian wiki, etc. "Dogfooding" LanguageConverter would make maintenance much easier.

One step toward this goal is T45547... but ultimately it would be best to support en-gb/en-us, etc. as variants as well.

One important point about it being universal would be to support the use of LanguageConverter in interface messages. Then it could also be considered for https://www.mediawiki.org/wiki/Internationalisation_wishlist_2017#Better_support_for_formal_and_informal_variants

There are words with radically different meanings in British and American English: Fanny, Fag, Bum, Pants and of course Spunk/Spunky. There are also lots of words with secondary meanings that are different even if the main meaning is the same. Hood and Bonnet are both types of headgear, but they are also the same part of the car on different sides of the pond. Table is the same piece of furniture, but to table something has opposite meanings.

So if we offer a choice we will need to find a way to mark which meaning certain words have.

On the other hand we should remember the big reason for doing this: our readership in the US is lower than in other parts of the English-speaking world, and American unfamiliarity with non-American spelling is the only plausible theory yet mooted for this.

The existing syntax for special conversion rules is documented at https://www.mediawiki.org/wiki/Writing_systems/Syntax . Some of these concerns may already be addressed there, and there is no need to invent a new syntax instead of using one that is already established.

In the Chinese conversion system there are also a number of one-to-many conversion pairs. They are mostly solved by adding them to a conversion group together with the neighboring characters they are likely to appear with; the above syntax is then used for individual cases that the general conversion rules cannot cover. Additionally, conversion groups are created so that, when a specific meaning of a word is more likely to appear within a single topic (or theme, series, or work), the group can be applied to that topic and handle those words specially.
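
An English analogue of that approach can be sketched with longest-match phrase rules plus an optional topic group. This is an illustration only, not LanguageConverter's implementation or its real tables; `GENERAL_RULES`, `GROUPS` and `convert` are hypothetical:

```python
import re
from typing import Optional

# Hypothetical rules; multi-word entries give context for ambiguous words.
GENERAL_RULES = {"colour": "color", "car bonnet": "car hood"}
# Hypothetical topic-specific group, applied only to pages on that topic.
GROUPS = {"motoring": {"bonnet": "hood"}}


def convert(text: str, group: Optional[str] = None) -> str:
    """Convert UK to US spelling using general rules plus one topic group."""
    rules = dict(GENERAL_RULES)
    if group:
        rules.update(GROUPS[group])
    # Longest source phrase first, so "car bonnet" wins over a bare "bonnet".
    ordered = sorted(rules, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b")
    return pattern.sub(lambda m: rules[m.group(0)], text)
```

Without a group, a bare "bonnet" (the headgear) is left alone; with the "motoring" group applied, it becomes "hood".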