
Bad decoding of U+03B5 ε (epsilon)
Open, Needs Triage | Public | BUG REPORT

Description

Concerns U+03B5 ε GREEK SMALL LETTER EPSILON (&epsi; &epsilon;) and the Lua functions mw.text.decode() and mw.ustring.gsub().
Bug report at enwiki [https://en.wikipedia.org/w/index.php?title=Module_talk:DecodeEncode#epsilon]

The issue
After mw.text.decode() resolves the HTML entity &epsilon;, the resulting plain character is _not found_ by mw.ustring.gsub(). The alternative HTML entity &epsi; has no such issue.

Report limitations
The original discovery, report and bug reproduction are at enwiki, linked at the top. There, :en:module:DecodeEncode and :en:module:String are used live. No Lua patterns are used (no "%"). Here at Phabricator, pseudocode is used and "results" may be hardcoded. In text, the &amp; escape code is used.

Steps to replicate the issue:

  • 1. Create research string: X&epsi;1X&epsilon;2X (shows live and unedited as: "Xε1Xε2X" as expected)
  • 2. Decode the string with mw.text.decode() (inner function)
  • 3. On the decoded result, use mw.ustring.gsub() to replace the plain character "ε" with "E" (outer function):

{{#invoke:String|replace|source={{#invoke:DecodeEncode|decode|s=X&epsi;1X&epsilon;2X}}|pattern=ε|replace=E|plain=true}}

Results

  • 4. (search-and-replace pattern uses "ε" from "Xε1X"): XE1Xε2X
  • 5. (search-and-replace pattern uses "ε" from "Xε2X"): XE1Xε2X

Expected
Only one character "ε" exists, so I expect every "ε" to be replaced by "E": "XE1XE2X" (ok)
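One way to check whether the two decoded characters really are the same code point is to dump the code points of a string. The following is a minimal sketch in pure Lua 5.1, not part of :en:module:DecodeEncode; the helper names codepoints and dump are hypothetical, and the input is assumed to be well-formed UTF-8.

```lua
-- Decode a UTF-8 string into a list of Unicode code points.
-- Assumes well-formed UTF-8; no validation is attempted.
local function codepoints(s)
    local out = {}
    local i = 1
    while i <= #s do
        local b = s:byte(i)
        local cp, extra
        if b < 0x80 then
            cp, extra = b, 0            -- 1-byte sequence (ASCII)
        elseif b < 0xE0 then
            cp, extra = b - 0xC0, 1     -- 2-byte sequence
        elseif b < 0xF0 then
            cp, extra = b - 0xE0, 2     -- 3-byte sequence
        else
            cp, extra = b - 0xF0, 3     -- 4-byte sequence
        end
        -- Each continuation byte contributes its low 6 bits.
        for k = 1, extra do
            cp = cp * 64 + (s:byte(i + k) - 0x80)
        end
        out[#out + 1] = cp
        i = i + extra + 1
    end
    return out
end

-- Format the code points as "U+XXXX" labels for a readable dump.
local function dump(s)
    local labels = {}
    for _, cp in ipairs(codepoints(s)) do
        labels[#labels + 1] = string.format("U+%04X", cp)
    end
    return table.concat(labels, " ")
end
```

Inside a Scribunto module, something like dump(mw.text.decode('X&epsi;1X&epsilon;2X', true)) could then show whether both entities decode to U+03B5. This report does not state what mw.text.decode() actually returns for &epsilon;, so the sketch is only a diagnostic aid.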

Workaround A ad hoc
In template code: add an innermost function that _first_ replaces "&epsilon;" with "&epsi;" in the research string

{{#invoke:String|replace|source={{#invoke:DecodeEncode|decode|s={{#invoke:String|replace|source=X&epsi;1X&epsilon;2X|pattern=&epsilon;|replace=&epsi;|plain=true}}}}|pattern=ε|replace=E|plain=true}}

Result: "XE1XE2X" (ok)

Workaround B in module (THIN SPACE example)
Plan: early in the :en:module:DecodeEncode function, replace bad "&epsilon;" with good "&epsi;"
Current and proposed module/sandbox code at [https://en.wikipedia.org/wiki/Module_talk:DecodeEncode#Workaround_B]

About THIN SPACE: the character U+2009 THIN SPACE (&thinsp; &ThinSpace;) appears to have a similar issue.
The current live module code addresses this:

s = mw.ustring.gsub( s, '&thinsp;', '&ThinSpace;' )

In the module/sandbox, I have added similar Lua code for epsilon:

s = mw.ustring.gsub( s, '&epsilon;', '&epsi;' )

  • /sandbox tests:

{{#invoke:String|replace|source={{#invoke:DecodeEncode/sandbox|decode|s=X&epsi;1X&epsilon;2X}}|pattern=ε|replace=E|plain=true}}

Result B-1 (search-and-replace pattern uses "ε" from "Xε1X"): "XE1XE2X" (ok)
Result B-2 (search-and-replace pattern uses "ε" from "Xε2X"): "XE1XE2X" (ok)

This appears to solve the issue.
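If more entity pairs turn out to need the same treatment, the two gsub lines could be folded into a small table-driven helper. The following is a sketch only, not the code in the module or its sandbox: the names ENTITY_ALIASES and normalizeEntities are hypothetical, and it uses plain string.gsub instead of mw.ustring.gsub on the assumption that the entity names are pure ASCII.

```lua
-- Entity spellings to rewrite before decoding: { bad, good }.
-- Both pairs are taken from this report; the table form is illustrative.
local ENTITY_ALIASES = {
    { "&thinsp;",  "&ThinSpace;" },
    { "&epsilon;", "&epsi;" },
}

-- Rewrite each "bad" entity spelling into its "good" alias.
local function normalizeEntities(s)
    for _, pair in ipairs(ENTITY_ALIASES) do
        -- "&", ";" and letters are not Lua pattern magic characters,
        -- so the entity literals can be used directly as patterns.
        -- Assigning to one variable discards gsub's second return value.
        s = s:gsub(pair[1], pair[2])
    end
    return s
end
```

In the module, the decode function would call such a helper on s before passing the result to mw.text.decode(). Adding another pair then only requires a new table row.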

Workaround C in mw, Lua
Changes in mw, Lua: out of my league.

Event Timeline

DePiep updated the task description.

The requested edit at enwiki was executed at 03:11, 19 February 2023 (UTC).
This report is considered moot for enwiki, which now uses the workaround.