Search and Replace is replacing an extra character for some words - Sinhala wiki
Closed, Resolved (Public)

Description

Author: wikibugs

Description:
Screen print of the error

Reporting against Babaco Release : r57957

Steps to Reproduce ::
Link : http://prototype.wikimedia.org/si.wikipedia.org/%E0%B6%B8%E0%B7%94%E0%B6%BD%E0%B7%8A_%E0%B6%B4%E0%B7%92%E0%B6%A7%E0%B7%94%E0%B7%80

1) Select a random page
2) Edit a section
3) Select a word to search for and enter a replacement word
4) Replace
<<An extra character is added>>

Expected Outcome::
There should not be any extra character

Test Environment::
Browser (User-Agent): Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/532.0 (KHTML, like Gecko) Chrome/3.0.195.27 Safari/532.0

Browser (User-Agent): Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)

Browser (User-Agent): Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3


Version: unspecified
Severity: major
Platform: PC

Details

Reference
bz21228

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 10:51 PM
bzimport set Reference to bz21228.

My gut says this is probably due to a bad interaction between regexes and multibyte strings; if that's the case, we can't do much about it.

Basically, what I think is happening is that the [^ ] part of the regex is matching one byte, but the character at that position is really two (or more) bytes long. That one byte is matched and replaced, while the second and any subsequent bytes stick around and are interpreted as a different character. I'll try to confirm this suspicion later.
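
To make the hypothesis above concrete, here is a minimal Python sketch that applies [^ ] to UTF-8 bytes instead of decoded characters. It is not the code under test, which is not shown in this report. The sample word is the first word of the page title in the report's URL; the helper name `replace_first_byte` and the replacement `X` are made up for illustration.

```python
import re

# First word of the page title in the report's URL: U+0DB8 U+0DD4 U+0DBD U+0DCA
word = "\u0db8\u0dd4\u0dbd\u0dca"
raw = word.encode("utf-8")  # 4 code points, 3 bytes each in UTF-8


def replace_first_byte(data: bytes, replacement: bytes) -> bytes:
    """Hypothetical byte-oriented replace: [^ ] consumes exactly one byte."""
    return re.sub(rb"[^ ]", replacement, data, count=1)


damaged = replace_first_byte(raw, b"X")

# The two leftover continuation bytes no longer form a valid character.
try:
    damaged.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False

# The same pattern on a decoded string consumes a whole code point instead.
intact = re.sub(r"[^ ]", "X", word, count=1)
```

If something like this were happening, the orphaned bytes would have to be rendered somehow, which could look like an extra character. Whether the search and replace code ever sees raw bytes is exactly what this hypothesis leaves open.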

The suspicion in comment #1 doesn't seem to be right, so now I think this may have something to do with compound characters. Could you paste all of the text from the PDF (textarea contents before, search regex, replace string, textarea contents after) in a bug comment?
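
As a rough illustration of the compound-character idea (an assumption about what might be going on, not a confirmed diagnosis), the Python sketch below shows that a Sinhala syllable can consist of a base letter plus a combining sign, so a pattern that consumes one code point leaves the sign behind. The `clusters` helper is hypothetical and only approximates grapheme clusters by attaching combining marks to the preceding character.

```python
import re
import unicodedata

# First word of the page title in the report's URL: U+0DB8 U+0DD4 U+0DBD U+0DCA
word = "\u0db8\u0dd4\u0dbd\u0dca"


def clusters(text: str) -> list:
    """Group each combining mark (general category M*) with the character before it."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out


# [^ ] consumes only the base letter; its vowel sign now attaches to "X".
after_regex = re.sub(r"[^ ]", "X", word, count=1)

# Replacing the whole first cluster removes the base letter and its sign together.
parts = clusters(word)
after_cluster = "X" + "".join(parts[1:])
```

Here `after_regex` still contains the vowel sign U+0DD4 right after the `X`, while `after_cluster` does not. A stray sign of that kind is one way a replace could appear to add an extra character.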

The underlying search and replace code is completely different now that we are using an iframe rather than a textarea.

(In reply to comment #3)
> The underlying search and replace code is completely different now that we are
> using an iframe rather than a textarea.

That doesn't necessarily mean that multibyte character handling is magically fixed. Reopening and asking Calcey to try and reproduce again; please close as FIXED or WORKSFORME if this can't be reproduced any more.

I've tested this with double-byte characters quite a bit now, and am sure it's fixed.

Note that Sinhala seems to be using three-byte characters.
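
One way to check this is to decode the page title from the URL in the report and count bytes per character. The Python sketch below is an illustration only. It also shows that the same characters take two bytes each in UTF-16, which is one possible (unconfirmed) reason the two comments count bytes differently.

```python
from urllib.parse import unquote

# Percent-encoded page title copied from the link in the report
encoded_title = (
    "%E0%B6%B8%E0%B7%94%E0%B6%BD%E0%B7%8A_"
    "%E0%B6%B4%E0%B7%92%E0%B6%A7%E0%B7%94%E0%B7%80"
)
title = unquote(encoded_title)  # decodes as UTF-8 by default

# Bytes needed per character in each encoding, skipping the ASCII underscore
utf8_sizes = {len(ch.encode("utf-8")) for ch in title if ch != "_"}
utf16_sizes = {len(ch.encode("utf-16-le")) for ch in title if ch != "_"}

print(utf8_sizes, utf16_sizes)  # prints {3} {2}
```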

wikibugs wrote:

Verified and closed