
[Migrated] AWB cannot save letter + combining diacritic when a precomposed Unicode glyph is available
Open, LowPublic

Description

kwami (talk) 08:24, 2 October 2010 (UTC) wrote:

I'm making corrections to 800 IPA transcriptions of Burmese. Some of these involve changing a diacritic. Many of the vowel+diacritic combinations have no precomposed Unicode character, so I wrote the rules as a generic swap of one combining diacritic for another. However, some of the combos do have precomposed glyphs, and when saved, WP converts the combos into those glyphs. AWB cannot save a combo that this happens to. Although it displays the proper corrections in the edit box, when I hit save, it just restarts (in X seconds).

I can copy the edit box and paste it into any of the articles manually (I've done about 70), and that works fine. That gets to be quite tedious, however.

I reported a similar problem some time ago, here, and that was chalked up to a server problem. However, it's been two days now with the Burmese stuff, and it's still happening. It's also always the same subset of articles.


Duplicate: One example replaces a combining under-breve ( ̯ ) with a combining under-tilde ( ̰ ). I made the manual change, cut & pasted from the AWB edit window, here. Note that I copied and pasted from the edit window in AWB, where the diacritics were separate combining glyphs. That is still the case in the saved version in the page history for a̰, but the page history has a precomposed ṵ, which was not produced by AWB. I reverted that change, ran it again, and deleted the under-tildes, and it saved fine here. It also saves fine if I delete only the tilde under the u (http://en.wikipedia.org/w/index.php?title=Thukha&action=historysubmit&diff=388246382&oldid=388246257), but not if I delete only the tilde under the a.

That particular rule is regex under advanced rules, \{\{IPA-my\|([^|}]*)̯ to {{IPA-my|$1̰ (regex, case sensitive, apply 3 times, inside templates, no 'if' conditions). However, the same thing happens with the same diacritic in a 'regular settings' rule that changes ṵ (precomposed u-under.tilde) to ṵ (u + combining diacritic--I'm telling you in case the latter gets saved as the precomposed character when I hit 'save page' on this post, which I believe it will) under the 'normal settings' rules, and also the same thing with i instead of u. No boxes in the regular rules window are checked apart from 'enabled', but same problem as the regex rule.
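
For anyone who wants to try the rule outside AWB, here is a rough Python equivalent of the regex rule described above. It is only a sketch under assumptions: the code points U+032F (under-breve) and U+0330 (under-tilde) are a reading of the characters in this report, swap_diacritic is a made-up name, and Python writes the backreference as a group lookup rather than $1.

```python
import re

BREVE_BELOW = "\u032f"   # assumed code point for the combining under-breve
TILDE_BELOW = "\u0330"   # assumed code point for the combining under-tilde

# Same shape as the AWB rule: everything inside {{IPA-my|...}} up to a breve.
PATTERN = re.compile(r"\{\{IPA-my\|([^|}]*)" + BREVE_BELOW)


def swap_diacritic(text, passes=3):
    """Hypothetical helper: apply the replacement `passes` times, like 'apply 3 times'."""
    for _ in range(passes):
        # The group is greedy, so each pass rewrites only the last
        # remaining breve inside each template.
        text = PATTERN.sub(
            lambda m: "{{IPA-my|" + m.group(1) + TILDE_BELOW, text
        )
    return text


sample = "{{IPA-my|u" + BREVE_BELOW + "a" + BREVE_BELOW + "}}"
result = swap_diacritic(sample)
# Show the non-ASCII code points left in the result.
print([f"U+{ord(c):04X}" for c in result if ord(c) > 0x7F])
```

Because each pass only reaches the last breve in a template, a transcription with several breves needs several passes, which is presumably why the rule is set to apply more than once.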

A different diacritic I manually overrode was here. The problematic part was correcting taʊ̀ɴ to tàuɴ. (That accent probably will be fused to the a when I save this posting, but in the edit window it is a separate combining glyph which I can delete by hitting the backspace key.) It only saves if I delete the grave accent over the a (that is, save to tauɴ, http://en.wikipedia.org/w/index.php?title=Three_Pagodas_Pass&diff=next&oldid=388241313). That's a 'normal settings' rule that finds aʊ̀ and replaces with àu. There is another rule that replaces unaccented ʊ with u in certain environments, and that doesn't cause problems, so it's not the ʊ → u part. Also, when I replaced the combining accent à (which is easier for me to type inside AWB) with precomposed à inside the replace rule, the problem disappeared, as here.

So I figure it's the combining diacritic. I replaced the problematic ṵ in the first problem listed above with ʊ̰ (same diacritic on a letter which has no precomposed Unicode character for it) and it saved just fine, here, just as it did when I cut & pasted the precomposed letter into the AWB edit box: http://en.wikipedia.org/w/index.php?title=Supayalat&diff=prev&oldid=388244621

So it would seem to be specifically (1) trying to save a page with a letter plus combining diacritic sequence, when that combination would normally be converted into a precomposed character when saved in WP, but not (2) when saving the precomposed character itself, or (3) when saving the same diacritic on a letter for which Unicode (or at least WP) does not have a precomposed version.
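
The three cases above can be checked with Unicode normalization. This is a minimal sketch, assuming the conversion WP applies on save is NFC normalization and that the under-tilde is U+0330; nfc_codepoints is a made-up helper name.

```python
import unicodedata

TILDE_BELOW = "\u0330"  # COMBINING TILDE BELOW (assumed to be the under-tilde here)


def nfc_codepoints(text):
    """Return the code points of `text` after NFC normalization."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFC", text)]


cases = {
    "(1) u + combining tilde": "u" + TILDE_BELOW,
    "(2) precomposed u with tilde below": "\u1e75",
    "(3) upsilon + combining tilde": "\u028a" + TILDE_BELOW,
}
for label, text in cases.items():
    print(label, nfc_codepoints(text))
```

Only case (1) changes under normalization: the two-character sequence collapses to the single character U+1E75, while cases (2) and (3) come out exactly as they went in.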

OS: Win7
.NET: 2.0.50727.4952
Version: 5.0.3.0
Workaround: Create separate rules for every precomposed letter-diacritic combination, or cut and paste from the edit window
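
To find out which letters need their own precomposed rule for the workaround above, something like the following could be used. It is a sketch only: letters_with_precomposed is a hypothetical helper, and it assumes that "has a precomposed version" means the letter plus diacritic composes to a single character under NFC.

```python
import string
import unicodedata


def letters_with_precomposed(diacritic, letters=string.ascii_letters):
    """Hypothetical helper: map each letter to its precomposed form, if one exists."""
    found = {}
    for letter in letters:
        composed = unicodedata.normalize("NFC", letter + diacritic)
        if len(composed) == 1:  # the sequence collapsed into one character
            found[letter] = composed
    return found


# U+0330 is assumed to be the combining under-tilde from this report.
for letter, composed in letters_with_precomposed("\u0330").items():
    print(letter, f"U+{ord(composed):04X}")
```

Letters missing from the result (such as a with the under-tilde) can stay in the generic combining-diacritic rule.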

Event Timeline

Josve05a raised the priority of this task from to Needs Triage.
Josve05a updated the task description.
Josve05a added a project: AutoWikiBrowser.
Josve05a subscribed.

@Rjwilmsi 12:58, 2 October 2010 (UTC) wrote:

Does this sandbox diff correctly represent the type of edit you want to be making with AWB (text from Thukha article)?

kwami (talk) 19:15, 2 October 2010 (UTC) wrote:

Yes, that's exactly right. I can get it to work if AWB converts u̯ directly into ṵ, but not if it converts into u+ ̰

@Rjwilmsi 21:52, 2 October 2010 (UTC) wrote:

I was able to reproduce this once, but now can't get it to happen again. If I have understood your conclusion, it's that the diacritic needs to be combined with the letter into the precomposed Unicode character for AWB to save. Therefore I suggest you try selecting the text and right-clicking to choose the 'Unicodify selected' option. Does it then allow you to save?

kwami (talk) 22:04, 2 October 2010 (UTC) wrote:

It was already selected, but it made no difference: I can't save either way. (I assume that unicodifies input from existing sequences on the page, not output from AWB.)

Yes, that was my conclusion. I think it might only be sequences that WP unicodifies upon saving, and that there's a conflict between what AWB and WP are trying to do.

My AWB options: apply gen fixes, unicodify, settings enabled (as above).

kwami (talk) 08:05, 3 October 2010 (UTC) wrote:

Happening again with a different set of rules here. In this case, I've got 1200 preparsed pages in the list, though most won't have this problem.

@Magioladitis 16:41, 25 January 2014 (UTC) wrote:

kwami: We resolved this by skipping pages with characters in PUA. Am I right or wrong?

kwami (talk) 19:20, 25 January 2014 (UTC) wrote:

@Magioladitis: No, this has nothing to do with PUA. It's a matter of replacing combining diacritics with precomposed glyphs within the defined Unicode range.

kwami (talk) 19:48, 25 January 2014 (UTC) wrote:

Just ran a test, and it's still a problem. I simplified it to replacing u with $1̰ in my sandbox. If I run it on a instead of u, it saves properly. (There is no precomposed glyph with a.) If I replace u but copy the result from the AWB edit window and paste it in manually, that also saves. But otherwise AWB skips the article with the log error "MD5 hash error: The page you are editing may contain an unsupported or invalid Unicode character". That's clearly not true: The only thing in my sandbox was a u a u, and AWB will save normally when the diacritic is on a. (Here's where it saved when I manually deleted the diacritics from u but left them on a.) So all three characters are valid Unicode. What appears to be happening is that there's confusion with the precomposed Unicode glyph U+1E75 "ṵ". (Here's where I manually pasted the AWB edit-window result into the article, and that character is what the u+tilde saved as.)

So this is what I conclude the problem is: If there is a precomposed Unicode glyph C that is equivalent to letter A plus combining diacritic B, then AWB will skip an article rather than saving the sequence AB. WP has no problem saving an article with AB in it, but in so doing it replaces it with C. I suspect that there is a conflict between AWB and that feature in WP.
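
The "MD5 hash error" wording hints at one possible mechanism, though nothing in this thread confirms it: if one side hashes the text as typed and the other hashes it after normalization, the two hashes differ exactly when normalization changes the text. The sketch below only illustrates that idea; hashes_agree is a hypothetical check, not AWB's or MediaWiki's actual code, and NFC is assumed as the normalization form.

```python
import hashlib
import unicodedata


def md5_hex(text):
    """MD5 of the UTF-8 bytes of `text`."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def hashes_agree(text):
    """Hypothetical check: does the hash of the raw text match the hash after NFC?"""
    return md5_hex(text) == md5_hex(unicodedata.normalize("NFC", text))


print(hashes_agree("u\u0330"))  # AB sequence that has a precomposed equivalent C
print(hashes_agree("a\u0330"))  # sequence with no precomposed equivalent
print(hashes_agree("\u1e75"))   # the precomposed character C itself
```

Under this assumption only the first case would produce a mismatch, which lines up with the pattern reported here: u plus the combining tilde is skipped, while a plus the same diacritic and the precomposed character both save.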