
GWT created a page that had "�" in the text
Closed, Declined · Public

Description

https://commons.wikimedia.org/w/index.php?oldid=122700114

I'm not sure whether this is because the source material was bad, but I'd appreciate a double-check on this; I'd hate for our servers to be spitting out pages that indicate poor Unicode support.

Event Timeline

MarkTraceur assigned this task to dan-nl.
MarkTraceur raised the priority of this task from to Medium.
MarkTraceur updated the task description. (Show Details)
MarkTraceur subscribed.
Restricted Application added a subscriber: Aklapper.

@MarkTraceur,
It looks like there was a problem with the original metadata that Fæ cleaned up; see this revision comparison. But I'd double-check with Fæ to make sure. Not sure how to add @Fae to this task. Will @Fae do it?

Yeah, this is not a GWT "bug", but it is likely to be a stumbling block for users. The source website (NYPL) had catalogue pages with inconsistent character encoding, probably because they had been created at different times on different original systems before being put online in one database. In some cases the HTML meta tag actually contradicted the encoded content of the page. As a result, even though I had run tests on several hundred descriptions, this still caught me out and was a complex bit of "housekeeping".

I find this a horrid problem. It might be an idea for the manual to recommend tools (like iconv) and provide some examples of how to correct source text files so that they are all "UTF-8 ready" before feeding them into GWT.

The HTML source contains the byte sequence EF BF BD, that is, a proper replacement character; it's not the browser that transforms an invalid byte sequence into a replacement character. It would be easy to set up an AbuseFilter rule to catch edits which add such characters so errors of this kind don't go unnoticed.

It would be interesting to know at what point exactly the original byte sequence is transformed (it's probably libxml in GWT that does it).

As for fixing manually, chardet claims to autodetect the real encoding with some sophistication.
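A hedged sketch of how chardet might be applied before conversion (chardet is a third-party package; the function name is mine, and its guesses and confidence values vary with input length, so short strings can be misdetected):

```python
import chardet

def detect_and_decode(raw: bytes) -> str:
    """Guess the encoding of raw bytes with chardet and decode with the guess."""
    guess = chardet.detect(raw)          # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
    encoding = guess['encoding'] or 'utf-8'  # detection can return None; fall back
    return raw.decode(encoding, errors='replace')
```

For batch jobs it would be worth logging the reported confidence and flagging low-confidence files for manual review rather than trusting the guess blindly.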

In T90887#1072134, @Tgr wrote:

It would be easy to set up an abusefilter rule to catch edits which add such characters so errors of this kind don't go unnoticed.

AbuseFilter ignores wikitext during upload: T89252

Aklapper added a subscriber: dan-nl.

@dan-nl: I am resetting the assignee of this task because there has not been progress lately (please correct me if I am wrong!). Resetting the assignee avoids the impression that somebody is already working on this task. It also allows others to potentially work towards fixing this task. Please claim this task again when you plan to work on it (via Add Action...Assign / Claim in the dropdown menu) - it would be welcome. Thanks for your understanding!